model.denoising.optimizer package

Submodules

model.denoising.optimizer.clip_grads module

Gradient clipping.

model.denoising.optimizer.clip_grads.clip_grad_norm_fp32(parameters, max_norm, norm_type=2)
Clips the gradient norm of an iterable of parameters whose gradients are in fp32.

This is adapted from torch.nn.utils.clip_grad.clip_grad_norm_, with added functionality to handle model-parallel parameters. Note that the gradients are modified in place.

Parameters:
  • parameters (Iterable[Tensor] or Tensor) – an iterable of Tensors or a single Tensor that will have gradients normalized

  • max_norm (float or int) – max norm of the gradients

  • norm_type (float or int) – type of the used p-norm. Can be 'inf' for infinity norm.

Returns:

Total norm of the parameters (viewed as a single vector).
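
As a rough illustration of the computation described above, the following sketch clips a set of gradient vectors by their global p-norm. It works on plain Python lists rather than tensors and ignores model parallelism entirely. The helper name clip_by_global_norm and the eps value are assumptions made for this example, not part of this package.

```python
import math

def clip_by_global_norm(grads, max_norm, norm_type=2, eps=1e-6):
    """Scale `grads` (a list of lists of floats) in place so that their
    global norm does not exceed `max_norm`; return the pre-clip norm."""
    values = [g for grad in grads for g in grad]
    if norm_type == math.inf:
        # Infinity norm: largest absolute gradient entry.
        total_norm = max((abs(v) for v in values), default=0.0)
    else:
        # p-norm of all gradients viewed as a single vector.
        total_norm = sum(abs(v) ** norm_type for v in values) ** (1.0 / norm_type)
    clip_coeff = max_norm / (total_norm + eps)  # eps is an assumed guard value
    if clip_coeff < 1.0:
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= clip_coeff  # modified in place, as the text notes
    return total_norm

grads = [[3.0, 0.0], [4.0]]
total_norm = clip_by_global_norm(grads, max_norm=1.0)
```

The returned value is the norm before clipping, which matches the "Returns" description above; the gradients themselves are rescaled only when that norm exceeds max_norm.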

model.denoising.optimizer.clip_grads.count_zeros_fp32(parameters)

model.denoising.optimizer.grad_scaler module

Megatron grad scaler.

class model.denoising.optimizer.grad_scaler.ConstantGradScaler(initial_scale)

Bases: MegatronGradScaler

load_state_dict(state_dict)
state_dict()
update(found_inf)
class model.denoising.optimizer.grad_scaler.DynamicGradScaler(initial_scale, min_scale, growth_factor, backoff_factor, growth_interval, hysteresis)

Bases: MegatronGradScaler

load_state_dict(state_dict)
state_dict()
update(found_inf)
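
The listing above gives only the constructor arguments, so the update rule below is an assumption modeled on common dynamic loss-scaling schemes, not a description of this class's actual behavior. The class name DynamicScalerSketch and its private counters are hypothetical. The sketch backs the scale off after hysteresis overflowing steps and grows it after growth_interval consecutive clean steps.

```python
class DynamicScalerSketch:
    """Hypothetical stand-in for a dynamic loss scaler; not the package's class."""

    def __init__(self, initial_scale, min_scale, growth_factor,
                 backoff_factor, growth_interval, hysteresis):
        self.scale = float(initial_scale)
        self.min_scale = float(min_scale)
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.hysteresis = hysteresis
        self._good_steps = 0                # consecutive steps without inf/nan
        self._hysteresis_left = hysteresis  # overflows tolerated before backoff

    def update(self, found_inf):
        if found_inf:
            # An overflow resets the growth counter and uses up hysteresis.
            self._good_steps = 0
            self._hysteresis_left -= 1
            if self._hysteresis_left <= 0:
                self.scale = max(self.scale * self.backoff_factor, self.min_scale)
        else:
            self._good_steps += 1
            if self._good_steps == self.growth_interval:
                # Enough clean steps: grow the scale and restore hysteresis.
                self._good_steps = 0
                self._hysteresis_left = self.hysteresis
                self.scale *= self.growth_factor

scaler = DynamicScalerSketch(initial_scale=1024.0, min_scale=1.0, growth_factor=2.0,
                             backoff_factor=0.5, growth_interval=2, hysteresis=2)
scaler.update(found_inf=True)    # first overflow is tolerated
scale_after_one = scaler.scale
scaler.update(found_inf=True)    # second overflow triggers backoff
scale_after_two = scaler.scale
scaler.update(found_inf=False)
scaler.update(found_inf=False)   # two clean steps reach growth_interval
scale_after_growth = scaler.scale
```
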
class model.denoising.optimizer.grad_scaler.MegatronGradScaler(initial_scale)

Bases: ABC

property inv_scale
abstract load_state_dict(state_dict)
property scale
abstract state_dict()
abstract update(found_inf)
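
The interface listed above can be sketched with Python's abc module. This is a minimal illustration under stated assumptions: the class names are hypothetical, and treating inv_scale as the reciprocal of scale is inferred from the property name rather than confirmed by the source.

```python
from abc import ABC, abstractmethod

class GradScalerSketch(ABC):
    """Hypothetical base class mirroring the members listed above."""

    def __init__(self, initial_scale):
        if initial_scale <= 0.0:
            raise ValueError("initial_scale must be positive")
        self._scale = float(initial_scale)

    @property
    def scale(self):
        return self._scale

    @property
    def inv_scale(self):
        # Assumption: inv_scale is simply 1 / scale.
        return 1.0 / self._scale

    @abstractmethod
    def update(self, found_inf):
        """Adjust the scale after a step; found_inf flags an overflow."""

    @abstractmethod
    def state_dict(self):
        """Return the state needed to restore this scaler."""

    @abstractmethod
    def load_state_dict(self, state_dict):
        """Restore state produced by state_dict()."""

class ConstantScalerSketch(GradScalerSketch):
    """A scaler whose scale never changes, so it carries no extra state."""

    def update(self, found_inf):
        pass

    def state_dict(self):
        return {}

    def load_state_dict(self, state_dict):
        pass

constant = ConstantScalerSketch(4.0)
constant.update(found_inf=True)  # has no effect on a constant scaler
```

Because the three methods are abstract, instantiating GradScalerSketch directly raises TypeError; only concrete subclasses can be created.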

model.denoising.optimizer.optimizer module

Megatron optimizer.

class model.denoising.optimizer.optimizer.FP32Optimizer(optimizer, clip_grad, log_num_zeros_in_grad, params_have_main_grad)

Bases: MegatronOptimizer

get_loss_scale()

FP32 optimizer does not do any scaling.

load_state_dict(state_dict)
reload_model_params()

Refreshes any internal state from the current model parameters. Call this whenever the parameters are changed outside of the optimizer. For example, when a model is loaded from a checkpoint without loading the optimizer, the model parameters are updated; for an fp16 optimizer with main parameters, the main parameters must then be updated as well.

state_dict()
step()

Clips gradients (if needed) and steps the base optimizer. Always reports success, since no overflow can occur.

zero_grad(set_to_none=True)

Copied from torch.optim.optimizer

class model.denoising.optimizer.optimizer.Float16OptimizerWithFloat16Params(optimizer, clip_grad, log_num_zeros_in_grad, params_have_main_grad, bf16, grad_scaler)

Bases: MegatronOptimizer

Float16 optimizer for fp16 and bf16 data types.

Parameters:
  • optimizer – base optimizer such as Adam or SGD

  • clip_grad – clip gradients with this global L2 norm. Note that clipping is ignored if clip_grad == 0

  • log_num_zeros_in_grad – return number of zeros in the gradients.

  • params_have_main_grad – flag indicating whether parameters have a main_grad field. If this is set, we assume that the gradients of the model parameters are stored in the main_grad field instead of the typical grad field. This happens in the DDP cases where a contiguous buffer holds the gradients. For example, for bfloat16 we want to do gradient accumulation and all-reduces in float32, and as a result we store those gradients in main_grad. Note that main_grad is not necessarily in float32.

  • bf16 – if true, the model is running in bfloat16.

  • grad_scaler – used for scaling gradients. Note that this can be None, which happens when bf16 = True and no loss scale is used. For bf16 = True, a constant gradient scaler is also allowed. For bf16 = False, a grad scaler is always required.
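
The constraints on bf16 and grad_scaler described above can be sketched as a small check. The function name effective_loss_scale is hypothetical, and the fallback scale of 1.0 when no scaler is given is an assumption made for illustration; the source only states that grad_scaler may be None in that case.

```python
from types import SimpleNamespace

def effective_loss_scale(bf16, grad_scaler):
    """Return the loss scale implied by the (bf16, grad_scaler) pair."""
    if grad_scaler is None:
        if not bf16:
            # Stated in the text: fp16 (bf16 = False) always needs a scaler.
            raise ValueError("fp16 training requires a grad scaler")
        # Assumption: with no scaler, the loss is left unscaled.
        return 1.0
    return grad_scaler.scale

constant_scaler = SimpleNamespace(scale=128.0)  # stand-in for a constant scaler
bf16_unscaled = effective_loss_scale(bf16=True, grad_scaler=None)
bf16_constant = effective_loss_scale(bf16=True, grad_scaler=constant_scaler)
```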

get_loss_scale()

The output should be a cuda tensor of size 1.

load_state_dict(state_dict)
reload_model_params()

Refreshes any internal state from the current model parameters. Call this whenever the parameters are changed outside of the optimizer. For example, when a model is loaded from a checkpoint without loading the optimizer, the model parameters are updated; for an fp16 optimizer with main parameters, the main parameters must then be updated as well.

state_dict()
step()
zero_grad(set_to_none=True)

We only need to zero the model related parameters, i.e., float16_groups & fp32_from_fp32_groups. We additionally zero fp32_from_float16_groups as a memory optimization to reduce fragmentation; in the case of set_to_none==True, the space used by this field can be safely deallocated at this point.
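
The difference between the two set_to_none modes can be sketched as follows. This is a simplified illustration using plain Python objects; Param and zero_grad_group are hypothetical names, and real parameter groups would hold tensors rather than lists.

```python
class Param:
    """Hypothetical stand-in for a parameter with a `.grad` buffer."""

    def __init__(self, grad):
        self.grad = grad

def zero_grad_group(group, set_to_none=True):
    """Clear the gradients of every parameter in `group`."""
    for param in group:
        if param.grad is None:
            continue
        if set_to_none:
            # Drop the reference so the buffer's memory can be reclaimed.
            param.grad = None
        else:
            # Keep the buffer allocated and overwrite it with zeros in place.
            for i in range(len(param.grad)):
                param.grad[i] = 0.0

released = [Param([1.0, 2.0]), Param(None)]
zero_grad_group(released, set_to_none=True)

buffer = [3.0, 4.0]
kept = [Param(buffer)]
zero_grad_group(kept, set_to_none=False)
```

With set_to_none=True the gradient storage is released, which is what allows the space to be deallocated as the note above describes; with set_to_none=False the same buffer is reused on the next backward pass.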

class model.denoising.optimizer.optimizer.MegatronOptimizer(optimizer, clip_grad, log_num_zeros_in_grad, params_have_main_grad)

Bases: ABC

clip_grad_norm(clip_grad)
count_zeros()
abstract get_loss_scale()

The output should be a cuda tensor of size 1.

get_parameters()
abstract load_state_dict(state_dict)
property param_groups
abstract reload_model_params()

Refreshes any internal state from the current model parameters. Call this whenever the parameters are changed outside of the optimizer. For example, when a model is loaded from a checkpoint without loading the optimizer, the model parameters are updated; for an fp16 optimizer with main parameters, the main parameters must then be updated as well.

scale_loss(loss)

Simple scaling.

property state
abstract state_dict()
abstract step()
abstract zero_grad(set_to_none=True)
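
To make the "simple scaling" performed by scale_loss concrete, the sketch below multiplies a loss by the loss scale and later removes the same factor from the gradients. It uses plain floats, and both function bodies are assumptions about what such scaling typically looks like rather than this class's code; unscale_grads is a hypothetical helper.

```python
def scale_loss(loss, loss_scale):
    """Multiply the loss by the current loss scale."""
    return loss_scale * loss

def unscale_grads(grads, loss_scale):
    """Undo the scaling on gradients computed from a scaled loss."""
    inv_scale = 1.0 / loss_scale
    return [g * inv_scale for g in grads]

# For loss = w**2 the gradient is 2*w; scaling the loss by S scales it to S*2*w.
w, loss_scale = 3.0, 1024.0
scaled_loss = scale_loss(w ** 2, loss_scale)
scaled_grad = loss_scale * 2 * w          # gradient of the scaled loss
grads = unscale_grads([scaled_grad], loss_scale)
```

Scaling the loss keeps small fp16 gradients from underflowing during the backward pass, and dividing by the same factor before the optimizer step restores their true magnitude.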

Module contents

model.denoising.optimizer.get_megatron_optimizer(model)