model.denoising.model package
Submodules
model.denoising.model.distributed module
- class model.denoising.model.distributed.DistributedDataParallel(*args: Any, **kwargs: Any)
Bases: DistributedDataParallelBase
DDP with contiguous buffer options to store and accumulate gradients. This class:
has the potential to reduce memory fragmentation.
provides the option to do the gradient accumulation in a type other than the params type (for example fp32)
- Parameters:
module – input model.
accumulate_allreduce_grads_in_fp32 – if true, do the gradient accumulation and the gradient all-reduce in float32. If this option is true, use_contiguous_buffers must be true too.
use_contiguous_buffers – if true, use a contiguous buffer to store the gradients.
- allreduce_gradients()
Reduce gradients across data parallel ranks.
- zero_grad_buffer()
Set the grad buffer data to zero. Needs to be called at the beginning of each iteration.
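The contiguous-buffer idea described for this class can be illustrated without any deep-learning framework. The following is a minimal sketch, not the class's actual implementation: it assumes one flat float32 buffer and hands each parameter a view into it, so accumulating into a view writes into the shared buffer. The names `GradBuffer`, `view_for`, `zero`, and `accumulate` are hypothetical.

```python
from array import array


class GradBuffer:
    """Hypothetical flat float32 buffer that backs every parameter's gradient."""

    def __init__(self, param_sizes):
        # One contiguous allocation for all gradients ('f' = 32-bit float).
        self.data = array("f", [0.0] * sum(param_sizes))
        self._mem = memoryview(self.data)
        # Record each parameter's [start, end) slice of the buffer.
        self.offsets = []
        start = 0
        for size in param_sizes:
            self.offsets.append((start, start + size))
            start += size

    def view_for(self, index):
        # A memoryview slice shares storage with the buffer (no copy).
        start, end = self.offsets[index]
        return self._mem[start:end]

    def zero(self):
        # Analogous to zero_grad_buffer(): reset before each iteration.
        for i in range(len(self.data)):
            self.data[i] = 0.0


def accumulate(view, grad):
    # Add a micro-batch gradient into the parameter's slice of the buffer.
    for i, g in enumerate(grad):
        view[i] += g


buf = GradBuffer([2, 3])                      # two parameters: 2 and 3 elements
accumulate(buf.view_for(0), [1.0, 2.0])
accumulate(buf.view_for(1), [0.5, 0.5, 0.5])
accumulate(buf.view_for(1), [0.5, 0.5, 0.5])  # second micro-batch
flat = list(buf.data)                         # [1.0, 2.0, 1.0, 1.0, 1.0]
```

Because all gradients live in one allocation, an all-reduce could operate on the whole buffer at once; that step is omitted here.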
- class model.denoising.model.distributed.DistributedDataParallelBase(*args: Any, **kwargs: Any)
Bases: MegatronModule, ABC
Abstract class for DDP.
- abstract allreduce_gradients()
- forward(*inputs, **kwargs)
- load_state_dict(state_dict, strict=True)
- state_dict(destination=None, prefix='', keep_vars=False)
- state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)
Use this function to override the state dict for saving checkpoints.
model.denoising.model.enums module
- class model.denoising.model.enums.AttnMaskType(value)
Bases: Enum
An enumeration.
- causal = 2
- padding = 1
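For reference, the two members listed above correspond to an enumeration like the one below. The `describe` helper is hypothetical and only illustrates how calling code might branch on the mask type.

```python
import enum


class AttnMaskType(enum.Enum):
    padding = 1
    causal = 2


def describe(mask_type):
    # Hypothetical helper, for illustration only.
    if mask_type is AttnMaskType.causal:
        return "each position attends only to itself and earlier positions"
    return "padded positions are masked out"


mask_type = AttnMaskType(2)   # look a member up by value
```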
model.denoising.model.fused_bias_gelu module
- class model.denoising.model.fused_bias_gelu.GeLUFunction(*args: Any, **kwargs: Any)
Bases: Function
- static backward(ctx, grad_output)
- static forward(ctx, input, bias)
- model.denoising.model.fused_bias_gelu.bias_gelu(bias, y)
- model.denoising.model.fused_bias_gelu.bias_gelu_back(g, bias, y)
model.denoising.model.fused_layer_norm module
- This code is copied from NVIDIA apex with some changes.
model.denoising.model.fused_softmax module
- class model.denoising.model.fused_softmax.FusedScaleMaskSoftmax(*args: Any, **kwargs: Any)
Bases: Module
Fused operation: scaling + mask + softmax.
- Parameters:
input_in_fp16 – flag to indicate if input is in fp16 data format.
attn_mask_type – attention mask type (pad or causal).
mask_func – mask function to be applied.
softmax_in_fp32 – if true, softmax is performed at fp32 precision.
scale – scaling factor used in input tensor scaling.
- forward(input, mask)
- class model.denoising.model.fused_softmax.ScaledMaskedSoftmax(*args: Any, **kwargs: Any)
Bases: Function
Fused operation which performs the following three operations in sequence:
1. Scale the tensor.
2. Apply the mask.
3. Perform softmax.
- static backward(ctx, output_grads)
- static forward(ctx, inputs, mask, scale)
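The scale, mask, softmax sequence can be written out unfused for a single row of attention scores. This is a plain-Python sketch for illustration, not the fused kernel: the convention that `True` in the mask means "ignore this position" and the fill value of -10000.0 are assumptions.

```python
import math


def scaled_masked_softmax(scores, mask, scale):
    """Scale, mask, then softmax over one row of attention scores.

    mask[i] is True where position i must be ignored (an assumption of
    this sketch; the fused kernel's mask convention is not stated here).
    """
    scaled = [s * scale for s in scores]
    # Masked positions get a large negative value so exp() sends them to ~0.
    masked = [-10000.0 if m else s for s, m in zip(scaled, mask)]
    # Subtract the max for numerical stability before exponentiating.
    peak = max(masked)
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]


probs = scaled_masked_softmax([2.0, 2.0, 8.0], [False, False, True], scale=0.5)
```

The two unmasked positions have equal scaled scores, so they share the probability mass equally, and the masked position receives effectively none.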
- class model.denoising.model.fused_softmax.ScaledUpperTriangMaskedSoftmax(*args: Any, **kwargs: Any)
Bases: Function
Fused operation which performs the following three operations in sequence:
1. Scale the tensor.
2. Apply upper triangular mask (typically used in GPT models).
3. Perform softmax.
- static backward(ctx, output_grads)
- static forward(ctx, inputs, scale)
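The upper-triangular variant needs no explicit mask argument because the mask follows from the positions. A minimal unfused sketch, assuming `scores[i][j]` is the score of query `i` against key `j` and that entries above the diagonal are the ones removed:

```python
import math


def scaled_causal_softmax(scores, scale):
    """Apply scale, an upper-triangular mask, and softmax to a square matrix."""
    out = []
    for i, row in enumerate(scores):
        # Keep keys j <= i; entries above the diagonal are masked out.
        kept = [s * scale for s in row[: i + 1]]
        peak = max(kept)
        exps = [math.exp(v - peak) for v in kept]
        total = sum(exps)
        # Masked entries receive probability exactly 0 in this sketch.
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out


attn = scaled_causal_softmax([[1.0, 9.0], [3.0, 3.0]], scale=1.0)
```

The first query can only see the first key, so its row is `[1.0, 0.0]` regardless of the large score at position 1.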
model.denoising.model.module module
Megatron Module
- class model.denoising.model.module.Float16Module(*args: Any, **kwargs: Any)
Bases: MegatronModule
- forward(*inputs, **kwargs)
- load_state_dict(state_dict, strict=True)
- state_dict(destination=None, prefix='', keep_vars=False)
- state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)
Use this function to override the state dict for saving checkpoints.
- class model.denoising.model.module.MegatronModule(*args: Any, **kwargs: Any)
Bases: Module
Megatron specific extensions of torch Module with support for pipelining.
- initialize_word_embeddings(init_method_normal)
- state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)
Use this function to override the state dict for saving checkpoints.
- word_embeddings_weight()
- model.denoising.model.module.conversion_helper(val, conversion)
Apply conversion to val. Recursively apply conversion if val is a nested tuple/list structure.
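One plausible reading of this description is sketched below. It is not necessarily the module's code, and the choice to preserve the container type (tuple stays tuple, list stays list) is an assumption.

```python
def conversion_helper(val, conversion):
    """Apply `conversion` to every leaf of a nested tuple/list structure."""
    if not isinstance(val, (tuple, list)):
        return conversion(val)
    converted = [conversion_helper(v, conversion) for v in val]
    # Preserve the container type of the input.
    return tuple(converted) if isinstance(val, tuple) else converted


nested = (1, [2, (3, 4)], 5)
result = conversion_helper(nested, float)
```

In the module, the conversion would presumably be a dtype cast such as the one performed by float16_to_fp32; `float` stands in for it here.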
- model.denoising.model.module.float16_to_fp32(val)
Convert fp16/bf16 val to fp32
- model.denoising.model.module.fp32_to_float16(val, float16_convertor)
Convert fp32 val to fp16/bf16
model.denoising.model.transformer module
Transformer.
- class model.denoising.model.transformer.ParallelAttention(*args: Any, **kwargs: Any)
Bases: MegatronModule
Parallel self-attention layer abstract class.
Self-attention layer takes input with size [b, s, h] and returns output of the same size.
- forward(hidden_states, attention_mask, layer_past=None, get_key_value=False, encoder_output=None, get_atten_value=False)
- class model.denoising.model.transformer.ParallelMLP(*args: Any, **kwargs: Any)
Bases: MegatronModule
MLP.
MLP will take the input with h hidden state, project it to 4*h hidden dimension, perform nonlinear transformation, and project the state back into h hidden dimension. At the end, dropout is also applied.
- forward(hidden_states)
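The h to 4*h to h data flow described for the MLP can be sketched for a single vector as follows. This is a non-parallel illustration under stated assumptions: biases are omitted, the nonlinearity is taken to be an erf-based GeLU, and the helper names `linear`, `gelu`, and `mlp_forward` are hypothetical.

```python
import math
import random


def linear(x, weight):
    # weight has shape [out_features][in_features]; bias omitted for brevity.
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def gelu(x):
    # Exact (erf-based) GeLU; the real layer may use a different activation.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mlp_forward(x, w_in, w_out, dropout_prob=0.0, rng=random):
    hidden = [gelu(v) for v in linear(x, w_in)]   # h -> 4h, then nonlinearity
    out = linear(hidden, w_out)                    # 4h -> h
    if dropout_prob > 0.0:
        # Inverted dropout: zero some outputs and rescale the survivors.
        keep = 1.0 - dropout_prob
        out = [v / keep if rng.random() < keep else 0.0 for v in out]
    return out


h = 3
rng = random.Random(0)
w_in = [[rng.gauss(0.0, 0.02) for _ in range(h)] for _ in range(4 * h)]
w_out = [[rng.gauss(0.0, 0.02) for _ in range(4 * h)] for _ in range(h)]
y = mlp_forward([1.0, -2.0, 0.5], w_in, w_out)
```

The output has the same length as the input, matching the description that the state is projected back into the h hidden dimension.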
- class model.denoising.model.transformer.ParallelTransformer(*args: Any, **kwargs: Any)
Bases: MegatronModule
Transformer class.
- forward(hidden_states, attention_mask, layer_past=None, get_key_value=False, encoder_output=None, enc_dec_attn_mask=None)
- set_input_tensor(input_tensor)
Set input tensor to be used instead of forward()’s input.
When doing pipeline parallelism, the input from the previous stage comes from communication, not from the input, so the model’s forward_step_func won’t have it. This function is thus used by internal code to bypass the input provided by the forward_step_func.
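The bypass described here can be pictured with a small stand-in class. This is a sketch of the pattern only: `Stage` and its `layer` callable are hypothetical, and the real class may handle the stored tensor differently.

```python
class Stage:
    """Hypothetical pipeline stage; `layer` is any callable."""

    def __init__(self, layer):
        self.layer = layer
        self.input_tensor = None

    def set_input_tensor(self, input_tensor):
        # Called by the pipeline schedule with activations received
        # from the previous stage.
        self.input_tensor = input_tensor

    def forward(self, hidden_states):
        # Prefer the communicated tensor over the argument when one is set.
        if self.input_tensor is not None:
            hidden_states = self.input_tensor
        return self.layer(hidden_states)


stage = Stage(lambda x: [v * 2 for v in x])
first = stage.forward([1, 2])        # no tensor set: uses the argument
stage.set_input_tensor([10, 20])
second = stage.forward([1, 2])       # argument is bypassed
```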
- class model.denoising.model.transformer.ParallelTransformerLayer(*args: Any, **kwargs: Any)
Bases: MegatronModule
A single transformer layer.
Transformer layer takes input with size [b, s, h] and returns an output of the same size.
- forward(hidden_states, attention_mask, encoder_output=None, enc_dec_attn_mask=None, layer_past=None, get_key_value=False, get_atten_value=False)
- class model.denoising.model.transformer.SwiGLU(*args: Any, **kwargs: Any)
Bases: Module
- forward(x)
- model.denoising.model.transformer.bias_dropout_add(x: Tensor, bias: Tensor, residual: Tensor, prob: float, training: bool) -> Tensor
- model.denoising.model.transformer.bias_dropout_add_fused_inference(x: Tensor, bias: Tensor, residual: Tensor, prob: float) -> Tensor
- model.denoising.model.transformer.bias_dropout_add_fused_train(x: Tensor, bias: Tensor, residual: Tensor, prob: float) -> Tensor
- model.denoising.model.transformer.get_bias_dropout_add(training)
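The signatures above suggest the operation residual + dropout(x + bias), although the text does not state it. The sketch below works on flat Python lists under that assumption and uses inverted dropout. The closure returned by `get_bias_dropout_add` is a guess at how the train and inference variants might be selected.

```python
import random


def bias_dropout_add(x, bias, residual, prob, training, rng=random):
    """Return residual + dropout(x + bias), elementwise over flat lists."""
    summed = [xi + bi for xi, bi in zip(x, bias)]
    if training and prob > 0.0:
        keep = 1.0 - prob
        # Inverted dropout: drop with probability `prob`, rescale the rest.
        summed = [s / keep if rng.random() < keep else 0.0 for s in summed]
    return [r + s for r, s in zip(residual, summed)]


def get_bias_dropout_add(training):
    # Return a closure with the mode fixed (assumed behaviour).
    def _fn(x, bias, residual, prob):
        return bias_dropout_add(x, bias, residual, prob, training)
    return _fn


out = get_bias_dropout_add(False)([1.0, 2.0], [0.5, 0.5], [10.0, 20.0], 0.1)
```

In inference mode dropout is a no-op, so the result is simply the elementwise sum of the three inputs.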
model.denoising.model.utils module
Utilities for models.
- model.denoising.model.utils.attention_mask_func(attention_scores, attention_mask)
- model.denoising.model.utils.erf_gelu(x)
- model.denoising.model.utils.gelu_impl(x)
OpenAI’s gelu implementation.
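For orientation, the two GeLU formulations usually meant by these names are the tanh approximation and the exact erf form. The sketch below assumes that `gelu_impl` and `openai_gelu` refer to the former and `erf_gelu` to the latter; the function names `gelu_tanh` and `gelu_erf` are hypothetical.

```python
import math


def gelu_tanh(x):
    # Tanh approximation of GeLU (assumed to be what "OpenAI's gelu" means).
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_erf(x):
    # Exact GeLU written with the error function.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Largest absolute difference between the two forms on a grid over [-5, 5].
max_gap = max(abs(gelu_tanh(v / 10) - gelu_erf(v / 10)) for v in range(-50, 51))
```

The two forms agree closely over this range, which is why the approximation is often used where a tanh is cheaper than an erf.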
- model.denoising.model.utils.get_linear_layer(rows, columns, init_method)
Simple linear layer with weight initialization.
- model.denoising.model.utils.init_method_normal(sigma)
Init method based on N(0, sigma).
- model.denoising.model.utils.openai_gelu(x)
- model.denoising.model.utils.scaled_init_method_normal(sigma, num_layers)
Init method based on N(0, sigma/sqrt(2*num_layers)).
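The scaled standard deviation is easy to check by hand. The sketch below treats sigma as a standard deviation (an assumption about the N(0, sigma) notation) and fills plain Python lists rather than tensors. The optional `rng` argument is hypothetical and exists only to make the example reproducible.

```python
import math
import random


def init_method_normal(sigma, rng=random):
    # Returns a function that fills a list with samples from N(0, sigma).
    def init_(weights):
        for i in range(len(weights)):
            weights[i] = rng.gauss(0.0, sigma)
        return weights
    return init_


def scaled_init_method_normal(sigma, num_layers, rng=random):
    # Same idea with the standard deviation shrunk by sqrt(2 * num_layers).
    std = sigma / math.sqrt(2.0 * num_layers)
    return init_method_normal(std, rng)


scaled_std = 0.02 / math.sqrt(2.0 * 8)    # 0.005 for sigma=0.02, num_layers=8
weights = scaled_init_method_normal(0.02, 8, random.Random(0))([0.0] * 1000)
```

With sigma = 0.02 and 8 layers, the divisor is sqrt(16) = 4, so the scaled initializer draws from a distribution four times narrower than the unscaled one.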
model.denoising.model.waveform_model module
Transformer based language model.
- class model.denoising.model.waveform_model.Embedding(*args: Any, **kwargs: Any)
Bases: MegatronModule
Language model embeddings.
- Parameters:
hidden_size – hidden size
vocab_size – vocabulary size
max_sequence_length – maximum size of sequence. This is used for positional embedding
embedding_dropout_prob – dropout probability for embeddings
init_method – weight initialization method
num_tokentypes – size of the token-type embeddings. 0 value will ignore this embedding
- forward(input_ids, position_ids, tokentype_ids=None)
- load_state_dict(state_dict, strict=True)
Customized load.
- state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)
For easy load.
- class model.denoising.model.waveform_model.Pooler(*args: Any, **kwargs: Any)
Bases: MegatronModule
Pooler layer.
Pool hidden states of a specific token (for example start of the sequence) and add a linear transformation followed by a tanh.
- Parameters:
hidden_size – hidden size
init_method – weight initialization method for the linear layer. bias is set to zero.
- forward(hidden_states, sequence_index=0)
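The pooling step described for this class (select one token's hidden state, apply a linear layer, then tanh) can be sketched for a single sample as follows. The function name `pooler_forward` and the list-based shapes are illustrative assumptions, not the module's code.

```python
import math


def pooler_forward(hidden_states, weight, bias, sequence_index=0):
    """hidden_states: [s][h] for one sample; weight: [h][h]; bias: [h]."""
    # Select the hidden state of a single token (by default the first).
    pooled = hidden_states[sequence_index]
    # Linear transformation followed by tanh.
    return [
        math.tanh(sum(w * v for w, v in zip(row, pooled)) + b)
        for row, b in zip(weight, bias)
    ]


hidden = [[0.0, 0.0], [1.0, -1.0]]     # two tokens, hidden size 2
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]                 # the bias is initialized to zero
first = pooler_forward(hidden, identity, zero_bias)
second = pooler_forward(hidden, identity, zero_bias, sequence_index=1)
```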
- class model.denoising.model.waveform_model.TransformerWaveformModel(*args: Any, **kwargs: Any)
Bases: MegatronModule
Transformer language model.
- Parameters:
transformer_hparams – transformer hyperparameters
vocab_size – vocabulary size
max_sequence_length – maximum size of sequence. This is used for positional embedding
embedding_dropout_prob – dropout probability for embeddings
num_tokentypes – size of the token-type embeddings. 0 value will ignore this embedding
- forward(enc_input_ids, enc_position_ids, enc_attn_mask, dec_input_ids=None, dec_position_ids=None, dec_attn_mask=None, enc_dec_attn_mask=None, tokentype_ids=None, layer_past=None, get_key_value=False, pooling_sequence_index=0, enc_hidden_states=None, output_enc_hidden=False)
- load_state_dict(state_dict, strict=True)
Customized load.
- set_input_tensor(input_tensor)
See megatron.model.transformer.set_input_tensor()
- state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)
For easy load.
- model.denoising.model.waveform_model.get_waveform_model(num_tokentypes, add_pooler, encoder_attn_mask_type, init_method=None, scaled_init_method=None, add_decoder=False, decoder_attn_mask_type=AttnMaskType.causal, pre_process=True, post_process=True, get_atten_value=False)
Build language model and return along with the key to save.
- model.denoising.model.waveform_model.parallel_gw_logits(input_, word_embeddings_weight, parallel_output, bias=None)
LM logits using word embedding weights.
model.denoising.model.waveformer_model module
WaveFormer model.
- class model.denoising.model.waveformer_model.GWHead(*args: Any, **kwargs: Any)
Bases: MegatronModule
Masked GW head for WaveFormer.
- Parameters:
hidden_size – hidden size
init_method – init method for weight initialization
layernorm_epsilon – tolerance for layer norm divisions
parallel_output – whether the output logits are distributed or not.
- forward(hidden_states, word_embeddings_weight)
- class model.denoising.model.waveformer_model.WaveFormerModel(*args: Any, **kwargs: Any)
Bases: MegatronModule
BERT language model.
- forward(bert_model_input, attention_mask, tokentype_ids=None, gw_labels=None)
- load_state_dict(state_dict, strict=True)
Customized load.
- set_input_tensor(input_tensor)
See megatron.model.transformer.set_input_tensor()
- state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)
For easy load when model is combined with other heads, add an extra key.
- model.denoising.model.waveformer_model.bert_position_ids(token_ids, dets=1)
- model.denoising.model.waveformer_model.gw_extended_attention_mask(attention_mask)
- model.denoising.model.waveformer_model.post_waveform_model_processing(gw_output, pooled_output, gw_head, binary_head, logit_weights, get_atten_value)