model.denoising.model package

Submodules

model.denoising.model.distributed module

class model.denoising.model.distributed.DistributedDataParallel(*args: Any, **kwargs: Any)

Bases: DistributedDataParallelBase

DDP with the option to use contiguous buffers to store and accumulate gradients. This class:

  • has the potential to reduce memory fragmentation.

  • provides the option to do the gradient accumulation in a type other than the parameter type (for example, fp32).

Parameters:
  • module – input model.

  • accumulate_allreduce_grads_in_fp32 – if true, do both the gradient accumulation and the gradient all-reduce in float32. This option requires use_contiguous_buffers to be true as well.

  • use_contiguous_buffers – if true, use a contiguous buffer to store the gradients.
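
To illustrate why accumulating gradients in a wider type than the parameters can matter, the sketch below sums many small values while rounding the running total to half precision and to single precision after every step. It uses only the standard library (the struct formats 'e' and 'f') rather than tensors. The names round_to and accumulate and the chosen values are hypothetical and are not part of this package.

```python
import struct


def round_to(fmt, value):
    """Round a Python float to the precision of a struct format ('e' = fp16, 'f' = fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def accumulate(fmt, grad, steps):
    """Add `grad` to a running total `steps` times, rounding after every addition."""
    total = 0.0
    grad = round_to(fmt, grad)
    for _ in range(steps):
        total = round_to(fmt, total + grad)
    return total


# The exact sum of 10,000 increments of 1e-4 would be 1.0.
fp16_total = accumulate("e", 1e-4, 10_000)
fp32_total = accumulate("f", 1e-4, 10_000)
print(fp16_total, fp32_total)
```

In half precision the running total stops growing once the increment falls below half the spacing between neighbouring representable values, while the single-precision total stays close to the exact sum.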

allreduce_gradients()

Reduce gradients across data parallel ranks.

zero_grad_buffer()

Set the grad buffer data to zero. Needs to be called at the beginning of each iteration.

class model.denoising.model.distributed.DistributedDataParallelBase(*args: Any, **kwargs: Any)

Bases: MegatronModule, ABC

Abstract class for DDP.

abstract allreduce_gradients()
forward(*inputs, **kwargs)
load_state_dict(state_dict, strict=True)
state_dict(destination=None, prefix='', keep_vars=False)
state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)

Use this function to override the state dict for saving checkpoints.

class model.denoising.model.distributed.MemoryBuffer(numel, dtype)

Bases: object

get(shape, start_index)

Return a tensor with the input shape as a view into the 1-D data starting at start_index.

zero()

Reset the buffer to zero.
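
The view semantics of get() can be sketched with the standard library alone. The class below is a hypothetical stand-in, not the package's implementation: it assumes float32 storage only, keeps the data in an array.array, and returns a shaped memoryview where the real class presumably returns a tensor. The name MemoryBufferSketch and the bounds check are assumptions.

```python
from array import array
from math import prod


class MemoryBufferSketch:
    """Flat float32 storage that hands out shaped views (stand-in for a tensor buffer)."""

    def __init__(self, numel):
        self.numel = numel
        self.data = array("f", [0.0] * numel)

    def zero(self):
        """Reset every element in place so that existing views observe the change."""
        for i in range(self.numel):
            self.data[i] = 0.0

    def get(self, shape, start_index):
        """Return a view of `shape` over the 1-D data starting at `start_index`."""
        end_index = start_index + prod(shape)
        if end_index > self.numel:
            raise ValueError("requested view exceeds buffer size")
        flat = memoryview(self.data)[start_index:end_index]
        # A memoryview can only be reshaped by casting through a byte format.
        return flat.cast("B").cast("f", shape=list(shape))


buf = MemoryBufferSketch(12)
view = buf.get((2, 3), start_index=4)
view[1, 2] = 5.0  # writes through to flat position 4 + 1*3 + 2 = 9
```

Because the view shares storage with the buffer, writing through the view changes the flat data, and zero() clears what every outstanding view sees.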

model.denoising.model.enums module

class model.denoising.model.enums.AttnMaskType(value)

Bases: Enum

An enumeration.

causal = 2
padding = 1
class model.denoising.model.enums.AttnType(value)

Bases: Enum

An enumeration.

cross_attn = 2
self_attn = 1
value_conv = 3
class model.denoising.model.enums.LayerType(value)

Bases: Enum

An enumeration.

decoder = 2
encoder = 1

model.denoising.model.fused_bias_gelu module

class model.denoising.model.fused_bias_gelu.GeLUFunction(*args: Any, **kwargs: Any)

Bases: Function

static backward(ctx, grad_output)
static forward(ctx, input, bias)
model.denoising.model.fused_bias_gelu.bias_gelu(bias, y)
model.denoising.model.fused_bias_gelu.bias_gelu_back(g, bias, y)

model.denoising.model.fused_layer_norm module

This code is copied from NVIDIA apex:

https://github.com/NVIDIA/apex

with some changes.

class model.denoising.model.fused_layer_norm.FusedLayerNormAffineFunction(*args: Any, **kwargs: Any)

Bases: Function

static backward(ctx, grad_output)
static forward(ctx, input, weight, bias, normalized_shape, eps)
class model.denoising.model.fused_layer_norm.MixedFusedLayerNorm(*args: Any, **kwargs: Any)

Bases: Module

forward(input)
reset_parameters()

model.denoising.model.fused_softmax module

class model.denoising.model.fused_softmax.FusedScaleMaskSoftmax(*args: Any, **kwargs: Any)

Bases: Module

Fused operation: scaling + mask + softmax.

Parameters:
  • input_in_fp16 – flag to indicate if the input is in fp16 data format.

  • attn_mask_type – attention mask type (pad or causal).

  • mask_func – mask function to be applied.

  • softmax_in_fp32 – if true, softmax is performed at fp32 precision.

  • scale – scaling factor used in input tensor scaling.

forward(input, mask)
class model.denoising.model.fused_softmax.ScaledMaskedSoftmax(*args: Any, **kwargs: Any)

Bases: Function

Fused operation which performs the following three operations in sequence:

  1. Scale the tensor.

  2. Apply the mask.

  3. Perform softmax.

static backward(ctx, output_grads)
static forward(ctx, inputs, mask, scale)
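
The scale, mask, softmax sequence can be sketched without fusion on a single row of attention scores. This is an unfused pure-Python illustration, not the kernel used by the class. It assumes that a true mask entry marks a position to be hidden and that hidden positions are filled with -10000.0; both the convention and the name scaled_masked_softmax_row are assumptions.

```python
import math


def scaled_masked_softmax_row(scores, mask, scale, fill_value=-10000.0):
    """Scale one row of scores, overwrite masked positions, then apply softmax."""
    scaled = [s * scale for s in scores]
    masked = [fill_value if hidden else s for s, hidden in zip(scaled, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


probs = scaled_masked_softmax_row([2.0, 4.0, 6.0], [False, False, True], scale=0.5)
print(probs)
```

The masked position ends up with a probability that is negligible, and the remaining positions share the probability mass.
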
class model.denoising.model.fused_softmax.ScaledUpperTriangMaskedSoftmax(*args: Any, **kwargs: Any)

Bases: Function

Fused operation which performs the following three operations in sequence:

  1. Scale the tensor.

  2. Apply upper triangular mask (typically used in GPT models).

  3. Perform softmax.

static backward(ctx, output_grads)
static forward(ctx, inputs, scale)
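
For the upper triangular (causal) case, a minimal unfused sketch follows. It assumes a square score matrix in which row i may attend only to columns 0..i, and it assigns exactly zero probability to the masked positions instead of filling them with a large negative value. The name scaled_causal_softmax is hypothetical.

```python
import math


def scaled_causal_softmax(scores, scale):
    """Apply scaling and a causal mask, then softmax, to each row of a square matrix."""
    out = []
    for i, row in enumerate(scores):
        visible = [s * scale for s in row[: i + 1]]
        peak = max(visible)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Positions above the diagonal are masked and receive zero probability.
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out


probs = scaled_causal_softmax(
    [[1.0, 2.0, 3.0], [1.0, 1.0, 5.0], [0.0, 0.0, 0.0]], scale=1.0
)
print(probs)
```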

model.denoising.model.module module

Megatron Module

class model.denoising.model.module.Float16Module(*args: Any, **kwargs: Any)

Bases: MegatronModule

forward(*inputs, **kwargs)
load_state_dict(state_dict, strict=True)
state_dict(destination=None, prefix='', keep_vars=False)
state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)

Use this function to override the state dict for saving checkpoints.

class model.denoising.model.module.MegatronModule(*args: Any, **kwargs: Any)

Bases: Module

Megatron specific extensions of torch Module with support for pipelining.

initialize_word_embeddings(init_method_normal)
state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)

Use this function to override the state dict for saving checkpoints.

word_embeddings_weight()
model.denoising.model.module.conversion_helper(val, conversion)

Apply conversion to val. Recursively apply conversion if val is a nested tuple/list structure.
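
The recursion described above can be sketched as follows. This is a minimal illustration under two assumptions: non-container values are passed directly to conversion, and the container type (tuple or list) is preserved. The name convert_nested is hypothetical.

```python
def convert_nested(val, conversion):
    """Apply `conversion` to every leaf of a nested tuple/list structure."""
    if not isinstance(val, (tuple, list)):
        return conversion(val)
    converted = [convert_nested(item, conversion) for item in val]
    # Preserve the original container type.
    return tuple(converted) if isinstance(val, tuple) else converted


converted = convert_nested((1, [2, (3,)]), lambda leaf: leaf * 2)
print(converted)
```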

model.denoising.model.module.float16_to_fp32(val)

Convert fp16/bf16 val to fp32

model.denoising.model.module.fp32_to_float16(val, float16_convertor)

Convert fp32 val to fp16/bf16

model.denoising.model.module.param_is_not_shared(param)

model.denoising.model.transformer module

Transformer.

class model.denoising.model.transformer.ParallelAttention(*args: Any, **kwargs: Any)

Bases: MegatronModule

Parallel self-attention layer abstract class.

Self-attention layer takes input with size [b, s, h] and returns output of the same size.

forward(hidden_states, attention_mask, layer_past=None, get_key_value=False, encoder_output=None, get_atten_value=False)
class model.denoising.model.transformer.ParallelMLP(*args: Any, **kwargs: Any)

Bases: MegatronModule

MLP.

MLP will take the input with h hidden state, project it to 4*h hidden dimension, perform nonlinear transformation, and project the state back into h hidden dimension. At the end, dropout is also applied.

forward(hidden_states)
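
The h to 4*h to h projection can be sketched in pure Python for a single hidden vector. The sketch assumes GeLU as the nonlinear transformation and omits dropout, tensor parallelism, and batching; the helper names linear and mlp_forward and the weight values are hypothetical.

```python
import math


def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute weight @ x + bias for a weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def mlp_forward(x, w_in, b_in, w_out, b_out):
    intermediate = [gelu(v) for v in linear(x, w_in, b_in)]  # h -> 4*h, then GeLU
    return linear(intermediate, w_out, b_out)                # 4*h -> h


h = 2
w_in = [[0.1] * h for _ in range(4 * h)]      # shape [4*h, h]
b_in = [0.0] * (4 * h)
w_out = [[0.05] * (4 * h) for _ in range(h)]  # shape [h, 4*h]
b_out = [0.0] * h
x = [1.0, 2.0]
output = mlp_forward(x, w_in, b_in, w_out, b_out)
```
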
class model.denoising.model.transformer.ParallelTransformer(*args: Any, **kwargs: Any)

Bases: MegatronModule

Transformer class.

forward(hidden_states, attention_mask, layer_past=None, get_key_value=False, encoder_output=None, enc_dec_attn_mask=None)
set_input_tensor(input_tensor)

Set input tensor to be used instead of forward()’s input.

When doing pipeline parallelism the input from the previous stage comes from communication, not from the input, so the model’s forward_step_func won’t have it. This function is thus used by internal code to bypass the input provided by the forward_step_func.
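
The bypass described above can be sketched with a toy stage class. This is a hypothetical illustration rather than the package's logic: it assumes that forward() uses the injected tensor whenever one has been set, whereas the real model may decide this from other state. The class name PipelineStageSketch and the list-based "tensors" are assumptions.

```python
class PipelineStageSketch:
    """Toy stage whose forward() prefers a tensor injected by the pipeline scheduler."""

    def __init__(self, scale):
        self.scale = scale
        self.input_tensor = None

    def set_input_tensor(self, input_tensor):
        self.input_tensor = input_tensor

    def forward(self, hidden_states):
        # On later stages, use the activations received from the previous stage.
        if self.input_tensor is not None:
            hidden_states = self.input_tensor
        return [h * self.scale for h in hidden_states]


first = PipelineStageSketch(2.0)
second = PipelineStageSketch(10.0)
received = first.forward([1.0, 2.0])  # stands in for inter-stage communication
second.set_input_tensor(received)
output = second.forward(None)         # the argument is ignored on this stage
```
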

class model.denoising.model.transformer.ParallelTransformerLayer(*args: Any, **kwargs: Any)

Bases: MegatronModule

A single transformer layer.

Transformer layer takes input with size [b, s, h] and returns an output of the same size.

forward(hidden_states, attention_mask, encoder_output=None, enc_dec_attn_mask=None, layer_past=None, get_key_value=False, get_atten_value=False)
class model.denoising.model.transformer.SwiGLU(*args: Any, **kwargs: Any)

Bases: Module

forward(x)
model.denoising.model.transformer.bias_dropout_add(x: Tensor, bias: Tensor, residual: Tensor, prob: float, training: bool) Tensor
model.denoising.model.transformer.bias_dropout_add_fused_inference(x: Tensor, bias: Tensor, residual: Tensor, prob: float) Tensor
model.denoising.model.transformer.bias_dropout_add_fused_train(x: Tensor, bias: Tensor, residual: Tensor, prob: float) Tensor
model.denoising.model.transformer.get_bias_dropout_add(training)
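
The bias, dropout, and residual-add pattern suggested by these signatures can be sketched on plain lists. The sketch assumes the computation is residual + dropout(x + bias) with inverted dropout, and that get_bias_dropout_add returns a function with the training flag already bound. The rng parameter is hypothetical and exists only to make the sketch testable.

```python
import random


def bias_dropout_add(x, bias, residual, prob, training, rng=random):
    """Return residual + dropout(x + bias) for equal-length lists of floats."""
    out = []
    for xi, bi, ri in zip(x, bias, residual):
        value = xi + bi
        if training and prob > 0.0:
            # Inverted dropout: zero with probability `prob`, rescale the survivors.
            value = 0.0 if rng.random() < prob else value / (1.0 - prob)
        out.append(ri + value)
    return out


def get_bias_dropout_add(training):
    """Bind the training flag so callers only pass x, bias, residual, and prob."""
    def _bias_dropout_add(x, bias, residual, prob):
        return bias_dropout_add(x, bias, residual, prob, training)
    return _bias_dropout_add


eval_fn = get_bias_dropout_add(training=False)
result = eval_fn([1.0, 2.0], [0.5, 0.5], [10.0, 20.0], prob=0.1)
```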

model.denoising.model.utils module

Utilities for models.

model.denoising.model.utils.attention_mask_func(attention_scores, attention_mask)
model.denoising.model.utils.erf_gelu(x)
model.denoising.model.utils.gelu_impl(x)

OpenAI’s gelu implementation.
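
For comparison, the two GeLU variants can be written for scalar inputs with the standard library. This assumes that gelu_impl uses the widely used tanh approximation and that erf_gelu uses the exact error-function form; the scalar functions below are sketches with hypothetical names and do not operate on tensors.

```python
import math


def tanh_gelu(x):
    """Tanh approximation of GeLU."""
    inner = math.sqrt(2.0 / math.pi) * x * (1.0 + 0.044715 * x * x)
    return 0.5 * x * (1.0 + math.tanh(inner))


def erf_gelu_scalar(x):
    """Exact GeLU expressed with the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


for value in (-1.0, 0.0, 1.0, 2.0):
    print(value, tanh_gelu(value), erf_gelu_scalar(value))
```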

model.denoising.model.utils.get_linear_layer(rows, columns, init_method)

Simple linear layer with weight initialization.

model.denoising.model.utils.init_method_normal(sigma)

Init method based on N(0, sigma).

model.denoising.model.utils.openai_gelu(x)
model.denoising.model.utils.scaled_init_method_normal(sigma, num_layers)

Init method based on N(0, sigma/sqrt(2*num_layers)).
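
The scaling can be sketched as a pair of factory functions. The sketch assumes sigma is a standard deviation and fills a plain Python list in place instead of a tensor; the helper scaled_std and the rng parameter are hypothetical.

```python
import math
import random


def scaled_std(sigma, num_layers):
    """Standard deviation after scaling by 1/sqrt(2 * num_layers)."""
    return sigma / math.sqrt(2.0 * num_layers)


def init_method_normal(sigma, rng=random):
    """Return a function that fills a list with samples from N(0, sigma)."""
    def init_(weights):
        for i in range(len(weights)):
            weights[i] = rng.gauss(0.0, sigma)
        return weights
    return init_


def scaled_init_method_normal(sigma, num_layers, rng=random):
    return init_method_normal(scaled_std(sigma, num_layers), rng)


weights = scaled_init_method_normal(0.02, num_layers=8)([0.0] * 4)
print(scaled_std(0.02, 8), weights)
```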

model.denoising.model.waveform_model module

Transformer-based language model.

class model.denoising.model.waveform_model.Embedding(*args: Any, **kwargs: Any)

Bases: MegatronModule

Language model embeddings.

Parameters:
  • hidden_size – hidden size

  • vocab_size – vocabulary size

  • max_sequence_length – maximum size of sequence. This is used for positional embedding

  • embedding_dropout_prob – dropout probability for embeddings

  • init_method – weight initialization method

  • num_tokentypes – size of the token-type embeddings. A value of 0 disables this embedding

forward(input_ids, position_ids, tokentype_ids=None)
load_state_dict(state_dict, strict=True)

Customized load.

state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)

For easy load.

class model.denoising.model.waveform_model.Pooler(*args: Any, **kwargs: Any)

Bases: MegatronModule

Pooler layer.

Pool hidden states of a specific token (for example start of the sequence) and add a linear transformation followed by a tanh.

Parameters:
  • hidden_size – hidden size

  • init_method – weight initialization method for the linear layer. The bias is set to zero.

forward(hidden_states, sequence_index=0)
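
The pooling step can be sketched on nested lists. The sketch assumes hidden_states is laid out as [batch][sequence][hidden] and that the linear transformation maps hidden to hidden; the function name pooler_forward and the explicit weight and bias arguments are hypothetical.

```python
import math


def pooler_forward(hidden_states, weight, bias, sequence_index=0):
    """Select one token per sample, apply a linear layer, then tanh."""
    pooled = []
    for sample in hidden_states:
        token = sample[sequence_index]
        projected = [
            sum(w * t for w, t in zip(row, token)) + b
            for row, b in zip(weight, bias)
        ]
        pooled.append([math.tanh(v) for v in projected])
    return pooled


hidden_states = [[[0.0, 1.0], [2.0, 3.0]]]  # batch of 1, sequence of 2, hidden size 2
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled = pooler_forward(hidden_states, identity, bias=[0.0, 0.0])
```
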
class model.denoising.model.waveform_model.TransformerWaveformModel(*args: Any, **kwargs: Any)

Bases: MegatronModule

Transformer language model.

Parameters:
  • transformer_hparams – transformer hyperparameters

  • vocab_size – vocabulary size

  • max_sequence_length – maximum size of sequence. This is used for positional embedding

  • embedding_dropout_prob – dropout probability for embeddings

  • num_tokentypes – size of the token-type embeddings. A value of 0 disables this embedding

forward(enc_input_ids, enc_position_ids, enc_attn_mask, dec_input_ids=None, dec_position_ids=None, dec_attn_mask=None, enc_dec_attn_mask=None, tokentype_ids=None, layer_past=None, get_key_value=False, pooling_sequence_index=0, enc_hidden_states=None, output_enc_hidden=False)
load_state_dict(state_dict, strict=True)

Customized load.

set_input_tensor(input_tensor)

See megatron.model.transformer.set_input_tensor()

state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)

For easy load.

model.denoising.model.waveform_model.get_waveform_model(num_tokentypes, add_pooler, encoder_attn_mask_type, init_method=None, scaled_init_method=None, add_decoder=False, decoder_attn_mask_type=AttnMaskType.causal, pre_process=True, post_process=True, get_atten_value=False)

Build language model and return along with the key to save.

model.denoising.model.waveform_model.parallel_gw_logits(input_, word_embeddings_weight, parallel_output, bias=None)

LM logits using word embedding weights.

model.denoising.model.waveformer_model module

WaveFormer model.

class model.denoising.model.waveformer_model.GWHead(*args: Any, **kwargs: Any)

Bases: MegatronModule

Masked GW head for WaveFormer

Parameters:
  • hidden_size – hidden size

  • init_method – init method for weight initialization

  • layernorm_epsilon – tolerance for layer norm divisions

  • parallel_output – whether the output logits are distributed or not.

forward(hidden_states, word_embeddings_weight)
class model.denoising.model.waveformer_model.WaveFormerModel(*args: Any, **kwargs: Any)

Bases: MegatronModule

Bert Language model.

forward(bert_model_input, attention_mask, tokentype_ids=None, gw_labels=None)
load_state_dict(state_dict, strict=True)

Customized load.

set_input_tensor(input_tensor)

See megatron.model.transformer.set_input_tensor()

state_dict_for_save_checkpoint(destination=None, prefix='', keep_vars=False)

For easy load when model is combined with other heads, add an extra key.

model.denoising.model.waveformer_model.bert_position_ids(token_ids, dets=1)
model.denoising.model.waveformer_model.gw_extended_attention_mask(attention_mask)
model.denoising.model.waveformer_model.post_waveform_model_processing(gw_output, pooled_output, gw_head, binary_head, logit_weights, get_atten_value)

Module contents