model.denoising.mpu package

Submodules

model.denoising.mpu.data module

model.denoising.mpu.data.broadcast_data(keys, data, datatype)

Broadcast data from rank zero of each model parallel group to the members of the same model parallel group.

Parameters:

keys – list of keys in the data dictionary to be broadcast.
data – data dictionary of string keys and cpu tensor values.
datatype – torch data type of all tensors in data associated with keys.

model.denoising.mpu.initialize module

Model and data parallel groups.

model.denoising.mpu.initialize.destroy_model_parallel(): Set the groups to none.

model.denoising.mpu.initialize.get_data_parallel_group(): Get the data parallel group the caller rank belongs to.

model.denoising.mpu.initialize.get_data_parallel_rank(): Return my rank for the data parallel group.

model.denoising.mpu.initialize.get_data_parallel_world_size(): Return world size for the data parallel group.

model.denoising.mpu.initialize.get_embedding_group(): Get the embedding group the caller rank belongs to.

model.denoising.mpu.initialize.get_model_parallel_group(): Get the model parallel group the caller rank belongs to.

model.denoising.mpu.initialize.get_pipeline_model_parallel_first_rank()

model.denoising.mpu.initialize.get_pipeline_model_parallel_group(): Get the pipeline model parallel group the caller rank belongs to.

model.denoising.mpu.initialize.get_pipeline_model_parallel_last_rank()

model.denoising.mpu.initialize.get_pipeline_model_parallel_next_rank()

model.denoising.mpu.initialize.get_pipeline_model_parallel_prev_rank()

model.denoising.mpu.initialize.get_pipeline_model_parallel_rank(): Return my rank for the pipeline model parallel group.

model.denoising.mpu.initialize.get_pipeline_model_parallel_world_size(): Return world size for the pipeline model parallel group.

model.denoising.mpu.initialize.get_tensor_model_parallel_group(): Get the tensor model parallel group the caller rank belongs to.

model.denoising.mpu.initialize.get_tensor_model_parallel_rank(): Return my rank for the tensor model parallel group.

model.denoising.mpu.initialize.get_tensor_model_parallel_src_rank(): Calculate the global rank corresponding to the first local rank in the tensor model parallel group.

model.denoising.mpu.initialize.get_tensor_model_parallel_world_size(): Return world size for the tensor model parallel group.

model.denoising.mpu.initialize.get_virtual_pipeline_model_parallel_rank(): Return the virtual pipeline-parallel rank.

model.denoising.mpu.initialize.get_virtual_pipeline_model_parallel_world_size(): Return the virtual pipeline-parallel world size.

model.denoising.mpu.initialize.initialize_model_parallel(tensor_model_parallel_size_=1, pipeline_model_parallel_size_=1, virtual_pipeline_model_parallel_size_=None)

Initialize model data parallel groups.

Parameters:

tensor_model_parallel_size – number of GPUs used to parallelize model tensor.
pipeline_model_parallel_size – number of GPUs used to parallelize model pipeline.

Let’s say we have a total of 16 GPUs denoted by g0 … g15 and we use 2 GPUs to parallelize the model tensor, and 4 GPUs to parallelize the model pipeline. The present function will create 8 tensor model-parallel groups, 4 pipeline model-parallel groups and 8 data-parallel groups as:

8 data_parallel groups:
[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]

8 tensor model-parallel groups:
[g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]

4 pipeline model-parallel groups:
[g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]

Note that for efficiency, the caller should make sure adjacent ranks are on the same DGX box. For example, if we are using 2 DGX-1 boxes with a total of 16 GPUs, ranks 0 to 7 belong to the first box and ranks 8 to 15 belong to the second box.
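The grouping in the 16-GPU example above can be reproduced with plain rank arithmetic. The following is a minimal sketch, not the package's implementation: build_parallel_groups is a hypothetical helper that only computes the rank lists and creates no torch.distributed process groups.

```python
def build_parallel_groups(world_size, tensor_size, pipeline_size):
    """Return (data, tensor, pipeline) rank lists for the given sizes.

    Hypothetical helper for illustration; it only computes rank lists.
    """
    assert world_size % (tensor_size * pipeline_size) == 0
    num_tensor_groups = world_size // tensor_size
    num_pipeline_groups = world_size // pipeline_size

    # Tensor groups: consecutive blocks of `tensor_size` ranks.
    tensor_groups = [
        list(range(i * tensor_size, (i + 1) * tensor_size))
        for i in range(num_tensor_groups)
    ]

    # Pipeline groups: ranks strided by the number of pipeline groups.
    pipeline_groups = [
        list(range(i, world_size, num_pipeline_groups))
        for i in range(num_pipeline_groups)
    ]

    # Data groups: within each pipeline stage, the ranks that share the
    # same position inside their tensor group.
    data_groups = []
    for stage in range(pipeline_size):
        start = stage * num_pipeline_groups
        end = (stage + 1) * num_pipeline_groups
        for j in range(tensor_size):
            data_groups.append(list(range(start + j, end, tensor_size)))

    return data_groups, tensor_groups, pipeline_groups


data, tensor, pipeline = build_parallel_groups(16, 2, 4)
print(data)      # 8 data-parallel groups of 2 ranks each
print(tensor)    # 8 tensor model-parallel groups of 2 ranks each
print(pipeline)  # 4 pipeline model-parallel groups of 4 ranks each
```

With world_size=16, tensor_size=2 and pipeline_size=4, the three lists match the groups listed above, with gN written as the integer N.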

model.denoising.mpu.initialize.is_pipeline_first_stage(ignore_virtual=False): Return True if in the first pipeline model-parallel stage, False otherwise.

model.denoising.mpu.initialize.is_pipeline_last_stage(ignore_virtual=False): Return True if in the last pipeline model-parallel stage, False otherwise.

model.denoising.mpu.initialize.is_unitialized(): Useful for code segments that may be accessed with or without mpu initialization.

model.denoising.mpu.initialize.model_parallel_is_initialized(): Check if model and data parallel groups are initialized.

model.denoising.mpu.initialize.set_pipeline_model_parallel_rank(rank): Set pipeline model parallel rank.

model.denoising.mpu.initialize.set_pipeline_model_parallel_world_size(world_size): Set the pipeline model parallel size.

model.denoising.mpu.initialize.set_tensor_model_parallel_rank(rank): Set tensor model parallel rank.

model.denoising.mpu.initialize.set_tensor_model_parallel_world_size(world_size): Set the tensor model parallel size.

model.denoising.mpu.initialize.set_virtual_pipeline_model_parallel_rank(rank): Set the virtual pipeline-parallel rank.

model.denoising.mpu.layers module

class model.denoising.mpu.layers.ColumnParallelLinear(*args: Any, **kwargs: Any)

Bases: Module

Linear layer with column parallelism.

The linear layer is defined as Y = XA + b. A is parallelized along its second dimension as A = [A_1, …, A_p].

Parameters:

input_size – first dimension of matrix A.
output_size – second dimension of matrix A.
bias – If true, add bias
gather_output – If true, call all-gather on the output and make Y available to all GPUs; otherwise, every GPU keeps its own output, which is Y_i = XA_i.
init_method – method to initialize weights. Note that bias is always set to zero.
stride – For the strided linear layers.
keep_master_weight_for_test – This was added for testing and should be set to False. It returns the master weights used for initialization.
skip_bias_add – This was added to enable performance optimizations where the bias can be fused with other elementwise operations. The layer skips adding the bias and returns it instead.

forward(input_)
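As a rough illustration of the column split described above, the sketch below emulates p partitions on a single process using plain Python lists. It is not the layer's implementation: matmul and column_parallel_linear are hypothetical helpers, the all-gather is replaced by list concatenation, and giving each partition its own slice of the bias is an assumption made for the example.

```python
def matmul(x, a):
    """Multiply two matrices given as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*a)]
            for row in x]


def column_parallel_linear(x, a, b, p, gather_output=True):
    """Emulate Y = XA + b with A split into p column blocks (one process)."""
    width = len(a[0]) // p  # columns per partition; assumes p divides output_size
    outputs = []
    for i in range(p):
        cols = slice(i * width, (i + 1) * width)
        a_i = [row[cols] for row in a]  # A_i, the block held by "GPU" i
        y_i = matmul(x, a_i)            # Y_i = X A_i
        # Assumption: each partition adds its own slice of the bias.
        outputs.append([[v + c for v, c in zip(row, b[cols])] for row in y_i])
    if not gather_output:
        return outputs  # one Y_i per partition
    # Stand-in for all-gather: concatenate the Y_i along the last dimension.
    return [sum((y_i[r] for y_i in outputs), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
a = [[1, 0, 2, 1], [0, 1, 1, 3]]
b = [1, 1, 1, 1]
print(column_parallel_linear(x, a, b, p=2))
```

The gathered result equals the unpartitioned XA + b, which is the property that makes the split safe.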

class model.denoising.mpu.layers.RowParallelLinear(*args: Any, **kwargs: Any)

Bases: Module

Linear layer with row parallelism.

The linear layer is defined as Y = XA + b. A is parallelized along its first dimension and X along its second dimension as:

A = [A_1; …; A_p] (row blocks stacked along the first dimension), X = [X_1, …, X_p]

Parameters:

input_size – first dimension of matrix A.
output_size – second dimension of matrix A.
bias – If true, add bias. Note that bias is not parallelized.
input_is_parallel – If true, we assume that the input is already split across the GPUs and we do not split again.
init_method – method to initialize weights. Note that bias is always set to zero.
stride – For the strided linear layers.
keep_master_weight_for_test – This was added for testing and should be set to False. It returns the master weights used for initialization.
skip_bias_add – This was added to enable performance optimizations where the bias can be fused with other elementwise operations. The layer skips adding the bias and returns it instead.

forward(input_)
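The row split can be emulated in the same spirit on a single process. This is a sketch under stated assumptions, not the layer's implementation: matmul and row_parallel_linear are hypothetical helpers, and the cross-GPU reduction is replaced by an elementwise sum of the partial products. The bias is added once after the sum, consistent with the note that it is not parallelized.

```python
def matmul(x, a):
    """Multiply two matrices given as lists of rows."""
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*a)]
            for row in x]


def row_parallel_linear(x, a, b, p):
    """Emulate Y = XA + b with A split into p row blocks (one process)."""
    size = len(a) // p  # rows of A per partition; assumes p divides input_size
    partials = []
    for i in range(p):
        block = slice(i * size, (i + 1) * size)
        a_i = a[block]                      # A_i: row block of A
        x_i = [row[block] for row in x]     # X_i: matching column block of X
        partials.append(matmul(x_i, a_i))   # X_i A_i
    # Stand-in for the reduction across GPUs: sum the partial products.
    y = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    # The bias is not partitioned, so it is added once after the sum.
    return [[v + c for v, c in zip(row, b)] for row in y]


x = [[1, 2, 3, 4]]
a = [[1, 0], [0, 1], [2, 1], [1, 3]]
b = [1, 1]
print(row_parallel_linear(x, a, b, p=2))
```

Because XA = X_1 A_1 + … + X_p A_p, summing the partial products recovers the unpartitioned result.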

class model.denoising.mpu.layers.VocabParallelEmbedding(*args: Any, **kwargs: Any)

Bases: Module

Embedding parallelized in the vocabulary dimension.

This is mainly adapted from torch.nn.Embedding and all the default values are kept.

Parameters:

num_embeddings – vocabulary size.
embedding_dim – size of hidden state.
init_method – method to initialize weights.

forward(input_)
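One common way to parallelize an embedding along the vocabulary dimension is for each rank to hold a contiguous slice of the table, look up only the tokens that fall in its slice, and sum the per-rank results. The sketch below emulates that idea on one process with plain lists. It is an assumption about the general technique rather than a description of this class's forward method, and vocab_parallel_lookup is a hypothetical name.

```python
def vocab_parallel_lookup(token_ids, table, world_size):
    """Emulate a vocabulary-partitioned embedding lookup on one process."""
    per_partition = len(table) // world_size  # assumes an even split
    dim = len(table[0])
    output = [[0.0] * dim for _ in token_ids]
    for rank in range(world_size):
        first = rank * per_partition
        last = first + per_partition
        shard = table[first:last]  # rows of the table owned by this rank
        for pos, token in enumerate(token_ids):
            if first <= token < last:
                row = shard[token - first]
                # Stand-in for the cross-rank sum: tokens outside
                # [first, last) contribute nothing from this rank.
                output[pos] = [o + r for o, r in zip(output[pos], row)]
    return output


table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
print(vocab_parallel_lookup([3, 0, 2], table, world_size=2))
```

Each token id falls in exactly one rank's range, so the summed output equals an ordinary lookup in the full table.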

model.denoising.mpu.layers.copy_tensor_model_parallel_attributes(destination_tensor, source_tensor)

model.denoising.mpu.layers.param_is_not_tensor_parallel_duplicate(param)

model.denoising.mpu.layers.set_defaults_if_not_set_tensor_model_parallel_attributes(tensor)

model.denoising.mpu.layers.set_tensor_model_parallel_attributes(tensor, is_parallel, dim, stride)

model.denoising.mpu.mappings module

model.denoising.mpu.mappings.copy_to_tensor_model_parallel_region(input_)

model.denoising.mpu.mappings.gather_from_tensor_model_parallel_region(input_)

model.denoising.mpu.mappings.reduce_from_tensor_model_parallel_region(input_)

model.denoising.mpu.mappings.scatter_to_tensor_model_parallel_region(input_)

model.denoising.mpu.random module

class model.denoising.mpu.random.CheckpointFunction(*args: Any, **kwargs: Any)

Bases: Function

This function is adapted from torch.utils.checkpoint with two main changes:

torch.cuda.set_rng_state is replaced with _set_cuda_rng_state

the states in the model parallel tracker are also properly tracked/set/reset.

static backward(ctx, *args)

static forward(ctx, run_function, *args)

class model.denoising.mpu.random.CudaRNGStatesTracker

Bases: object

Tracker for the cuda RNG states.

Using the add method, a cuda rng state is initialized based on the input seed and is assigned to name. Later, by forking the rng state, we can perform operations and return to our starting cuda state.

add(name, seed): Track the rng state.

fork(name='model-parallel-rng'): Fork the cuda rng state, perform operations, and exit with the original state.

get_states(): Get rng states. Copy the dictionary so we have direct pointers to the states, not just a pointer to the dictionary.

reset(): Set to the initial state (no tracker).

set_states(states): Set the rng states. For efficiency purposes, we do not check the size of seed for compatibility.
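The add/fork pattern can be illustrated on the CPU with Python's random module. This is only an analogue under stated assumptions: RNGStatesTracker is a hypothetical class that swaps the global random state where the real tracker handles CUDA RNG states, and its error handling is invented for the example.

```python
import contextlib
import random


class RNGStatesTracker:
    """CPU analogue of a named RNG state tracker (illustrative only)."""

    def __init__(self):
        self.states = {}

    def add(self, name, seed):
        """Create a state from `seed` and store it under `name`."""
        if name in self.states:
            raise ValueError(f"rng state {name} already exists")
        original = random.getstate()
        random.seed(seed)
        self.states[name] = random.getstate()
        random.setstate(original)  # leave the global stream untouched

    @contextlib.contextmanager
    def fork(self, name="model-parallel-rng"):
        """Run the body on the tracked stream, then restore the original."""
        original = random.getstate()
        random.setstate(self.states[name])
        try:
            yield
        finally:
            self.states[name] = random.getstate()  # remember progress
            random.setstate(original)


tracker = RNGStatesTracker()
tracker.add("model-parallel-rng", 1234)
with tracker.fork():
    sample = random.random()  # drawn from the tracked stream
```

Draws made inside the fork advance only the tracked state, so the surrounding code sees the same global random sequence it would have seen without the fork.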

model.denoising.mpu.random.checkpoint(function, *args): Checkpoint a model or part of the model. This has been directly copied from torch.utils.checkpoint.

model.denoising.mpu.random.gather_split_1d_tensor(tensor): Opposite of split_tensor_into_1d_equal_chunks: gather values from model parallel ranks.

model.denoising.mpu.random.get_cuda_rng_tracker(): Get cuda rng tracker.

model.denoising.mpu.random.init_checkpointed_activations_memory_buffer(): Initialize the memory buffer for the checkpointed activations.

model.denoising.mpu.random.model_parallel_cuda_manual_seed(seed)

Initialize model parallel cuda seed.

This function should be called after model parallelism is initialized. Also, torch.cuda.manual_seed should not be called after this function; this function is a replacement for it. Two sets of RNG states are tracked:

default state: This is for data parallelism and is the same among a set of model parallel GPUs but different across different model parallel groups. This is used, for example, for dropout in the non-tensor-model-parallel regions.

tensor-model-parallel state: This state is different among a set of model parallel GPUs, but the same across data parallel groups. This is used, for example, for dropout in model parallel regions.

model.denoising.mpu.random.reset_checkpointed_activations_memory_buffer(): Reset the memory used for checkpointing.

model.denoising.mpu.random.split_tensor_into_1d_equal_chunks(tensor): Break a tensor into equal 1D chunks.
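The split and gather operations are inverses of each other. A minimal list-based sketch follows, assuming that each rank keeps one contiguous chunk of the flattened data. The helper names, the explicit rank and world_size arguments, and the use of nested lists in place of tensors are all hypothetical.

```python
def split_into_1d_equal_chunks(rows, rank, world_size):
    """Flatten `rows` and return the contiguous chunk owned by `rank`."""
    flat = [v for row in rows for v in row]
    assert len(flat) % world_size == 0, "data must split evenly"
    chunk = len(flat) // world_size
    return flat[rank * chunk:(rank + 1) * chunk]


def gather_split_1d(chunks):
    """Concatenate the per-rank chunks back into one flat list."""
    return [v for chunk in chunks for v in chunk]


rows = [[1, 2, 3, 4], [5, 6, 7, 8]]
chunks = [split_into_1d_equal_chunks(rows, r, 4) for r in range(4)]
print(chunks)
print(gather_split_1d(chunks))
```

Gathering the chunks from all ranks in rank order returns the flattened original.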

model.denoising.mpu.utils module

class model.denoising.mpu.utils.VocabUtility

Bases: object

Split the vocabulary into world_size chunks and return the first and last index of the vocabulary belonging to the rank partition. Note that the indices are in [first, last).

static vocab_range_from_global_vocab_size(global_vocab_size, rank, world_size)

static vocab_range_from_per_partition_vocab_size(per_partition_vocab_size, rank, world_size)
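The half-open range for a rank follows directly from the partition size. Below is a sketch written as free functions, assuming an even split of the vocabulary; the bodies are a plausible reading of the description above, not code taken from the package.

```python
def vocab_range_from_per_partition_vocab_size(per_partition_vocab_size,
                                              rank, world_size):
    """Return (first, last) so that the rank owns indices in [first, last)."""
    first = rank * per_partition_vocab_size
    last = first + per_partition_vocab_size
    return first, last


def vocab_range_from_global_vocab_size(global_vocab_size, rank, world_size):
    """Split the vocabulary evenly, then return the range owned by `rank`."""
    assert global_vocab_size % world_size == 0, "vocabulary must split evenly"
    per_partition = global_vocab_size // world_size
    return vocab_range_from_per_partition_vocab_size(
        per_partition, rank, world_size)


print(vocab_range_from_global_vocab_size(1000, rank=1, world_size=4))
```

For a vocabulary of 1000 split across 4 ranks, rank 1 owns indices 250 through 499.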

model.denoising.mpu.utils.divide(numerator, denominator): Ensure that numerator is divisible by the denominator and return the division value.

model.denoising.mpu.utils.ensure_divisibility(numerator, denominator): Ensure that numerator is divisible by the denominator.
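A minimal sketch of how these two helpers could behave is shown below. Using an assertion to signal failure and the wording of its message are assumptions made for the example.

```python
def ensure_divisibility(numerator, denominator):
    """Fail if `numerator` is not a multiple of `denominator`."""
    assert numerator % denominator == 0, (
        f"{numerator} is not divisible by {denominator}")


def divide(numerator, denominator):
    """Check divisibility, then return the exact integer quotient."""
    ensure_divisibility(numerator, denominator)
    return numerator // denominator


hidden_per_partition = divide(1024, 8)  # e.g. a hidden size split over 8 GPUs
```

Checks like this are useful when partitioning a dimension across GPUs, where an uneven split would otherwise surface later as a shape mismatch.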

model.denoising.mpu.utils.split_tensor_along_last_dim(tensor, num_partitions, contiguous_split_chunks=False): Split a tensor along its last dimension.

Parameters:

tensor – input tensor.
num_partitions – number of partitions to split the tensor into.
contiguous_split_chunks – If True, make each chunk contiguous in memory.

Module contents

Model parallel utility interface.