model.denoising package

Subpackages

Submodules

model.denoising.arguments module

Megatron arguments.

model.denoising.arguments.parse_args(extra_args_provider=None, defaults={}, ignore_unknown_args=False): Parse all arguments.

model.denoising.checkpointing module

Input/output checkpointing.

model.denoising.checkpointing.check_checkpoint_args(checkpoint_args): Ensure that the fixed arguments for a model are the same in the input arguments and in the ones retrieved from the checkpoint.

model.denoising.checkpointing.ensure_directory_exists(filename): Build the directory path of filename if it does not already exist.
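The behaviour described for ensure_directory_exists can be illustrated with a minimal sketch. This is not the package's code: the function name ensure_directory_exists_sketch and the example path components (iter_0000100, model.pt) are hypothetical, and the sketch assumes that only the parent directory is created, never the file itself.

```python
import os
import tempfile


def ensure_directory_exists_sketch(filename):
    """Create the parent directory of `filename` if it is missing."""
    dirname = os.path.dirname(filename)
    # An empty dirname means the file lives in the current directory.
    if dirname:
        os.makedirs(dirname, exist_ok=True)


# Usage: the nested directory is created, the file itself is not.
with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "iter_0000100", "model.pt")  # hypothetical path
    ensure_directory_exists_sketch(target)
    created = os.path.isdir(os.path.dirname(target))
    file_exists = os.path.exists(target)
```

Calling the helper a second time on the same path is harmless because of exist_ok=True.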

model.denoising.checkpointing.fix_query_key_value_ordering(model, checkpoint_version): Fix up the query/key/value matrix ordering if the checkpoint version is smaller than 2.0.

model.denoising.checkpointing.get_checkpoint_name(checkpoints_path, iteration, release=False): A unified checkpoint name.

model.denoising.checkpointing.get_checkpoint_tracker_filename(checkpoints_path): The tracker file records the latest checkpoint during training, so that training can restart from it.

model.denoising.checkpointing.get_checkpoint_version()

model.denoising.checkpointing.load_biencoder_checkpoint(model, only_query_model=False, only_context_model=False, custom_load_path=None): Selectively load retrieval models for indexing/retrieving from saved checkpoints.

model.denoising.checkpointing.load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True): Load a model checkpoint and return the iteration. strict (bool): whether to strictly enforce that the keys in the state_dict of the checkpoint match the names of parameters and buffers in model.

model.denoising.checkpointing.save_checkpoint(iteration, model, optimizer, lr_scheduler): Save a model checkpoint.

model.denoising.checkpointing.set_checkpoint_version(value)

model.denoising.global_vars module

Megatron global variables.

class model.denoising.global_vars.Timers

Bases: object

Group of timers.

log(names, normalizer=1.0, reset=True): Log a group of timers.

write(names, writer, iteration, normalizer=1.0, reset=False): Write timers to a tensorboard writer.
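To make the idea of a "group of timers" concrete, here is a minimal sketch. It is an assumption about how such a group could work, not the package's implementation: the class names TimerSketch and TimersSketch, the call-to-get-a-timer pattern, and the output format are all hypothetical. Only the log(names, normalizer, reset) signature is taken from the listing above.

```python
import time


class TimerSketch:
    """A single named timer that accumulates elapsed time."""

    def __init__(self, name):
        self.name = name
        self.elapsed_ = 0.0
        self.started = False
        self.start_time = 0.0

    def start(self):
        assert not self.started, "timer has already been started"
        self.start_time = time.perf_counter()
        self.started = True

    def stop(self):
        assert self.started, "timer is not started"
        self.elapsed_ += time.perf_counter() - self.start_time
        self.started = False

    def elapsed(self, reset=True):
        value = self.elapsed_
        if reset:
            self.elapsed_ = 0.0
        return value


class TimersSketch:
    """Group of timers, looked up (and created on demand) by name."""

    def __init__(self):
        self.timers = {}

    def __call__(self, name):
        if name not in self.timers:
            self.timers[name] = TimerSketch(name)
        return self.timers[name]

    def log(self, names, normalizer=1.0, reset=True):
        # Report each timer in milliseconds, divided by the normalizer
        # (for example the number of iterations since the last log).
        assert normalizer > 0.0
        parts = []
        for name in names:
            ms = self.timers[name].elapsed(reset=reset) * 1000.0 / normalizer
            parts.append("{}: {:.2f}".format(name, ms))
        line = "time (ms) | " + " | ".join(parts)
        print(line)
        return line


# Usage: time one region and log it.
timers = TimersSketch()
timers("forward").start()
time.sleep(0.01)
timers("forward").stop()
line = timers.log(["forward"], normalizer=1.0)
```

With reset=True, logging clears the accumulated time, so each log line covers only the interval since the previous one.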

model.denoising.global_vars.get_adlr_autoresume(): ADLR autoresume object. It can be None so no need to check if it is initialized.

model.denoising.global_vars.get_args(): Return arguments.

model.denoising.global_vars.get_current_global_batch_size()

model.denoising.global_vars.get_num_microbatches()

model.denoising.global_vars.get_tensorboard_writer(): Return tensorboard writer. It can be None so no need to check if it is initialized.

model.denoising.global_vars.get_timers(): Return timers.

model.denoising.global_vars.get_tokenizer(): Return tokenizer.

model.denoising.global_vars.set_global_variables(extra_args_provider=None, args_defaults={}, ignore_unknown_args=False): Set args, tokenizer, tensorboard-writer, adlr-autoresume, and timers.

model.denoising.global_vars.update_num_microbatches(consumed_samples, consistency_check=True)

model.denoising.initialize module

Megatron initialization.

model.denoising.initialize.initialize_megatron(extra_args_provider=None, args_defaults={}, ignore_unknown_args=False, allow_no_cuda=False): Set global variables, initialize distributed, and set autoresume and random seeds. allow_no_cuda should not be set unless Megatron is being used for CPU-only data processing; in general, leave it unset unless you know what you are doing. Returns a function to finalize distributed environment initialization (optionally, only when args.lazy_mpu_init == True).

model.denoising.initialize.write_args_to_tensorboard(): Write arguments to tensorboard.

model.denoising.learning_rates module

Learning rate decay functions.

class model.denoising.learning_rates.AnnealingLR(optimizer, max_lr, min_lr, warmup_steps, decay_steps, decay_style, use_checkpoint_lr_scheduler=True, override_lr_scheduler=False)

Bases: object

Anneals the learning rate.

get_lr(): Learning rate decay functions from: https://openreview.net/pdf?id=BJYwwY9ll pg. 4

load_state_dict(sd)

state_dict()

step(increment): Set lr for all parameter groups.
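The constructor parameters of AnnealingLR (max_lr, min_lr, warmup_steps, decay_steps, decay_style) suggest a warmup phase followed by a decay toward min_lr. The following is a minimal sketch of one such schedule, not the class's actual code: the function name annealed_lr, the linear warmup, and the supported decay_style values ("constant", "linear", "cosine") are assumptions.

```python
import math


def annealed_lr(num_steps, max_lr, min_lr, warmup_steps, decay_steps,
                decay_style="cosine"):
    """Hypothetical warmup-then-decay schedule; not the package's code."""
    # Linear warmup from 0 to max_lr.
    if warmup_steps > 0 and num_steps <= warmup_steps:
        return max_lr * num_steps / warmup_steps
    if decay_style == "constant":
        return max_lr
    # Past the decay horizon, stay at the floor.
    if num_steps > decay_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1].
    ratio = (num_steps - warmup_steps) / (decay_steps - warmup_steps)
    if decay_style == "linear":
        coeff = 1.0 - ratio
    elif decay_style == "cosine":
        coeff = 0.5 * (math.cos(math.pi * ratio) + 1.0)
    else:
        raise ValueError("unsupported decay style: {}".format(decay_style))
    return min_lr + coeff * (max_lr - min_lr)


# Usage: sample the schedule at a few steps.
warmup_half = annealed_lr(50, 1.0, 0.1, 100, 1000)
cosine_mid = annealed_lr(550, 1.0, 0.1, 100, 1000, "cosine")
linear_end = annealed_lr(1000, 1.0, 0.1, 100, 1000, "linear")
```

In this sketch the coefficient runs from 1 at the end of warmup to 0 at decay_steps, so the learning rate moves from max_lr to min_lr and never drops below min_lr.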

model.denoising.memory module

class model.denoising.memory.MemoryBuffer(name, numel, dtype, track_usage)

Bases: object

Contiguous memory buffer. Allocate a contiguous memory of type dtype and size numel. It is used to reduce memory fragmentation.

Usage: After the allocation, the _start index is set to the first index of the memory. A memory chunk starting from the _start index can be allocated for an input tensor, with the elements of the tensor being copied. The buffer can be reused by resetting the _start index.

add(tensor): Allocate a chunk of memory from the buffer to tensor and copy the values.

get_data(): Return the data currently in use.

is_in_use(): Whether the current buffer holds on to any memory.

numel_in_use(): Return number of elements in use.

print_average_usage(): Print memory usage average over time. We would like this value to be as high as possible.

reset(): Reset the buffer start index to the beginning of the buffer.

class model.denoising.memory.RingMemBuffer(name, num_buffers, numel, dtype, track_usage)

Bases: object

A ring of memory buffers.

get_next_buffer()
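A ring of buffers can be pictured as a fixed list that is handed out in round-robin order. The sketch below is an assumption about that behaviour, not the class's code: RingSketch is a hypothetical name, plain bytearray objects stand in for memory buffers, and any check that the returned buffer is free is omitted.

```python
class RingSketch:
    """Round-robin access to a fixed set of preallocated buffers."""

    def __init__(self, num_buffers, numel):
        self.buffers = [bytearray(numel) for _ in range(num_buffers)]
        self._index = -1  # so that the first call returns buffer 0

    def get_next_buffer(self):
        # Advance and wrap around at the end of the ring.
        self._index = (self._index + 1) % len(self.buffers)
        return self.buffers[self._index]


# Usage: with three buffers, the fourth request returns the first buffer again.
ring = RingSketch(num_buffers=3, numel=4)
handed_out = [ring.get_next_buffer() for _ in range(4)]
wrapped = handed_out[3] is handed_out[0]
distinct = handed_out[0] is not handed_out[1]
```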

model.denoising.memory.allocate_mem_buff(name, numel, dtype, track_usage): Allocate a memory buffer.

model.denoising.memory.get_mem_buff(name): Get the memory buffer.

model.denoising.microbatches module

Megatron number of micro-batches calculators.

class model.denoising.microbatches.ConstantNumMicroBatches(global_batch_size, micro_batch_size, data_parallel_size)

Bases: NumMicroBatchesCalculator

update(consumed_samples, consistency_check)

class model.denoising.microbatches.NumMicroBatchesCalculator

Bases: ABC

get()

get_current_global_batch_size()

abstract update(consumed_samples, consistency_check)

class model.denoising.microbatches.RampupBatchsizeNumMicroBatches(start_batch_size, batch_size_increment, ramup_samples, global_batch_size, micro_batch_size, data_parallel_size)

Bases: NumMicroBatchesCalculator

update(consumed_samples, consistency_check)
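The constructor arguments of the two calculators suggest the arithmetic sketched below. This is an assumption, not the package's code: the function names are hypothetical, and the even split of ramup_samples across increments is one plausible reading of a batch-size ramp-up. The spelling ramup_samples follows the signature listed above.

```python
def constant_num_microbatches(global_batch_size, micro_batch_size,
                              data_parallel_size):
    """Micro-batches each data-parallel rank runs per global batch."""
    per_step = micro_batch_size * data_parallel_size
    assert global_batch_size % per_step == 0, "global batch must divide evenly"
    return global_batch_size // per_step


def rampup_batch_size(consumed_samples, start_batch_size, batch_size_increment,
                      ramup_samples, global_batch_size):
    """Current global batch size during a stepwise ramp-up (assumed scheme)."""
    if consumed_samples > ramup_samples:
        return global_batch_size
    num_increments = (global_batch_size - start_batch_size) // batch_size_increment
    assert num_increments > 0, "nothing to ramp up"
    # Spread the increments evenly over the ramp-up samples.
    samples_per_increment = ramup_samples / num_increments
    steps = int(consumed_samples / samples_per_increment)
    return start_batch_size + steps * batch_size_increment


# Usage: ramp from 32 to 128 in steps of 32 over 1000 samples.
sizes = [rampup_batch_size(c, 32, 32, 1000, 128) for c in (0, 400, 700, 2000)]
n_micro = constant_num_microbatches(128, 4, 8)
```

The number of micro-batches at any point in the ramp-up would then follow by applying the first function to the current batch size.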

model.denoising.microbatches.build_num_microbatches_calculator(args)

model.denoising.p2p_communication module

model.denoising.p2p_communication.recv_backward(timers=None): Receive tensor from next rank in pipeline (backward receive).

model.denoising.p2p_communication.recv_forward(tensor_shape=None, override_scatter_gather_tensors_in_pipeline=False, dtype_=None, timers=None): Receive tensor from previous rank in pipeline (forward receive).

model.denoising.p2p_communication.send_backward(input_tensor_grad, timers=None): Send tensor to previous rank in pipeline (backward send).

model.denoising.p2p_communication.send_backward_recv_backward(input_tensor_grad, recv_next, timers=None): Batched recv from next rank and send to previous rank in pipeline.

model.denoising.p2p_communication.send_backward_recv_forward(input_tensor_grad, timers=None): Batched send and recv with previous rank in pipeline.

model.denoising.p2p_communication.send_forward(output_tensor, timers=None, override_scatter_gather_tensors_in_pipeline=False, dtype_=None): Send tensor to next rank in pipeline (forward send).

model.denoising.p2p_communication.send_forward_backward_recv_forward_backward(output_tensor, input_tensor_grad, recv_prev, recv_next, timers=None): Batched send and recv with previous and next ranks in pipeline.

model.denoising.p2p_communication.send_forward_recv_backward(output_tensor, timers=None): Batched send and recv with next rank in pipeline.

model.denoising.p2p_communication.send_forward_recv_forward(output_tensor, recv_prev, timers=None): Batched recv from previous rank and send to next rank in pipeline.

model.denoising.package_info module

model.denoising.schedules module

model.denoising.schedules.backward_step(optimizer, input_tensor, output_tensor, output_tensor_grad)

Backward step through passed-in output tensor.

If last stage, output_tensor_grad is None, otherwise gradient of loss with respect to stage’s output tensor.

Returns gradient of loss with respect to input tensor (None if first stage).

model.denoising.schedules.dummy_handler()

model.denoising.schedules.forward_backward_no_pipelining(forward_step_func, data_iterator, model, optimizer, timers, forward_only, test_only)

Run forward and backward passes with no pipeline parallelism (no inter-stage communication).

Returns dictionary with losses.

model.denoising.schedules.forward_backward_pipelining_with_interleaving(forward_step_func, data_iterator, model, optimizer, timers, forward_only)

Run interleaved 1F1B schedule (model split into model chunks), with communication between pipeline stages as needed.

Returns dictionary with losses if the last stage, empty dict otherwise.

model.denoising.schedules.forward_backward_pipelining_without_interleaving(forward_step_func, data_iterator, model, optimizer, timers, forward_only)

Run non-interleaved 1F1B schedule, with communication between pipeline stages.

Returns dictionary with losses if the last stage, empty dict otherwise.
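The order of operations in a non-interleaved 1F1B ("one forward, one backward") schedule can be sketched as follows. This assumes the common pattern of warmup forwards, steady-state forward/backward pairs, and cooldown backwards; the function name one_f_one_b_order and the warmup-count formula are assumptions, and no communication or actual computation is modelled.

```python
def one_f_one_b_order(stage, num_stages, num_microbatches):
    """List the (op, microbatch) sequence one pipeline stage would run."""
    # Earlier stages need more forwards in flight before the first backward.
    num_warmup = min(num_stages - stage - 1, num_microbatches)
    num_steady = num_microbatches - num_warmup
    ops = []
    fwd = bwd = 0
    for _ in range(num_warmup):       # warmup: forwards only
        ops.append(("F", fwd))
        fwd += 1
    for _ in range(num_steady):       # steady state: one forward, one backward
        ops.append(("F", fwd))
        fwd += 1
        ops.append(("B", bwd))
        bwd += 1
    for _ in range(num_warmup):       # cooldown: drain remaining backwards
        ops.append(("B", bwd))
        bwd += 1
    return ops


# Usage: 4 stages, 4 micro-batches.
first_stage = one_f_one_b_order(0, 4, 4)
last_stage = one_f_one_b_order(3, 4, 4)
```

In this sketch the last stage alternates forward and backward from the start, while the first stage runs three warmup forwards before its first backward.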

model.denoising.schedules.forward_step(forward_step_func, data_iterator, model, input_tensor, losses_reduced)

Forward step for passed-in model.

If first stage, input tensor is obtained from data_iterator, otherwise passed-in input_tensor is used.

Returns output tensor.

model.denoising.schedules.forward_step_wrapper(forward_step_func, data_iterator, model, input_tensor, losses_reduced, test_only)

Forward step for passed-in model.

If first stage, input tensor is obtained from data_iterator, otherwise passed-in input_tensor is used.

Returns output tensor.

model.denoising.schedules.get_forward_backward_func()

model.denoising.training module

Pretrain utilities.

model.denoising.training.build_train_valid_test_data_iterators(build_train_valid_test_datasets_provider): XXX

model.denoising.training.cyclic_iter(iter)
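cyclic_iter has no description in the listing; judging by its name, it presumably turns a finite iterable into an endless one. A minimal sketch under that assumption (cyclic_iter_sketch is a hypothetical name):

```python
import itertools


def cyclic_iter_sketch(iterable):
    """Yield the items of a re-iterable object over and over."""
    while True:
        for item in iterable:
            yield item


# Usage: take seven items from a three-item list.
first_seven = list(itertools.islice(cyclic_iter_sketch([1, 2, 3]), 7))
```

The argument must be re-iterable (a list or a data loader, not a one-shot generator) and non-empty, otherwise the loop never yields again.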

model.denoising.training.evaluate(forward_step_func, data_iterator, model, verbose=False): Evaluation.

model.denoising.training.evaluate_and_print_results(prefix, forward_step_func, data_iterator, model, iteration, verbose=False): Helper function to evaluate and dump results on screen.

model.denoising.training.get_learning_rate_scheduler(optimizer): Build the learning rate scheduler.

model.denoising.training.get_model(model_provider_func): Build the model.

model.denoising.training.pretrain(train_valid_test_dataset_provider, model_provider, forward_step_func, extra_args_provider=None, args_defaults={})

Main training program.

This function runs the following steps in the order given:

initialize Megatron.
set up the model, optimizer and lr schedule using the model_provider.
call train_valid_test_dataset_provider to get the train/valid/test datasets.
train the model using the forward_step_func.

Parameters:

train_valid_test_dataset_provider – a function that takes the size of train/valid/test dataset and returns train, valid, test datasets.
model_provider – a function that returns a vanilla version of the model. By vanilla we mean a simple model on cpu with no fp16 or ddp.
forward_step_func – a function that takes a data iterator and model, and returns a loss scalar together with a dictionary whose key: value pairs are the quantities to monitor during training, for example lm-loss: value. This function is also required to add the batch generator to the timers class.
extra_args_provider – a function that takes a parser and adds arguments to it. It is used for programs to add their own arguments.
args_defaults – a dictionary from argument-name to argument-value. It is used to set already-parsed arguments.
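The contracts of the callables passed to pretrain can be illustrated with toy stand-ins. Everything below is hypothetical: the function names, the toy data, the --toy-noise-level flag, and the convention that the extra-args provider returns the parser are assumptions made for the sketch. It does not call pretrain and does not touch the timers.

```python
import argparse


def toy_dataset_provider(train_val_test_num_samples):
    """Take the three dataset sizes and return train, valid, test datasets."""
    def make(n):
        return [(float(i), 2.0 * float(i)) for i in range(n)]
    train_n, valid_n, test_n = train_val_test_num_samples
    return make(train_n), make(valid_n), make(test_n)


def toy_model_provider():
    """Return a plain model; here just a function computing 2 * x."""
    return lambda x: 2.0 * x


def toy_forward_step(data_iterator, model):
    """Return a loss scalar and a dictionary of values to monitor."""
    batch = next(data_iterator)
    errors = [(model(x) - y) ** 2 for x, y in batch]
    loss = sum(errors) / len(errors)
    return loss, {"lm-loss": loss}


def toy_extra_args_provider(parser):
    """Take a parser and add program-specific arguments to it."""
    group = parser.add_argument_group(title="toy arguments")
    group.add_argument("--toy-noise-level", type=float, default=0.1)
    return parser  # returning the parser is an assumption of this sketch


# Usage: exercise each callable on its own.
parser = toy_extra_args_provider(argparse.ArgumentParser())
args = parser.parse_args(["--toy-noise-level", "0.3"])
train_ds, valid_ds, test_ds = toy_dataset_provider([4, 2, 1])
loss, info = toy_forward_step(iter([train_ds]), toy_model_provider())
```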

model.denoising.training.print_datetime(string): Note that this call will sync across all ranks.

model.denoising.training.save_checkpoint_and_time(iteration, model, optimizer, lr_scheduler)

model.denoising.training.setup_model_and_optimizer(model_provider_func): Setup model and optimizer.

model.denoising.training.train(forward_step_func, model, optimizer, lr_scheduler, train_data_iterator, valid_data_iterator, test_data_iterator): Train the model.

model.denoising.training.train_step(forward_step_func, data_iterator, model, optimizer, lr_scheduler): Single training step.

model.denoising.training.training_log(loss_dict, total_loss_dict, learning_rate, iteration, loss_scale, report_memory_flag, skipped_iter, grad_norm, params_norm, num_zeros_in_grad): Log training information such as losses and timing.

model.denoising.training.update_train_iters(args)

model.denoising.utils module

General utilities.

model.denoising.utils.average_losses_across_data_parallel_group(losses): Reduce a tensor of losses across all GPUs.

model.denoising.utils.calc_params_l2_norm(model): Calculate the l2 norm of the parameters.

model.denoising.utils.check_adlr_autoresume_termination(iteration, model, optimizer, lr_scheduler): Check for autoresume signal and exit if it is received.

model.denoising.utils.get_ltor_masks_and_position_ids(data, eod_token, reset_position_ids, reset_attention_mask, eod_mask_loss): Build masks and position IDs for a left-to-right model.
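What "masks and position IDs for a left-to-right model" could look like is sketched below for a single sequence, using plain lists instead of tensors. This is an assumed reading of the flags, not the package's code: the function name is hypothetical, a mask entry of 1 means "may attend", and end-of-document (eod) tokens are taken to mark document boundaries inside one packed sequence.

```python
def ltor_masks_and_position_ids_sketch(tokens, eod_token, reset_position_ids,
                                       reset_attention_mask, eod_mask_loss):
    n = len(tokens)
    # Causal mask: query position q may attend to key positions k <= q.
    attention_mask = [[1 if k <= q else 0 for k in range(n)] for q in range(n)]
    # Optionally exclude eod tokens from the loss.
    loss_mask = [0.0 if (eod_mask_loss and t == eod_token) else 1.0
                 for t in tokens]
    position_ids = list(range(n))

    prev_index = 0
    for i, token in enumerate(tokens):
        if token != eod_token:
            continue
        if reset_attention_mask:
            # Tokens after this eod cannot see anything up to and including it.
            for q in range(i + 1, n):
                for k in range(i + 1):
                    attention_mask[q][k] = 0
        if reset_position_ids:
            # Restart position numbering after this eod.
            for q in range(i + 1, n):
                position_ids[q] -= i + 1 - prev_index
            prev_index = i + 1
    return attention_mask, loss_mask, position_ids


# Usage: token 0 is the eod token and splits the sequence into two documents.
mask, loss_mask, pos = ltor_masks_and_position_ids_sketch(
    [5, 6, 0, 7, 8], eod_token=0, reset_position_ids=True,
    reset_attention_mask=True, eod_mask_loss=True)
```

With both reset flags off, the result is an ordinary lower-triangular mask and positions 0 to n-1.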

model.denoising.utils.print_params_min_max_norm(optimizer, iteration): Print min, max, and norm of all parameters.

model.denoising.utils.report_memory(name): Simple GPU memory report.

model.denoising.utils.unwrap_model(model, module_instances=torch.nn.parallel.DistributedDataParallel)

Module contents

model.denoising.is_last_rank()

model.denoising.print_rank_0(message): If distributed is initialized, print only on rank 0.

model.denoising.print_rank_last(message): If distributed is initialized, print only on last rank.