model.denoising package

Subpackages

Submodules

model.denoising.arguments module

Megatron arguments.

model.denoising.arguments.parse_args(extra_args_provider=None, defaults={}, ignore_unknown_args=False)

Parse all arguments.
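
The three parameters can be illustrated with a minimal argparse-based sketch. It assumes the parser is built on argparse; parse_args_sketch, add_my_args, the argv parameter, and the --seed and --mask-prob flags are hypothetical names used only for illustration, and the real function may behave differently.

```python
import argparse


def parse_args_sketch(extra_args_provider=None, defaults=None,
                      ignore_unknown_args=False, argv=None):
    """Illustrative stand-in for parse_args; not the real implementation."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--seed", type=int, default=None)  # hypothetical flag

    # Let the calling program register its own arguments on the parser.
    if extra_args_provider is not None:
        parser = extra_args_provider(parser)

    # Either tolerate or reject flags the parser does not know about.
    if ignore_unknown_args:
        args, _unknown = parser.parse_known_args(argv)
    else:
        args = parser.parse_args(argv)

    # Apply defaults only where the command line left a value unset.
    for key, value in (defaults or {}).items():
        if getattr(args, key, None) is None:
            setattr(args, key, value)
    return args


def add_my_args(parser):
    """Hypothetical extra_args_provider: takes a parser and returns it."""
    parser.add_argument("--mask-prob", type=float, default=None)
    return parser


args = parse_args_sketch(add_my_args, {"seed": 1234},
                         ignore_unknown_args=True,
                         argv=["--mask-prob", "0.15", "--unknown", "1"])
print(args.seed, args.mask_prob)
```

The sketch uses defaults=None rather than a mutable {} default so that no dictionary is shared between calls.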

model.denoising.checkpointing module

Input/output checkpointing.

model.denoising.checkpointing.check_checkpoint_args(checkpoint_args)

Ensure fixed arguments for a model are the same for the input arguments and the one retrieved from checkpoint.

model.denoising.checkpointing.ensure_directory_exists(filename)

Build filename’s path if it does not already exist.
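
A minimal sketch of this behavior using only the standard library. It assumes the function creates the parent directory of filename and does not create the file itself; ensure_directory_exists_sketch and the example path are hypothetical.

```python
import os
import tempfile


def ensure_directory_exists_sketch(filename):
    """Create the directory part of `filename` if it is missing."""
    dirname = os.path.dirname(filename)
    if dirname:
        # exist_ok=True makes repeated calls harmless.
        os.makedirs(dirname, exist_ok=True)


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "some_run", "model.pt")  # hypothetical path
    ensure_directory_exists_sketch(target)
    created = os.path.isdir(os.path.dirname(target))
    file_exists = os.path.exists(target)
```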

model.denoising.checkpointing.fix_query_key_value_ordering(model, checkpoint_version)

Fix up query/key/value matrix ordering if the checkpoint version is smaller than 2.0.

model.denoising.checkpointing.get_checkpoint_name(checkpoints_path, iteration, release=False)

A unified checkpoint name.

model.denoising.checkpointing.get_checkpoint_tracker_filename(checkpoints_path)

The tracker file records the latest checkpoint during training so that training can restart from it.
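
The idea of a tracker file can be sketched as follows. The file name, the plain-integer file format, and every helper name below are assumptions made for illustration; they are not taken from this package.

```python
import os
import tempfile

TRACKER_NAME = "latest_iteration.txt"  # hypothetical file name


def tracker_filename_sketch(checkpoints_path):
    """Path of the tracker file inside the checkpoint directory."""
    return os.path.join(checkpoints_path, TRACKER_NAME)


def record_iteration(checkpoints_path, iteration):
    """Overwrite the tracker with the most recently saved iteration."""
    with open(tracker_filename_sketch(checkpoints_path), "w") as f:
        f.write(str(iteration))


def read_latest_iteration(checkpoints_path):
    """Return the iteration to restart from, or 0 if nothing was saved."""
    path = tracker_filename_sketch(checkpoints_path)
    if not os.path.isfile(path):
        return 0
    with open(path) as f:
        return int(f.read().strip())


with tempfile.TemporaryDirectory() as ckpt_dir:
    first = read_latest_iteration(ckpt_dir)
    record_iteration(ckpt_dir, 500)
    record_iteration(ckpt_dir, 1000)
    latest = read_latest_iteration(ckpt_dir)
```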

model.denoising.checkpointing.get_checkpoint_version()
model.denoising.checkpointing.load_biencoder_checkpoint(model, only_query_model=False, only_context_model=False, custom_load_path=None)

Selectively load retrieval models for indexing/retrieving from saved checkpoints.

model.denoising.checkpointing.load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True)

Load a model checkpoint and return the iteration.

strict (bool): whether to strictly enforce that the keys in the state_dict of the checkpoint match the names of parameters and buffers in the model.
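
What strict key matching means can be shown with plain lists of names. This is a sketch of the comparison only, in the spirit of the strict flag of torch.nn.Module.load_state_dict; check_keys_sketch and the key names are hypothetical, and the sketch raises KeyError for simplicity.

```python
def check_keys_sketch(model_keys, checkpoint_keys, strict=True):
    """Compare parameter names in the model with those in a checkpoint."""
    missing = sorted(set(model_keys) - set(checkpoint_keys))
    unexpected = sorted(set(checkpoint_keys) - set(model_keys))
    # In strict mode any mismatch is an error; otherwise it is only reported.
    if strict and (missing or unexpected):
        raise KeyError(f"missing={missing}, unexpected={unexpected}")
    return missing, unexpected


model_keys = ["encoder.weight", "encoder.bias", "head.weight"]
checkpoint_keys = ["encoder.weight", "encoder.bias", "old_head.weight"]

missing, unexpected = check_keys_sketch(model_keys, checkpoint_keys,
                                        strict=False)
print(missing, unexpected)
```

With strict=False the mismatched names are returned so the caller can decide what to do; with strict=True the same inputs raise.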

model.denoising.checkpointing.save_checkpoint(iteration, model, optimizer, lr_scheduler)

Save a model checkpoint.

model.denoising.checkpointing.set_checkpoint_version(value)

model.denoising.global_vars module

Megatron global variables.

class model.denoising.global_vars.Timers

Bases: object

Group of timers.

log(names, normalizer=1.0, reset=True)

Log a group of timers.

write(names, writer, iteration, normalizer=1.0, reset=False)

Write timers to a tensorboard writer.
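
A group of named timers with a log method could look roughly like the sketch below. It assumes that normalizer divides the elapsed time and that reset clears the accumulated value; TimerSketch, TimersSketch, and the output format are hypothetical.

```python
import time


class TimerSketch:
    """Accumulates wall-clock time between start() and stop() calls."""

    def __init__(self):
        self.elapsed_ = 0.0
        self.start_time = None

    def start(self):
        self.start_time = time.perf_counter()

    def stop(self):
        self.elapsed_ += time.perf_counter() - self.start_time
        self.start_time = None

    def elapsed(self, reset=True):
        value = self.elapsed_
        if reset:
            self.elapsed_ = 0.0
        return value


class TimersSketch:
    """Group of timers addressed by name."""

    def __init__(self):
        self.timers = {}

    def __call__(self, name):
        # Create the timer on first use, then return the same one.
        if name not in self.timers:
            self.timers[name] = TimerSketch()
        return self.timers[name]

    def log(self, names, normalizer=1.0, reset=True):
        parts = []
        for name in names:
            ms = self.timers[name].elapsed(reset=reset) * 1000.0 / normalizer
            parts.append(f"{name}: {ms:.2f}")
        line = "time (ms) | " + " | ".join(parts)
        print(line)
        return line


timers = TimersSketch()
timers("forward").start()
sum(range(1000))  # stand-in for real work
timers("forward").stop()
line = timers.log(["forward"], normalizer=10.0)
```

Passing the number of iterations as normalizer gives an average time per iteration.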

model.denoising.global_vars.get_adlr_autoresume()

ADLR autoresume object. It can be None so no need to check if it is initialized.

model.denoising.global_vars.get_args()

Return arguments.

model.denoising.global_vars.get_current_global_batch_size()
model.denoising.global_vars.get_num_microbatches()
model.denoising.global_vars.get_tensorboard_writer()

Return tensorboard writer. It can be None so no need to check if it is initialized.

model.denoising.global_vars.get_timers()

Return timers.

model.denoising.global_vars.get_tokenizer()

Return tokenizer.

model.denoising.global_vars.set_global_variables(extra_args_provider=None, args_defaults={}, ignore_unknown_args=False)

Set args, tokenizer, tensorboard-writer, adlr-autoresume, and timers.

model.denoising.global_vars.update_num_microbatches(consumed_samples, consistency_check=True)

model.denoising.initialize module

Megatron initialization.

model.denoising.initialize.initialize_megatron(extra_args_provider=None, args_defaults={}, ignore_unknown_args=False, allow_no_cuda=False)

Set global variables, initialize distributed, and set autoresume and random seeds.

allow_no_cuda should not be set unless using Megatron for CPU-only data processing. In general, this argument should not be set unless you know what you are doing.

Returns a function to finalize distributed environment initialization (optionally, only when args.lazy_mpu_init == True).

model.denoising.initialize.write_args_to_tensorboard()

Write arguments to tensorboard.

model.denoising.learning_rates module

Learning rate decay functions.

class model.denoising.learning_rates.AnnealingLR(optimizer, max_lr, min_lr, warmup_steps, decay_steps, decay_style, use_checkpoint_lr_scheduler=True, override_lr_scheduler=False)

Bases: object

Anneals the learning rate.

get_lr()

Learning rate decay functions from: https://openreview.net/pdf?id=BJYwwY9ll pg. 4

load_state_dict(sd)
state_dict()
step(increment)

Set the learning rate for all parameter groups.
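
The constructor parameters suggest a warmup phase followed by a decay from max_lr to min_lr. The sketch below shows one common way to combine them (linear warmup, then linear or cosine decay). It is an assumption about the general shape of such a schedule, not the exact formula used by AnnealingLR; annealed_lr_sketch is a hypothetical name.

```python
import math


def annealed_lr_sketch(step, max_lr, min_lr, warmup_steps, decay_steps,
                       decay_style="cosine"):
    """Learning rate at `step`; assumes decay_steps > warmup_steps."""
    # Linear warmup from 0 to max_lr.
    if warmup_steps > 0 and step <= warmup_steps:
        return max_lr * step / warmup_steps
    # After the decay window the rate stays at its floor.
    if step > decay_steps:
        return min_lr
    # Fraction of the decay window completed, in [0, 1].
    ratio = (step - warmup_steps) / (decay_steps - warmup_steps)
    if decay_style == "linear":
        coeff = 1.0 - ratio
    elif decay_style == "cosine":
        coeff = 0.5 * (math.cos(math.pi * ratio) + 1.0)
    else:
        raise ValueError(f"unknown decay style: {decay_style}")
    return min_lr + coeff * (max_lr - min_lr)


lr_start = annealed_lr_sketch(0, 1e-3, 1e-5, 100, 1000)
lr_peak = annealed_lr_sketch(100, 1e-3, 1e-5, 100, 1000)
lr_mid = annealed_lr_sketch(550, 1e-3, 1e-5, 100, 1000, "linear")
lr_end = annealed_lr_sketch(1000, 1e-3, 1e-5, 100, 1000)
```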

model.denoising.memory module

class model.denoising.memory.MemoryBuffer(name, numel, dtype, track_usage)

Bases: object

Contiguous memory buffer. Allocate a contiguous memory of type dtype and size numel. It is used to reduce memory fragmentation.

Usage: After the allocation, the _start index is set to the first index of the memory. A memory chunk starting from the _start index can be allocated for an input tensor, with the elements of the tensor being copied. The buffer can be reused by resetting the _start index.

add(tensor)

Allocate a chunk of memory from the buffer to tensor and copy the values.

get_data()

Return the data currently in use.

is_in_use()

Whether the current buffer holds on to any memory.

numel_in_use()

Return number of elements in use.

print_average_usage()

Print memory usage average over time. We would like this value to be as high as possible.

reset()

Reset the buffer start index to the beginning of the buffer.
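
The _start bookkeeping described for MemoryBuffer can be sketched with a plain Python list standing in for the tensor storage. MemoryBufferSketch is hypothetical; it omits the name, dtype, and track_usage arguments and returns index ranges instead of tensor views.

```python
class MemoryBufferSketch:
    """Bump allocator over one pre-allocated list."""

    def __init__(self, numel, fill=0.0):
        self.numel = numel
        self.data = [fill] * numel  # single up-front allocation
        self._start = 0             # first free index

    def add(self, values):
        """Copy `values` into the next free chunk and return its range."""
        end = self._start + len(values)
        if end > self.numel:
            raise MemoryError("not enough space left in the buffer")
        self.data[self._start:end] = values
        chunk = (self._start, end)
        self._start = end
        return chunk

    def numel_in_use(self):
        return self._start

    def is_in_use(self):
        return self._start > 0

    def get_data(self):
        return self.data[:self._start]

    def reset(self):
        # Old contents are not cleared; they are simply overwritten later.
        self._start = 0


buf = MemoryBufferSketch(8)
first_chunk = buf.add([1.0, 2.0, 3.0])
second_chunk = buf.add([4.0, 5.0])
in_use = buf.numel_in_use()
contents = buf.get_data()
buf.reset()
still_in_use = buf.is_in_use()
```

Because reset only moves the index, reusing the buffer costs nothing and never triggers a new allocation.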

class model.denoising.memory.RingMemBuffer(name, num_buffers, numel, dtype, track_usage)

Bases: object

A ring of memory buffers.

get_next_buffer()
model.denoising.memory.allocate_mem_buff(name, numel, dtype, track_usage)

Allocate a memory buffer.

model.denoising.memory.get_mem_buff(name)

Get the memory buffer.

model.denoising.microbatches module

Megatron number of micro-batches calculators.

class model.denoising.microbatches.ConstantNumMicroBatches(global_batch_size, micro_batch_size, data_parallel_size)

Bases: NumMicroBatchesCalculator

update(consumed_samples, consistency_check)
class model.denoising.microbatches.NumMicroBatchesCalculator

Bases: ABC

get()
get_current_global_batch_size()
abstract update(consumed_samples, consistency_check)
class model.denoising.microbatches.RampupBatchsizeNumMicroBatches(start_batch_size, batch_size_increment, ramup_samples, global_batch_size, micro_batch_size, data_parallel_size)

Bases: NumMicroBatchesCalculator

update(consumed_samples, consistency_check)
model.denoising.microbatches.build_num_microbatches_calculator(args)
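
The arithmetic behind the two calculators can be sketched from their constructor parameters. The formulas below are assumptions based on those parameter names (the global batch is split into micro-batches across data-parallel replicas, and the ramp-up grows the global batch size in equal increments); the function names are hypothetical.

```python
def constant_num_microbatches(global_batch_size, micro_batch_size,
                              data_parallel_size):
    """Micro-batches each replica runs per step for a fixed global batch."""
    per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % per_pass != 0:
        raise ValueError("global batch size must be divisible by "
                         "micro_batch_size * data_parallel_size")
    return global_batch_size // per_pass


def rampup_global_batch_size(consumed_samples, start_batch_size,
                             batch_size_increment, ramup_samples,
                             global_batch_size):
    """Global batch size after `consumed_samples` during a linear ramp-up."""
    if consumed_samples > ramup_samples:
        return global_batch_size
    num_increments = ((global_batch_size - start_batch_size)
                      // batch_size_increment)
    samples_per_increment = ramup_samples / num_increments
    steps = int(consumed_samples / samples_per_increment)
    current = start_batch_size + steps * batch_size_increment
    return min(current, global_batch_size)


n_micro = constant_num_microbatches(512, 4, 8)
sizes = [rampup_global_batch_size(c, 32, 32, 1000, 128)
         for c in (0, 400, 700, 2000)]
```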

model.denoising.p2p_communication module

model.denoising.p2p_communication.recv_backward(timers=None)

Receive tensor from next rank in pipeline (backward receive).

model.denoising.p2p_communication.recv_forward(tensor_shape=None, override_scatter_gather_tensors_in_pipeline=False, dtype_=None, timers=None)

Receive tensor from previous rank in pipeline (forward receive).

model.denoising.p2p_communication.send_backward(input_tensor_grad, timers=None)

Send tensor to previous rank in pipeline (backward send).

model.denoising.p2p_communication.send_backward_recv_backward(input_tensor_grad, recv_next, timers=None)

Batched recv from next rank and send to previous rank in pipeline.

model.denoising.p2p_communication.send_backward_recv_forward(input_tensor_grad, timers=None)

Batched send and recv with previous rank in pipeline.

model.denoising.p2p_communication.send_forward(output_tensor, timers=None, override_scatter_gather_tensors_in_pipeline=False, dtype_=None)

Send tensor to next rank in pipeline (forward send).

model.denoising.p2p_communication.send_forward_backward_recv_forward_backward(output_tensor, input_tensor_grad, recv_prev, recv_next, timers=None)

Batched send and recv with previous and next ranks in pipeline.

model.denoising.p2p_communication.send_forward_recv_backward(output_tensor, timers=None)

Batched send and recv with next rank in pipeline.

model.denoising.p2p_communication.send_forward_recv_forward(output_tensor, recv_prev, timers=None)

Batched recv from previous rank and send to next rank in pipeline.
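
The direction of each call can be illustrated with an in-process simulation in which queues stand in for the real inter-rank communication. StageSketch and its links dictionary are hypothetical. The sketch assumes that the first stage has nothing to receive in the forward direction and the last stage has nothing to receive in the backward direction, so those calls return None.

```python
from collections import deque


class StageSketch:
    """One pipeline stage; `links` maps (src, dst) to a queue of messages."""

    def __init__(self, rank, world_size, links):
        self.rank = rank
        self.world_size = world_size
        self.links = links

    def _send(self, dst, item):
        self.links.setdefault((self.rank, dst), deque()).append(item)

    def _recv(self, src):
        return self.links[(src, self.rank)].popleft()

    def send_forward(self, output_tensor):
        # Activations flow to the next rank; the last stage has none.
        if self.rank < self.world_size - 1:
            self._send(self.rank + 1, output_tensor)

    def recv_forward(self):
        if self.rank == 0:
            return None
        return self._recv(self.rank - 1)

    def send_backward(self, input_tensor_grad):
        # Gradients flow to the previous rank; the first stage has none.
        if self.rank > 0:
            self._send(self.rank - 1, input_tensor_grad)

    def recv_backward(self):
        if self.rank == self.world_size - 1:
            return None
        return self._recv(self.rank + 1)


links = {}
stage0 = StageSketch(0, 2, links)
stage1 = StageSketch(1, 2, links)

stage0.send_forward("activations")
received_fwd = stage1.recv_forward()
stage1.send_backward("gradients")
received_bwd = stage0.recv_backward()
first_stage_input = stage0.recv_forward()
```

The batched variants listed above pair one send with one receive in a single call; the sketch keeps them separate for clarity.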

model.denoising.package_info module

model.denoising.schedules module

model.denoising.schedules.backward_step(optimizer, input_tensor, output_tensor, output_tensor_grad)

Backward step through passed-in output tensor.

If last stage, output_tensor_grad is None, otherwise gradient of loss with respect to stage’s output tensor.

Returns gradient of loss with respect to input tensor (None if first stage).

model.denoising.schedules.dummy_handler()
model.denoising.schedules.forward_backward_no_pipelining(forward_step_func, data_iterator, model, optimizer, timers, forward_only, test_only)

Run forward and backward passes with no pipeline parallelism (no inter-stage communication).

Returns dictionary with losses.

model.denoising.schedules.forward_backward_pipelining_with_interleaving(forward_step_func, data_iterator, model, optimizer, timers, forward_only)

Run interleaved 1F1B schedule (model split into model chunks), with communication between pipeline stages as needed.

Returns dictionary with losses if the last stage, empty dict otherwise.

model.denoising.schedules.forward_backward_pipelining_without_interleaving(forward_step_func, data_iterator, model, optimizer, timers, forward_only)

Run non-interleaved 1F1B schedule, with communication between pipeline stages.

Returns dictionary with losses if the last stage, empty dict otherwise.
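
The order of work on one stage under a non-interleaved 1F1B (one forward, one backward) schedule can be sketched as below. The sketch only enumerates the order of forward ("F") and backward ("B") passes per micro-batch and leaves out all communication; one_f_one_b_order is a hypothetical helper, and the warmup count is the usual choice for this schedule rather than something stated in this reference.

```python
def one_f_one_b_order(stage, num_stages, num_microbatches):
    """Sequence of ("F" | "B", microbatch_index) executed by one stage."""
    # Earlier stages run more forward passes before their first backward.
    warmup = min(num_stages - stage - 1, num_microbatches)
    remaining = num_microbatches - warmup

    order = [("F", i) for i in range(warmup)]
    fwd, bwd = warmup, 0
    # Steady state: alternate one forward with one backward.
    for _ in range(remaining):
        order.append(("F", fwd))
        fwd += 1
        order.append(("B", bwd))
        bwd += 1
    # Cooldown: drain the backward passes still outstanding.
    for _ in range(warmup):
        order.append(("B", bwd))
        bwd += 1
    return order


first_stage = one_f_one_b_order(stage=0, num_stages=2, num_microbatches=3)
last_stage = one_f_one_b_order(stage=1, num_stages=2, num_microbatches=3)
```

In this sketch the last stage alternates strictly between forward and backward, while earlier stages keep a few extra micro-batches in flight.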

model.denoising.schedules.forward_step(forward_step_func, data_iterator, model, input_tensor, losses_reduced)

Forward step for passed-in model.

If first stage, input tensor is obtained from data_iterator, otherwise passed-in input_tensor is used.

Returns output tensor.

model.denoising.schedules.forward_step_wrapper(forward_step_func, data_iterator, model, input_tensor, losses_reduced, test_only)

Forward step for passed-in model.

If first stage, input tensor is obtained from data_iterator, otherwise passed-in input_tensor is used.

Returns output tensor.

model.denoising.schedules.get_forward_backward_func()

model.denoising.training module

Pretrain utilities.

model.denoising.training.build_train_valid_test_data_iterators(build_train_valid_test_datasets_provider)

XXX

model.denoising.training.cyclic_iter(iter)
model.denoising.training.evaluate(forward_step_func, data_iterator, model, verbose=False)

Evaluation.

model.denoising.training.evaluate_and_print_results(prefix, forward_step_func, data_iterator, model, iteration, verbose=False)

Helper function to evaluate and dump results on screen.

model.denoising.training.get_learning_rate_scheduler(optimizer)

Build the learning rate scheduler.

model.denoising.training.get_model(model_provider_func)

Build the model.

model.denoising.training.pretrain(train_valid_test_dataset_provider, model_provider, forward_step_func, extra_args_provider=None, args_defaults={})

Main training program.

This function runs the following steps in order:
  1. initialize Megatron.

  2. set up the model, optimizer and lr schedule using the model_provider.

  3. call train_valid_test_dataset_provider to get train/val/test datasets.

  4. train the model using the forward_step_func.

Parameters:
  • train_valid_test_dataset_provider – a function that takes the size of train/valid/test dataset and returns train, valid, test datasets.

  • model_provider – a function that returns a vanilla version of the model. By vanilla we mean a simple model on CPU with no fp16 or DDP.

  • forward_step_func – a function that takes a data iterator and model, and returns a scalar loss together with a dictionary whose key:value pairs hold the information to monitor during training, for example lm-loss: value. This function must also add the batch generator to the timers class.

  • extra_args_provider – a function that takes a parser and adds arguments to it. It is used for programs to add their own arguments.

  • args_defaults – a dictionary from argument-name to argument-value. It is used to set already parsed arguments.
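
How the three callables fit together can be sketched with toy stand-ins. The sketch skips initialization, the optimizer, and the lr schedule, and only shows the calling convention described above; pretrain_sketch, the toy providers, the train_iters parameter, and the shape of the size argument are hypothetical.

```python
def pretrain_sketch(train_valid_test_dataset_provider, model_provider,
                    forward_step_func, train_iters=3):
    """Toy driver showing how the three callables are used."""
    # Steps 1 and 2 (initialization, optimizer, lr schedule) are omitted.
    model = model_provider()

    # Step 3: ask the provider for datasets of the requested sizes.
    train_ds, _valid_ds, _test_ds = train_valid_test_dataset_provider(
        (train_iters, 0, 0))

    # Step 4: each step pulls a batch through forward_step_func.
    data_iterator = iter(train_ds)
    history = []
    for _ in range(train_iters):
        _loss, stats = forward_step_func(data_iterator, model)
        history.append(stats)  # no parameter update in this sketch
    return history


def toy_dataset_provider(sizes):
    """Return train, valid, test datasets of the requested sizes."""
    return tuple(list(range(n)) for n in sizes)


def toy_model_provider():
    return {"w": 2.0}


def toy_forward_step(data_iterator, model):
    """Return a scalar loss and a dictionary of values to monitor."""
    x = next(data_iterator)
    loss = (model["w"] * x - x) ** 2
    return loss, {"lm-loss": loss}


history = pretrain_sketch(toy_dataset_provider, toy_model_provider,
                          toy_forward_step)
```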

model.denoising.training.print_datetime(string)

Note that this call will sync across all ranks.

model.denoising.training.save_checkpoint_and_time(iteration, model, optimizer, lr_scheduler)
model.denoising.training.setup_model_and_optimizer(model_provider_func)

Setup model and optimizer.

model.denoising.training.train(forward_step_func, model, optimizer, lr_scheduler, train_data_iterator, valid_data_iterator, test_data_iterator)

Train the model.

model.denoising.training.train_step(forward_step_func, data_iterator, model, optimizer, lr_scheduler)

Single training step.

model.denoising.training.training_log(loss_dict, total_loss_dict, learning_rate, iteration, loss_scale, report_memory_flag, skipped_iter, grad_norm, params_norm, num_zeros_in_grad)

Log training information such as losses, timing, etc.

model.denoising.training.update_train_iters(args)

model.denoising.utils module

General utilities.

model.denoising.utils.average_losses_across_data_parallel_group(losses)

Reduce a tensor of losses across all GPUs.

model.denoising.utils.calc_params_l2_norm(model)

Calculate the L2 norm of parameters.
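
The L2 norm over all parameters is the square root of the sum of squared entries across every parameter tensor. A minimal sketch with nested lists standing in for tensors follows; params_l2_norm_sketch is a hypothetical name, and the sketch ignores any handling of parameters shared or split across ranks.

```python
import math


def params_l2_norm_sketch(params):
    """Global L2 norm over a list of flat parameter lists."""
    return math.sqrt(sum(v * v for tensor in params for v in tensor))


norm = params_l2_norm_sketch([[3.0], [0.0, 4.0]])
```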

model.denoising.utils.check_adlr_autoresume_termination(iteration, model, optimizer, lr_scheduler)

Check for autoresume signal and exit if it is received.

model.denoising.utils.get_ltor_masks_and_position_ids(data, eod_token, reset_position_ids, reset_attention_mask, eod_mask_loss)

Build masks and position ids for a left-to-right model.
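
The effect of the flags can be sketched for a single token sequence using plain lists. The sketch assumes that eod_token marks document boundaries, that the reset flags restart position ids and block attention across those boundaries, and that eod_mask_loss removes the boundary tokens from the loss. The function name and the mask convention (1 means "may attend") are hypothetical and may differ from the real tensors.

```python
def ltor_masks_and_position_ids_sketch(tokens, eod_token,
                                       reset_position_ids,
                                       reset_attention_mask,
                                       eod_mask_loss):
    """Return (attention_mask, loss_mask, position_ids) for one sequence."""
    n = len(tokens)
    # Causal mask: position i may attend to positions j <= i.
    attention = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    loss_mask = [1.0] * n
    position_ids = list(range(n))

    if eod_mask_loss:
        loss_mask = [0.0 if t == eod_token else 1.0 for t in tokens]

    for i, tok in enumerate(tokens):
        if tok != eod_token:
            continue
        next_start = i + 1  # first token of the following document
        if reset_attention_mask:
            # Later documents must not look back past the boundary.
            for row in range(next_start, n):
                for col in range(next_start):
                    attention[row][col] = 0
        if reset_position_ids:
            # Restart counting positions at the new document.
            for k in range(next_start, n):
                position_ids[k] = k - next_start
    return attention, loss_mask, position_ids


attention, loss_mask, position_ids = ltor_masks_and_position_ids_sketch(
    tokens=[5, 6, 0, 7, 8], eod_token=0,
    reset_position_ids=True, reset_attention_mask=True, eod_mask_loss=True)
```

In the example, token 0 ends the first document, so the tokens 7 and 8 get positions 0 and 1 and attend only to each other.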

model.denoising.utils.print_params_min_max_norm(optimizer, iteration)

Print min, max, and norm of all parameters.

model.denoising.utils.report_memory(name)

Simple GPU memory report.

model.denoising.utils.unwrap_model(model, module_instances=torch.nn.parallel.DistributedDataParallel)

Module contents

model.denoising.is_last_rank()
model.denoising.print_rank_0(message)

If distributed is initialized, print only on rank 0.

model.denoising.print_rank_last(message)

If distributed is initialized, print only on last rank.