Training examples of AI-centered models

Signal Classification

First, activate the waveform environment. Then run the train_classify.py script to train your own signal classification model.

$ conda activate waveform
$ cd /workspace/GWAI/demos
$ python train_classify.py

You can modify classify.yaml to define your own training dataset as well as model configurations. For example:

dataset:
save_path: "../datasets/classify/"
fn: emri_asd_test.hdf5
dataloader:
batch_size: 256
num_workers: 8

training:
test_only: False
checkpoint_dir:
gpu: 0
n_epoch: 50
# loss_fn: "bce_with_logits"
loss_fn: "cross_entropy"
optimizer_type: "adam"
optimizer_kwargs:
    lr: 5e-5
    weight_decay: 1e-3
scheduler_type: "plateau"
scheduler_kwargs:
    mode: "min"
    factor: 0.5
    patience: 5
    threshold: 1e-4
result_dir: "./results//${now:%Y-%m-%d}/${now:%H-%M-%S}"
result_fn: "inf_result.npy"
use_wandb: False

net:
input_channels: 2
n_classes: 2
n_hidden: 128
n_levels: 10
kernel_size: 3
num_classes: 2
dropout: 0
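
To make the scheduler_kwargs above more concrete, here is a minimal pure-Python sketch of a reduce-on-plateau rule. It assumes that scheduler_type: "plateau" corresponds to the behaviour of PyTorch's torch.optim.lr_scheduler.ReduceLROnPlateau with its default relative threshold; the source does not confirm this mapping, and the class name PlateauSketch is hypothetical.

```python
import math


class PlateauSketch:
    """Simplified reduce-on-plateau rule (illustration only, not the GWAI code)."""

    def __init__(self, lr, mode="min", factor=0.5, patience=5, threshold=1e-4):
        if mode != "min":
            raise ValueError("this sketch only covers mode='min'")
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = math.inf
        self.bad_epochs = 0

    def step(self, val_loss):
        # An epoch counts as an improvement only if the loss drops by more
        # than a relative margin of `threshold` below the best value so far.
        if val_loss < self.best * (1.0 - self.threshold):
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # After more than `patience` epochs without improvement, shrink lr.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


# Values taken from the example configuration; the constant loss is made up.
sched = PlateauSketch(lr=5e-5, factor=0.5, patience=5, threshold=1e-4)
history = [sched.step(0.69) for _ in range(7)]
print(history)
```

With a validation loss that never improves, the learning rate stays at 5e-5 for the first six epochs and is halved on the seventh, because the first epoch sets the baseline and the next six count as epochs without improvement.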

Training produces a log like the following.

[2024-02-04 10:25:46,915][nn.dataloader][INFO] - Loading data from ../datasets/detection/emri_asd_test.hdf5
Using Adam optimizer, lr=5e-05, weight_decay=0.001
Total parameters: 940.42K
Trainable parameters: 940.42K
Non-trainable parameters: 0
Epoch 1: 100%|██████████| 200/200 [00:01<00:00, 138.53it/s, loss=6.94e-01, acc=0.49]
  0%|          | 0/200 [00:00<?, ?it/s]Time: 0.010484933853149414
100%|██████████| 200/200 [00:00<00:00, 223.66it/s, loss=6.91e-01, acc=0.5050]
[2024-02-04 10:25:54,895][nn.trainer][INFO] - EPOCH 1   : lr=5.00e-05,   train_loss=6.94e-01,    train_acc=0.4900,       val_loss=6.91e-01       valid_acc=0.5050
Epoch 2: 100%|██████████| 200/200 [00:01<00:00, 156.30it/s, loss=6.91e-01, acc=0.50]
  0%|          | 0/200 [00:00<?, ?it/s]Time: 0.010904073715209961
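
As a sanity check on the log above, a first-epoch loss of about 6.9e-01 with an accuracy near 0.5 is what an untrained two-class model would be expected to produce: if the network assigns equal probability to both classes, the cross-entropy is ln 2, roughly 0.693. The short calculation below illustrates this; the helper name uniform_cross_entropy is hypothetical.

```python
import math


def uniform_cross_entropy(n_classes):
    """Cross-entropy (in nats) of a prediction that is uniform over n_classes."""
    p = 1.0 / n_classes
    return -math.log(p)


chance_loss = uniform_cross_entropy(2)
print(f"{chance_loss:.4f}")
```

A loss that stays near this value over many epochs would suggest the model is not learning to separate the two classes.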

Data Denoising

First, download the demo dataset (train_data, valid_data, test_data) from this repository and put it under the datasets/denoise folder. Then run the denoise_demo.sh script to train your own denoising model.

You can modify the settings in denoise_demo.sh to build a model of a different size.

$ conda activate base
$ cd /workspace/GWAI/demos
$ bash denoise_demo.sh

The training parameters can be modified in denoise_demo.sh, for example:

#!/bin/bash

# Distributed launch settings (single node, two GPUs).
GPUS_PER_NODE=2
MASTER_ADDR=localhost
MASTER_PORT=6066
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DATA_PATH=../dataset/denoise

DETS=H1
CHECKPOINT_PATH=demo

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

# Restrict the job to the listed physical GPUs.
export CUDA_VISIBLE_DEVICES=6,7
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gw.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --num-layers 16 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --micro-batch-size 8 \
    --segment-length 256 \
    --dets $DETS \
    --seq-length 128 \
    --max-position-embeddings 128 \
    --train-iters 30000 \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --data-path $DATA_PATH \
    --data-impl mmap \
    --split 949,50,1 \
    --distributed-backend nccl \
    --lr 0.0001 \
    --lr-decay-style linear \
    --min-lr 1.0e-5 \
    --lr-decay-iters 9900 \
    --weight-decay 1e-2 \
    --clip-grad 1.0 \
    --lr-warmup-fraction .002 \
    --log-interval 1 \
    --save-interval 10000 \
    --eval-interval 1 \
    --dataloader-type cyclic \
    --fp16 \
    --no-binary-head
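
The relationship between the launch settings and the effective batch size can be sketched as follows. This is a back-of-the-envelope calculation, assuming the usual Megatron-style convention that the data-parallel size is the world size divided by the product of the tensor- and pipeline-parallel sizes. The function names are hypothetical, and normalizing the --split weights by their sum is an assumption about how 949,50,1 is interpreted.

```python
def data_parallel_size(world_size, tensor_parallel, pipeline_parallel):
    """Number of model replicas that each process a different micro-batch."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tensor * pipeline size")
    return world_size // model_parallel


def split_fractions(split):
    """Normalize a comma-separated split string such as '949,50,1'."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]


# Values copied from the script above.
gpus_per_node, nnodes, micro_batch_size = 2, 1, 8
world_size = gpus_per_node * nnodes
dp = data_parallel_size(world_size, tensor_parallel=1, pipeline_parallel=1)

# One micro-batch per replica per step gives the global batch size.
global_batch = micro_batch_size * dp
fractions = split_fractions("949,50,1")
print(world_size, dp, global_batch, fractions)
```

With the values from the script this gives a global batch size of 16, which agrees with the "setting global batch size to 16" line in the training log.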

Training produces a log like the following.

using world size: 2, data-parallel-size: 2, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
setting global batch size to 16
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
xxxxxxx
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
> initializing torch distributed ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /workspace/GWAI/demo/../src/model/denoising/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /workspace/GWAI/demo/../src/model/denoising/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /workspace/GWAI/demo/../src/model/denoising/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
>>> done with compiling and loading fused kernels. Compilation time: 3.274 seconds
time to initialize megatron (seconds): 41.829
[after megatron is initialized] datetime: 2024-02-02 15:50:01
building WaveFormer model ...
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 220058673
> learning rate decay style: linear
WARNING: could not find the metadata file demo/latest_checkpointed_iteration.txt
   will not load any checkpoints and will start from random
time (ms) | load-checkpoint: 0.16
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-02-02 15:50:01
> building train, validation, and test datasets ...
> building train, validation, and test datasets for BERT ...
> finished creating BERT datasets ...
[after dataloaders are built] datetime: 2024-02-02 15:50:06
done with setup ...
time (ms) | model-and-optimizer-setup: 111.39 | train/valid/test-data-iterators-setup: 4415.50

training ...
[before the start of training step] datetime: 2024-02-02 15:50:06
iteration        1/   30000 | current time: 1706860208.35 | consumed samples:           16 | elapsed time per iteration (ms): 1996.1 | learning rate: 0.000E+00 | global batch size:    16 | loss scale: 4294967296.0 | number of skipped iterations:   1 | number of nan iterations:   0 |
time (ms) | backward-compute: 138.46 | backward-params-all-reduce: 32.71 | backward-embedding-all-reduce: 0.04 | optimizer-copy-to-main-grad: 3.17 | optimizer-unscale-and-check-inf: 42.67 | optimizer: 45.94 | batch-generator: 263.80
----------------------------------------------------------------------------------------------------
validation loss at iteration 1 | lm loss value: 4.280033E-01 | lm loss PPL: 1.534191E+00 |
--------------------------------------------------------------------------------------------
iteration        2/   30000 | current time: 1706860208.78 | consumed samples:           32 | elapsed time per iteration (ms): 429.4 | learning rate: 0.000E+00 | global batch size:    16 | loss scale: 2147483648.0 | number of skipped iterations:   1 | number of nan iterations:   0 |
time (ms) | backward-compute: 31.50 | backward-params-all-reduce: 35.43 | backward-embedding-all-reduce: 0.03 | optimizer-copy-to-main-grad: 2.87 | optimizer-unscale-and-check-inf: 12.14 | optimizer: 15.32 | batch-generator: 274.37
----------------------------------------------------------------------------------------------------
validation loss at iteration 2 | lm loss value: 4.258614E-01 | lm loss PPL: 1.530909E+00 |
--------------------------------------------------------------------------------------------
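
Two details of the log above can be checked by hand. The reported perplexity (lm loss PPL) is the exponential of the loss value, and the loss scale is halved between the first two iterations, both of which are reported as skipped. The sketch below reproduces these numbers from the logged values. Reading the halving as dynamic fp16 loss scaling reacting to an overflow is an inference from the log and the --fp16 flag, not something the source states.

```python
import math

# Values copied from the validation lines of the log above.
val_losses = [4.280033e-01, 4.258614e-01]
reported_ppl = [1.534191e+00, 1.530909e+00]

# Perplexity is the exponential of the (natural-log) loss.
computed_ppl = [math.exp(loss) for loss in val_losses]

# Loss scale reported at iterations 1 and 2.
loss_scales = [4294967296.0, 2147483648.0]
ratio = loss_scales[1] / loss_scales[0]

for loss, ppl in zip(val_losses, computed_ppl):
    print(f"loss={loss:.6f} -> PPL={ppl:.6f}")
print(f"loss scale ratio: {ratio}")
```

The learning rate of 0.000E+00 in both iterations is consistent with this reading: a skipped iteration does not apply an optimizer step, so the warmup schedule has not yet advanced.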

Signal Detection

First, activate the waveform environment. Then run the train_detection.py script to train your own detection model.

$ conda activate waveform
$ cd /workspace/GWAI/
$ python demos/train_detection.py configs/detection.yaml

You can modify detection.yaml to define your own training dataset as well as model configurations. For example:

# Basic parameters
# Seed needs to be set at top of yaml, before objects with parameters are made
#
seed: 1607
__set_seed: !apply:torch.manual_seed [!ref <seed>]

# cuda device num
cuda: 5
# Data params
data_folder: './datasets/detection'
data_hdf5: smbhb_test.hdf5
noise_hdf5: noise_test.hdf5

experiment_name: detection_demo
#----------------------------------------

output_folder: !ref results/<experiment_name>/<seed>
train_log: !ref <output_folder>/train_log.txt
save_folder: !ref <output_folder>/save

# Experiment params
auto_mix_prec: False
test_only: False
num_spks: 1
progressbar: True
save_inf_data: False
save_attention_weights: False
# se loss * alpha + clsf loss * (1 - alpha)
alpha: 1
inf_data: !ref <save_folder>/inf_test/
# att_data: !ref <save_folder>/inf_test/

# Training parameters
N_epochs: 100
batch_size: 16
lr: 0.0005
clip_grad_norm: 5
loss_upper_lim: 999999  # this is the upper limit for an acceptable loss
# if True, the training sequences are cut to a specified length
limit_training_signal_len: False
# this is the length of sequences if we choose to limit
# the signal length of training sequences
training_signal_len: 4000
dataloader_opts:
    batch_size: !ref <batch_size>
    num_workers: 3

# loss thresholding -- this thresholds the training loss
threshold_byloss: True
threshold: -50

# Encoder parameters
N_encoder_out: 256
out_channels: 256
kernel_size: 16
kernel_stride: 8


# Specifying the network
Encoder: !new:speechbrain.lobes.models.dual_path.Encoder
    kernel_size: !ref <kernel_size>
    out_channels: !ref <N_encoder_out>


SBtfintra: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 2
    d_model: !ref <out_channels>
    nhead: 4
    d_ffn: 256
    dropout: 0
    use_positional_encoding: True
    norm_before: True

SBtfinter: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 2
    d_model: !ref <out_channels>
    nhead: 4
    d_ffn: 256
    dropout: 0
    use_positional_encoding: True
    norm_before: True

MaskNet: !new:speechbrain.lobes.models.dual_path.Dual_Path_Model
    num_spks: !ref <num_spks>
    in_channels: !ref <N_encoder_out>
    out_channels: !ref <out_channels>
    num_layers: 2
    K: 25
    intra_model: !ref <SBtfintra>
    inter_model: !ref <SBtfinter>
    norm: ln
    linear_layer_after_inter_intra: False
    skip_around_intra: True

Decoder: !new:speechbrain.lobes.models.dual_path.Decoder
    in_channels: !ref <N_encoder_out>
    out_channels: 1
    kernel_size: !ref <kernel_size>
    stride: !ref <kernel_stride>
    bias: False

linear_1: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <training_signal_len>
    n_neurons: 512

relu: !new:torch.nn.ReLU

linear_2: !new:speechbrain.nnet.linear.Linear
    input_size: 512
    n_neurons: 1

optimizer: !name:torch.optim.Adam
    lr: !ref <lr>
    weight_decay: 0


loss: !name:speechbrain.nnet.losses.get_si_snr_with_pitwrapper
loss2: !name:speechbrain.nnet.losses.bce_loss

lr_scheduler: !new:speechbrain.nnet.schedulers.ReduceLROnPlateau
    factor: 0.5
    patience: 2
    dont_halve_until_epoch: 35

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <N_epochs>

modules:
    encoder: !ref <Encoder>
    decoder: !ref <Decoder>
    masknet: !ref <MaskNet>
    linear_1: !ref <linear_1>
    linear_2: !ref <linear_2>

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        encoder: !ref <Encoder>
        decoder: !ref <Decoder>
        masknet: !ref <MaskNet>
        linear_1: !ref <linear_1>
        linear_2: !ref <linear_2>
        counter: !ref <epoch_counter>
        lr_scheduler: !ref <lr_scheduler>
        # mlp: !ref <MLP>

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>
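
The alpha setting weights the two loss terms as described by the comment in the configuration (se loss * alpha + clsf loss * (1 - alpha)). A minimal sketch of that weighting follows. The function name combined_loss is hypothetical, the restriction of alpha to the range 0 to 1 is an assumption, and the actual training script may combine the terms differently in detail.

```python
def combined_loss(se_loss, clsf_loss, alpha):
    """Weighted sum of the signal-extraction loss and the classification loss."""
    # Assumption: alpha is a mixing weight between 0 and 1.
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    return alpha * se_loss + (1.0 - alpha) * clsf_loss


# Illustrative values for the two terms.
only_se = combined_loss(6.18, 0.693, alpha=1.0)
balanced = combined_loss(6.18, 0.693, alpha=0.5)
print(only_se, balanced)
```

With alpha: 1, as in the example configuration, the total equals the first term alone, which is consistent with train_loss matching loss1 in the training log for this section.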

Training produces a log like the following.

speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: results/detection_demo22/1607
speechbrain.core - Info: test_only arg overridden by command line input to: False
speechbrain.core - Info: auto_mix_prec arg from hparam file is used
speechbrain.core - 5.6M trainable parameters in Separation
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
speechbrain.utils.epoch_loop - Going into epoch 1
100%|██████████| 13/13 [00:02<00:00,  5.45it/s, loss1=6.18, loss2=0.693, train_loss=6.18]
100%|██████████| 3/3 [00:00<00:00,  3.50it/s]
speechbrain.utils.train_logger - epoch: 1, lr: 5.00e-04 - train si-snr: 6.18, train loss1: 6.18, train loss2: 6.93e-01 - valid si-snr: -6.32e-01, valid loss1: -6.32e-01, valid loss2: 6.96e-01
speechbrain.utils.checkpoints - Saved an end-of-epoch checkpoint in results/detection_demo22/1607/save/CKPT+2024-02-02+15-55-58+00
speechbrain.utils.epoch_loop - Going into epoch 2
100%|██████████| 13/13 [00:02<00:00,  5.72it/s, loss1=-2.26, loss2=0.693, train_loss=-2.26]
100%|██████████| 3/3 [00:00<00:00,  3.47it/s]
speechbrain.utils.train_logger - epoch: 2, lr: 5.00e-04 - train si-snr: -2.26e+00, train loss1: -2.26e+00, train loss2: 6.93e-01 - valid si-snr: -2.13e+00, valid loss1: -2.13e+00, valid loss2: 6.97e-01
speechbrain.utils.checkpoints - Saved an end-of-epoch checkpoint in results/detection_demo22/1607/save/CKPT+2024-02-02+15-56-01+00
speechbrain.utils.checkpoints - Deleted checkpoint in results/detection_demo22/1607/save/CKPT+2024-02-02+15-55-58+00
speechbrain.utils.epoch_loop - Going into epoch 3
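
The si-snr and loss1 columns in this log relate to the scale-invariant signal-to-noise ratio (SI-SNR) selected by the loss entry of the configuration. For readers unfamiliar with it, the following pure-Python sketch computes SI-SNR in decibels for a single pair of signals, using the standard zero-mean definition. It is an illustration only, not the SpeechBrain implementation, and the function name si_snr_db is hypothetical. A loss built on this quantity is typically its negative, so under that assumption a lower loss1 indicates a better reconstruction.

```python
import math


def si_snr_db(estimate, reference, eps=1e-12):
    """Scale-invariant SNR (dB) between two equal-length sequences."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean of each signal.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    # Everything that is not along the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target) + eps
    noise_energy = sum(x * x for x in noise) + eps
    return 10.0 * math.log10(target_energy / noise_energy)


# Toy signals: the estimate is the reference plus a small orthogonal error.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.1, -0.9, 0.9, -1.1]
value = si_snr_db(estimate, reference)
scaled = si_snr_db([3.0 * x for x in estimate], reference)
print(round(value, 3), round(scaled, 3))
```

In this toy case the error carries 1% of the target energy, so the result is about 20 dB, and multiplying the estimate by a constant leaves the value unchanged, which is the scale invariance the name refers to.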