Training examples of AI-centered models
Signal Classification
First, activate the waveform environment.
Then run the train_classify.py script to train your own signal classification model.
$ conda activate waveform
$ cd /workspace/GWAI/demos
$ python train_classify.py
You can modify classify.yaml to define your own training dataset as well as model configurations. For example:
dataset:
  save_path: "../datasets/classify/"
  fn: emri_asd_test.hdf5
dataloader:
  batch_size: 256
  num_workers: 8

training:
  test_only: False
  checkpoint_dir:
  gpu: 0
  n_epoch: 50
  # loss_fn: "bce_with_logits"
  loss_fn: "cross_entropy"
  optimizer_type: "adam"
  optimizer_kwargs:
    lr: 5e-5
    weight_decay: 1e-3
  scheduler_type: "plateau"
  scheduler_kwargs:
    mode: "min"
    factor: 0.5
    patience: 5
    threshold: 1e-4
  result_dir: "./results//${now:%Y-%m-%d}/${now:%H-%M-%S}"
  result_fn: "inf_result.npy"
  use_wandb: False

net:
  input_channels: 2
  n_classes: 2
  n_hidden: 128
  n_levels: 10
  kernel_size: 3
  num_classes: 2
  dropout: 0
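The scheduler_kwargs above use the same parameter names as PyTorch's torch.optim.lr_scheduler.ReduceLROnPlateau. Assuming scheduler_type: "plateau" maps to that class (the source does not confirm this), the following pure-Python sketch illustrates the rule those settings describe: the learning rate is multiplied by factor once the validation loss has failed to improve for more than patience epochs. The class name PlateauSketch is hypothetical, and the sketch covers only mode "min" with a relative threshold and no cooldown.

```python
class PlateauSketch:
    """Simplified reduce-on-plateau rule (mode="min", relative threshold, no cooldown)."""

    def __init__(self, lr, factor=0.5, patience=5, threshold=1e-4):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.threshold = threshold
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # An epoch counts as an improvement only if the loss drops by more
        # than `threshold`, measured relative to the best loss seen so far.
        if val_loss < self.best * (1.0 - self.threshold):
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Reduce the learning rate once the stall exceeds `patience` epochs.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


# Values taken from classify.yaml: lr=5e-5, factor=0.5, patience=5, threshold=1e-4.
sched = PlateauSketch(lr=5e-5, factor=0.5, patience=5, threshold=1e-4)
# One improving epoch followed by six epochs without improvement.
history = [sched.step(loss) for loss in [0.69] + [0.70] * 6]
print(history)
```

With these settings the learning rate stays at 5e-5 through five stalled epochs and is halved on the sixth.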
The output log can be seen as follows.
[2024-02-04 10:25:46,915][nn.dataloader][INFO] - Loading data from ../datasets/detection/emri_asd_test.hdf5
Using Adam optimizer, lr=5e-05, weight_decay=0.001
Total parameters: 940.42K
Trainable parameters: 940.42K
Non-trainable parameters: 0
Epoch 1: 100%|████████████████████████████████████████| 200/200 [00:01<00:00, 138.53it/s, loss=6.94e-01, acc=0.49]
  0%|          | 0/200 [00:00<?, ?it/s]Time: 0.010484933853149414
100%|████████████████████████████████████████| 200/200 [00:00<00:00, 223.66it/s, loss=6.91e-01, acc=0.5050]
[2024-02-04 10:25:54,895][nn.trainer][INFO] - EPOCH 1 : lr=5.00e-05, train_loss=6.94e-01, train_acc=0.4900, val_loss=6.91e-01 valid_acc=0.5050
Epoch 2: 100%|████████████████████████████████████████| 200/200 [00:01<00:00, 156.30it/s, loss=6.91e-01, acc=0.50]
  0%|          | 0/200 [00:00<?, ?it/s]Time: 0.010904073715209961
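When reading the first epochs of this log, it helps to know the loss of a classifier that predicts every class with equal probability. The sketch below computes that baseline for the cross_entropy loss and n_classes: 2 from classify.yaml. The function name is hypothetical, and reading the first-epoch values as "near chance" is an interpretation, not something the source states.

```python
import math


def chance_level_cross_entropy(n_classes: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over n_classes."""
    return math.log(n_classes)


baseline = chance_level_cross_entropy(2)
# Compare with loss=6.94e-01 and acc=0.49 in the first epoch of the log.
print(f"chance-level loss for 2 classes: {baseline:.3f}")
```

A loss near ln(2) together with an accuracy near 0.5 suggests the model has not yet learned to separate the two classes at that point in training.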
Data Denoising
First, download the demo dataset (train_data, valid_data, test_data) from this repository and put it under the datasets/denoise folder.
Then run the denoise_demo.sh script to train your own denoising model.
You can modify the configuration in denoise_demo.sh to build models of different sizes.
$ conda activate base
$ cd /workspace/GWAI/demos
$ bash denoise_demo.sh
The training parameters can be modified in denoise_demo.sh, for example:
#!/bin/bash

GPUS_PER_NODE=2
MASTER_ADDR=localhost
MASTER_PORT=6066
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
DATA_PATH=../dataset/denoise

DETS=H1
CHECKPOINT_PATH=demo

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"

export CUDA_VISIBLE_DEVICES=6,7
python -m torch.distributed.launch $DISTRIBUTED_ARGS \
    pretrain_gw.py \
    --tensor-model-parallel-size 1 \
    --pipeline-model-parallel-size 1 \
    --num-layers 16 \
    --hidden-size 1024 \
    --num-attention-heads 16 \
    --micro-batch-size 8 \
    --segment-length 256 \
    --dets $DETS \
    --seq-length 128 \
    --max-position-embeddings 128 \
    --train-iters 30000 \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH \
    --data-path $DATA_PATH \
    --data-impl mmap \
    --split 949,50,1 \
    --distributed-backend nccl \
    --lr 0.0001 \
    --lr-decay-style linear \
    --min-lr 1.0e-5 \
    --lr-decay-iters 9900 \
    --weight-decay 1e-2 \
    --clip-grad 1.0 \
    --lr-warmup-fraction .002 \
    --log-interval 1 \
    --save-interval 10000 \
    --eval-interval 1 \
    --dataloader-type cyclic \
    --fp16 \
    --no-binary-head
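Several of these flags interact, so it can help to work out the derived quantities before launching. The sketch below assumes the flags follow Megatron-LM-style semantics, which their names resemble but which the source does not spell out: the global batch size is the micro-batch size times the data-parallel size, --split gives relative train/validation/test weights, and the learning rate warms up linearly and then decays linearly to --min-lr over --lr-decay-iters. All function names are hypothetical, and counting the schedule in iterations is a simplification.

```python
def global_batch_size(micro_batch_size, gpus_per_node, nnodes,
                      tensor_parallel=1, pipeline_parallel=1):
    """Micro-batch size times the number of data-parallel replicas."""
    world_size = gpus_per_node * nnodes
    data_parallel = world_size // (tensor_parallel * pipeline_parallel)
    return micro_batch_size * data_parallel


def split_fractions(split):
    """Turn a weight string such as "949,50,1" into normalized fractions."""
    weights = [float(w) for w in split.split(",")]
    total = sum(weights)
    return [w / total for w in weights]


def linear_lr(step, max_lr=1e-4, min_lr=1e-5, decay_iters=9900,
              warmup_fraction=0.002):
    """Linear warmup followed by linear decay to min_lr (assumed semantics)."""
    warmup_iters = warmup_fraction * decay_iters
    if warmup_iters > 0 and step <= warmup_iters:
        return max_lr * step / warmup_iters
    if step > decay_iters:
        return min_lr
    decay_ratio = (step - warmup_iters) / (decay_iters - warmup_iters)
    return min_lr + (1.0 - decay_ratio) * (max_lr - min_lr)


# Values taken from denoise_demo.sh.
print(global_batch_size(micro_batch_size=8, gpus_per_node=2, nnodes=1))  # 16
print(split_fractions("949,50,1"))
print(linear_lr(9900), linear_lr(30000))
```

Under these assumptions the learning rate reaches its floor of 1e-5 at iteration 9900 and stays there for the remaining training iterations, since --train-iters is larger than --lr-decay-iters.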
The output log can be seen as follows.
using world size: 2, data-parallel-size: 2, tensor-model-parallel size: 1, pipeline-model-parallel size: 1
setting global batch size to 16
using torch.float16 for parameters ...
------------------------ arguments ------------------------
accumulate_allreduce_grads_in_fp32 .............. False
adam_beta1 ...................................... 0.9
...
-------------------- end of arguments ---------------------
setting number of micro-batches to constant 1
> initializing torch distributed ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /workspace/GWAI/demo/../src/model/denoising/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_upper_triang_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /workspace/GWAI/demo/../src/model/denoising/fused_kernels/build/build.ninja...
Building extension module scaled_masked_softmax_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module scaled_masked_softmax_cuda...
Detected CUDA files, patching ldflags
Emitting ninja build file /workspace/GWAI/demo/../src/model/denoising/fused_kernels/build/build.ninja...
Building extension module fused_mix_prec_layer_norm_cuda...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_mix_prec_layer_norm_cuda...
>>> done with compiling and loading fused kernels. Compilation time: 3.274 seconds
time to initialize megatron (seconds): 41.829
[after megatron is initialized] datetime: 2024-02-02 15:50:01
building WaveFormer model ...
> number of parameters on (tensor, pipeline) model parallel rank (0, 0): 220058673
> learning rate decay style: linear
WARNING: could not find the metadata file demo/latest_checkpointed_iteration.txt
will not load any checkpoints and will start from random
time (ms) | load-checkpoint: 0.16
[after model, optimizer, and learning rate scheduler are built] datetime: 2024-02-02 15:50:01
> building train, validation, and test datasets ...
> building train, validation, and test datasets for BERT ...
> finished creating BERT datasets ...
[after dataloaders are built] datetime: 2024-02-02 15:50:06
done with setup ...
time (ms) | model-and-optimizer-setup: 111.39 | train/valid/test-data-iterators-setup: 4415.50

training ...
[before the start of training step] datetime: 2024-02-02 15:50:06
iteration 1/ 30000 | current time: 1706860208.35 | consumed samples: 16 | elapsed time per iteration (ms): 1996.1 | learning rate: 0.000E+00 | global batch size: 16 | loss scale: 4294967296.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
time (ms) | backward-compute: 138.46 | backward-params-all-reduce: 32.71 | backward-embedding-all-reduce: 0.04 | optimizer-copy-to-main-grad: 3.17 | optimizer-unscale-and-check-inf: 42.67 | optimizer: 45.94 | batch-generator: 263.80
----------------------------------------------------------------------------------------------------
validation loss at iteration 1 | lm loss value: 4.280033E-01 | lm loss PPL: 1.534191E+00 |
--------------------------------------------------------------------------------------------
iteration 2/ 30000 | current time: 1706860208.78 | consumed samples: 32 | elapsed time per iteration (ms): 429.4 | learning rate: 0.000E+00 | global batch size: 16 | loss scale: 2147483648.0 | number of skipped iterations: 1 | number of nan iterations: 0 |
time (ms) | backward-compute: 31.50 | backward-params-all-reduce: 35.43 | backward-embedding-all-reduce: 0.03 | optimizer-copy-to-main-grad: 2.87 | optimizer-unscale-and-check-inf: 12.14 | optimizer: 15.32 | batch-generator: 274.37
----------------------------------------------------------------------------------------------------
validation loss at iteration 2 | lm loss value: 4.258614E-01 | lm loss PPL: 1.530909E+00 |
--------------------------------------------------------------------------------------------
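To track validation progress from a log like this one, the validation lines can be extracted with a regular expression. The sketch below is a minimal example that parses the two validation lines shown above and checks that the reported "lm loss PPL" equals the exponential of the "lm loss value". The pattern and the function name parse_validation are illustrative and assume the line layout stays exactly as shown.

```python
import math
import re

# The two validation lines copied from the log above.
LOG_LINES = [
    "validation loss at iteration 1 | lm loss value: 4.280033E-01 | lm loss PPL: 1.534191E+00 |",
    "validation loss at iteration 2 | lm loss value: 4.258614E-01 | lm loss PPL: 1.530909E+00 |",
]

PATTERN = re.compile(
    r"validation loss at iteration (\d+) \| lm loss value: (\S+) \| lm loss PPL: (\S+)"
)


def parse_validation(line):
    """Return (iteration, loss, ppl) for a validation line, or None if it does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


records = [parse_validation(line) for line in LOG_LINES]
for iteration, loss, ppl in records:
    # The recomputed value should agree with the logged PPL up to rounding.
    print(iteration, loss, ppl, math.exp(loss))
```

For both iterations the recomputed exponential agrees with the logged PPL to the printed precision.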
Signal Detection
First, activate the waveform environment.
Then run the train_detection.py script to train your own detection model.
$ conda activate waveform
$ cd /workspace/GWAI/
$ python demos/train_detection.py configs/detection.yaml
You can modify detection.yaml to define your own training dataset as well as model configurations. For example:
# Basic parameters
# Seed needs to be set at top of yaml, before objects with parameters are made
#
seed: 1607
__set_seed: !apply:torch.manual_seed [!ref <seed>]

# cuda device num
cuda: 5
# Data params
data_folder: './datasets/detection'
data_hdf5: smbhb_test.hdf5
noise_hdf5: noise_test.hdf5

experiment_name: detection_demo
#----------------------------------------

output_folder: !ref results/<experiment_name>/<seed>
train_log: !ref <output_folder>/train_log.txt
save_folder: !ref <output_folder>/save

# Experiment params
auto_mix_prec: False
test_only: False
num_spks: 1
progressbar: True
save_inf_data: False
save_attention_weights: False
# se loss * alpha + clsf loss * (1 - alpha)
alpha: 1
inf_data: !ref <save_folder>/inf_test/
# att_data: !ref <save_folder>/inf_test/

# Training parameters
N_epochs: 100
batch_size: 16
lr: 0.0005
clip_grad_norm: 5
loss_upper_lim: 999999 # this is the upper limit for an acceptable loss
# if True, the training sequences are cut to a specified length
limit_training_signal_len: False
# this is the length of sequences if we choose to limit
# the signal length of training sequences
training_signal_len: 4000
dataloader_opts:
    batch_size: !ref <batch_size>
    num_workers: 3

# loss thresholding -- this thresholds the training loss
threshold_byloss: True
threshold: -50

# Encoder parameters
N_encoder_out: 256
out_channels: 256
kernel_size: 16
kernel_stride: 8


# Specifying the network
Encoder: !new:speechbrain.lobes.models.dual_path.Encoder
    kernel_size: !ref <kernel_size>
    out_channels: !ref <N_encoder_out>


SBtfintra: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 2
    d_model: !ref <out_channels>
    nhead: 4
    d_ffn: 256
    dropout: 0
    use_positional_encoding: True
    norm_before: True

SBtfinter: !new:speechbrain.lobes.models.dual_path.SBTransformerBlock
    num_layers: 2
    d_model: !ref <out_channels>
    nhead: 4
    d_ffn: 256
    dropout: 0
    use_positional_encoding: True
    norm_before: True

MaskNet: !new:speechbrain.lobes.models.dual_path.Dual_Path_Model
    num_spks: !ref <num_spks>
    in_channels: !ref <N_encoder_out>
    out_channels: !ref <out_channels>
    num_layers: 2
    K: 25
    intra_model: !ref <SBtfintra>
    inter_model: !ref <SBtfinter>
    norm: ln
    linear_layer_after_inter_intra: False
    skip_around_intra: True

Decoder: !new:speechbrain.lobes.models.dual_path.Decoder
    in_channels: !ref <N_encoder_out>
    out_channels: 1
    kernel_size: !ref <kernel_size>
    stride: !ref <kernel_stride>
    bias: False

linear_1: !new:speechbrain.nnet.linear.Linear
    input_size: !ref <training_signal_len>
    n_neurons: 512

relu: !new:torch.nn.ReLU

linear_2: !new:speechbrain.nnet.linear.Linear
    input_size: 512
    n_neurons: 1

optimizer: !name:torch.optim.Adam
    lr: !ref <lr>
    weight_decay: 0


loss: !name:speechbrain.nnet.losses.get_si_snr_with_pitwrapper
loss2: !name:speechbrain.nnet.losses.bce_loss

lr_scheduler: !new:speechbrain.nnet.schedulers.ReduceLROnPlateau
    factor: 0.5
    patience: 2
    dont_halve_until_epoch: 35

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <N_epochs>

modules:
    encoder: !ref <Encoder>
    decoder: !ref <Decoder>
    masknet: !ref <MaskNet>
    linear_1: !ref <linear_1>
    linear_2: !ref <linear_2>

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        encoder: !ref <Encoder>
        decoder: !ref <Decoder>
        masknet: !ref <MaskNet>
        linear_1: !ref <linear_1>
        linear_2: !ref <linear_2>
        counter: !ref <epoch_counter>
        lr_scheduler: !ref <lr_scheduler>
        # mlp: !ref <MLP>

train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>
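The !ref tags in this file come from HyperPyYAML, the YAML extension used by SpeechBrain; a placeholder such as <seed> is replaced by the value of the key with that name. The sketch below is a simplified, pure-Python stand-in for that substitution, not the HyperPyYAML implementation. It shows how output_folder, train_log and save_folder resolve for the values above. The function name resolve_refs is hypothetical, and the sketch handles only flat string placeholders without cycle detection.

```python
import re


def resolve_refs(params):
    """Expand <key> placeholders using other entries of the same flat dict."""
    pattern = re.compile(r"<(\w+)>")
    resolved = {}

    def lookup(key):
        # Resolve each key once, recursing into the keys it references.
        if key not in resolved:
            value = params[key]
            if isinstance(value, str):
                value = pattern.sub(lambda m: str(lookup(m.group(1))), value)
            resolved[key] = value
        return resolved[key]

    for key in params:
        lookup(key)
    return resolved


# Values copied from detection.yaml (the !ref tags themselves are dropped here).
params = {
    "seed": 1607,
    "experiment_name": "detection_demo",
    "output_folder": "results/<experiment_name>/<seed>",
    "train_log": "<output_folder>/train_log.txt",
    "save_folder": "<output_folder>/save",
}
paths = resolve_refs(params)
print(paths["output_folder"])  # results/detection_demo/1607
print(paths["save_folder"])    # results/detection_demo/1607/save
```

Changing experiment_name or seed therefore moves the training log and all checkpoints to a new folder, which keeps separate runs from overwriting each other.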
The output log can be seen as follows.
speechbrain.core - Beginning experiment!
speechbrain.core - Experiment folder: results/detection_demo22/1607
speechbrain.core - Info: test_only arg overridden by command line input to: False
speechbrain.core - Info: auto_mix_prec arg from hparam file is used
speechbrain.core - 5.6M trainable parameters in Separation
speechbrain.utils.checkpoints - Would load a checkpoint here, but none found yet.
speechbrain.utils.epoch_loop - Going into epoch 1
100%|████████████████████████████████████████| 13/13 [00:02<00:00, 5.45it/s, loss1=6.18, loss2=0.693, train_loss=6.18]
100%|████████████████████████████████████████| 3/3 [00:00<00:00, 3.50it/s]
speechbrain.utils.train_logger - epoch: 1, lr: 5.00e-04 - train si-snr: 6.18, train loss1: 6.18, train loss2: 6.93e-01 - valid si-snr: -6.32e-01, valid loss1: -6.32e-01, valid loss2: 6.96e-01
speechbrain.utils.checkpoints - Saved an end-of-epoch checkpoint in results/detection_demo22/1607/save/CKPT+2024-02-02+15-55-58+00
speechbrain.utils.epoch_loop - Going into epoch 2
100%|████████████████████████████████████████| 13/13 [00:02<00:00, 5.72it/s, loss1=-2.26, loss2=0.693, train_loss=-2.26]
100%|████████████████████████████████████████| 3/3 [00:00<00:00, 3.47it/s]
speechbrain.utils.train_logger - epoch: 2, lr: 5.00e-04 - train si-snr: -2.26e+00, train loss1: -2.26e+00, train loss2: 6.93e-01 - valid si-snr: -2.13e+00, valid loss1: -2.13e+00, valid loss2: 6.97e-01
speechbrain.utils.checkpoints - Saved an end-of-epoch checkpoint in results/detection_demo22/1607/save/CKPT+2024-02-02+15-56-01+00
speechbrain.utils.checkpoints - Deleted checkpoint in results/detection_demo22/1607/save/CKPT+2024-02-02+15-55-58+00
speechbrain.utils.epoch_loop - Going into epoch 3
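The progress-bar fields loss1, loss2 and train_loss can be related through the weighting rule given as a comment in detection.yaml: se loss * alpha + clsf loss * (1 - alpha). The sketch below applies that rule, assuming loss1 is the SI-SNR ("se") loss and loss2 is the classification (BCE) loss; the source does not state this mapping explicitly, and the function name combined_loss is hypothetical.

```python
def combined_loss(se_loss, clsf_loss, alpha):
    """Weighted sum from the detection.yaml comment: se * alpha + clsf * (1 - alpha)."""
    return se_loss * alpha + clsf_loss * (1.0 - alpha)


# Epoch 1 values from the log, with alpha: 1 from detection.yaml.
train_loss = combined_loss(se_loss=6.18, clsf_loss=0.693, alpha=1)
print(train_loss)  # equals loss1, as in "train_loss=6.18"

# A smaller alpha would let the classification loss contribute as well.
mixed = combined_loss(se_loss=6.18, clsf_loss=0.693, alpha=0.5)
print(mixed)
```

With alpha set to 1 the classification term is weighted by zero, which is consistent with train_loss matching loss1 in both epochs while loss2 stays at about 0.693.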