Question 1:
When I start model training using the yaml file given below, the following messages appear with high frequency:
unigram_model.cc(494) LOG(WARNING) Too big agenda size 10003. Shrinking (round 1) down to 100.
unigram_model.cc(494) LOG(WARNING) Too big agenda size 10002. Shrinking (round 2) down to 100.
unigram_model.cc(494) LOG(WARNING) Too big agenda size 10003. Shrinking (round 3) down to 100.
unigram_model.cc(494) LOG(WARNING) Too big agenda size 10003. Shrinking (round 4) down to 100.
unigram_model.cc(494) LOG(WARNING) Too big agenda size 10003. Shrinking (round 1) down to 100.
unigram_model.cc(494) LOG(WARNING) Too big agenda size 10003. Shrinking (round 1) down to 100.
.....
[2024-01-19 12:37:57,483 WARNING] The batch will be filled until we reach 8, its size may exceed 4096 tokens
[2024-01-19 12:37:57,483 WARNING] The batch will be filled until we reach 8, its size may exceed 4096 tokens
[2024-01-19 12:37:57,483 WARNING] The batch will be filled until we reach 8, its size may exceed 4096 tokens
[2024-01-19 12:37:58,932 WARNING] The batch will be filled until we reach 8, its size may exceed 4096 tokens
[2024-01-19 12:38:03,003 WARNING] The batch will be filled until we reach 8, its size may exceed 4096 tokens
[2024-01-19 12:38:03,004 WARNING] The batch will be filled until we reach 8, its size may exceed 4096 tokens
....
Grad overflow on iteration 7459
Using dynamic loss scale of 1
Grad overflow on iteration 7460
Using dynamic loss scale of 1
Grad overflow on iteration 7461
Using dynamic loss scale of 1
Grad overflow on iteration 7462
Using dynamic loss scale of 1
Grad overflow on iteration 7463
Using dynamic loss scale of 1
Grad overflow on iteration 7464
Using dynamic loss scale of 1
Grad overflow on iteration 7465
Using dynamic loss scale of 1
Grad overflow on iteration 7466
Using dynamic loss scale of 1
....
Question 2: When I enable the parameter src/tgt_subword_vocab (the vocab file generated at the same time as the spm (src_subword_model) training), training will not run at all. So instead of using spm to segment my own data, I would like to train a translation model on top of a pre-trained language model (e.g., BERT), but I can't figure out from the docs how to do this. Any help would be much appreciated! Thanks a million!
After I modified the YAML configuration as follows, the model ran for much longer with almost no errors in the logs. However, around step 30,000 (my training corpus is approximately 2 million parallel sentences), accuracy dropped suddenly and the cross-entropy (xent) and perplexity (ppl) values became NaN.
Following the suggestion in FP16 training error · Issue #1645 · OpenNMT/OpenNMT-py · GitHub, I tried FusedAdam instead of Adam as the optimizer. However, xent and ppl turned to NaN even faster, and more errors appeared in the logs:
Using dynamic loss scale of 256.0
[2024-01-24 04:36:01,961 INFO] Step 9800/100000; acc: 74.0; ppl: 7.1; xent: 2.0; lr: 0.00089; sents: 448673; bsz: 765/1427/561; 13979/26056 tok/s; 5055 sec;
Grad overflow on iteration 9733
Using dynamic loss scale of 256.0
[2024-01-24 04:36:20,361 INFO] Weighted corpora loaded so far:
* corpus_1: 22
[2024-01-24 04:36:22,080 INFO] Weighted corpora loaded so far:
* corpus_1: 22
[2024-01-24 04:36:27,472 INFO] Weighted corpora loaded so far:
* corpus_1: 22
[2024-01-24 04:36:30,437 INFO] Weighted corpora loaded so far:
* corpus_1: 22
Grad overflow on iteration 9811
Using dynamic loss scale of 128.0
Grad overflow on iteration 9814
Using dynamic loss scale of 64.0
Grad overflow on iteration 9815
Using dynamic loss scale of 32.0
Grad overflow on iteration 9818
Using dynamic loss scale of 16.0
........
........
Grad overflow on iteration 13098
Using dynamic loss scale of 1
Grad overflow on iteration 13099
Using dynamic loss scale of 1
[2024-01-24 04:59:02,938 INFO] Weighted corpora loaded so far:
* corpus_1: 29
[2024-01-24 04:59:07,837 INFO] Weighted corpora loaded so far:
* corpus_1: 29
[2024-01-24 04:59:13,976 INFO] Weighted corpora loaded so far:
* corpus_1: 29
[2024-01-24 04:59:19,269 INFO] Weighted corpora loaded so far:
* corpus_1: 29
[2024-01-24 04:59:22,051 INFO] Step 13200/100000; acc: 75.9; ppl: nan; xent: nan; lr: 0.00077; sents: 443999; bsz: 752/1422/555; 14805/27986 tok/s; 6455 sec;
Grad overflow on iteration 13100
Using dynamic loss scale of 1
#yaml
# TensorBoard parameters
tensorboard: true
tensorboard_log_dir: Train/log/tensorboard_logs
## Where the samples will be written
save_data: Train/vocab/
# Training files
data:
    corpus_1:
        path_src: TRAIN_DATASET/bo_train.txt
        path_tgt: TRAIN_DATASET/zh_train.txt
        transforms: [filtertoolong]
    valid:
        path_src: TRAIN_DATASET/bo_val.txt
        path_tgt: TRAIN_DATASET/zh_val.txt
        transforms: [filtertoolong]
# Vocabulary files, generated by onmt_build_vocab
src_vocab: Train/vocab/vocab.src
tgt_vocab: Train/vocab/vocab.tgt
share_vocab: True
overwrite: False
# Vocabulary size - should be the same as in sentence piece
src_vocab_size: 72000
tgt_vocab_size: 72000
# Filter out source/target longer than n if [filtertoolong] enabled
src_seq_length: 512
tgt_seq_length: 512
# Tokenization options
src_subword_model: Train/vocab/tibetan.model
tgt_subword_model: Train/vocab/chinese.model
# Where to save the log file and the output models/checkpoints
log_file: train.log
save_model: Train/run/model.bo-zh
# Stop training if it does not improve after n validations
early_stopping: 10
# Default: 5000 - Save a model checkpoint for each n
save_checkpoint_steps: 2000
# To save space, limit checkpoints to last n
keep_checkpoint: 5
seed: 2345
# Default: 100000 - Train the model to max n steps
# Increase to 200000 or more for large datasets
# For fine-tuning, add up the required steps to the original steps
train_steps: 100000
# Default: 10000 - Run validation after n steps
valid_steps: 5000
# Default: 4000 - for large datasets, try up to 8000
warmup_steps: 6000
report_every: 100
# Number of GPUs, and IDs of GPUs
world_size: 2
gpu_ranks: [0,1]
# Batching
bucket_size: 262144
num_workers: 2 # Default: 2, set to 0 when RAM out of memory
batch_type: "tokens"
batch_size: 2048 # Tokens per batch, change when CUDA out of memory
valid_batch_size: 1024
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]
# Optimization
model_dtype: "fp16"
optim: "fusedadam"
learning_rate: 0.2
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 8
hidden_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
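A note on the config above: because transforms only lists [filtertoolong], the src_subword_model / tgt_subword_model entries are not applied during training; in OpenNMT-py the subword models are only used when a subword transform such as sentencepiece is listed. A minimal sketch of the data section with on-the-fly SentencePiece segmentation enabled, reusing the paths from the config above (an illustration, not the original poster's setup):

data:
    corpus_1:
        path_src: TRAIN_DATASET/bo_train.txt
        path_tgt: TRAIN_DATASET/zh_train.txt
        # apply SentencePiece segmentation before length filtering
        transforms: [sentencepiece, filtertoolong]
    valid:
        path_src: TRAIN_DATASET/bo_val.txt
        path_tgt: TRAIN_DATASET/zh_val.txt
        transforms: [sentencepiece, filtertoolong]
# picked up by the sentencepiece transform
src_subword_model: Train/vocab/tibetan.model
tgt_subword_model: Train/vocab/chinese.model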
Did you try to follow a basic tutorial and see how it runs?
If not, please do so.
For instance, you seem to be using a separate spm model for your src and tgt languages but at the same time trying to share vocabs / embeddings.
You need to read a bit more about the basics first, otherwise you will waste your time in experiments. Also, please do not post the same question everywhere; one place is fine, but we can't always answer right away.
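To make that concrete: with two separate SentencePiece models (tibetan.model / chinese.model), share_vocab: True merges two subword inventories that were learned independently of each other. A more consistent shared setup would use a single joint SentencePiece model trained on the concatenated source and target data; a hedged sketch, where the joint model and vocab paths are hypothetical:

# one joint subword model and vocab used on both sides
src_subword_model: Train/vocab/bo-zh.joint.model
tgt_subword_model: Train/vocab/bo-zh.joint.model
src_vocab: Train/vocab/vocab.bo-zh
tgt_vocab: Train/vocab/vocab.bo-zh
share_vocab: True

Alternatively, keep the two language-specific models and set share_vocab: False.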
I'm sorry about the duplicate posts you mentioned; I have already deleted them and apologize for the inconvenience. Also, the second YAML configuration I provided earlier had some issues, which have since been corrected. However, even with the revised YAML file, or when training with a smaller learning rate, the following issues still occurred multiple times:
Thank you for your response. Lowering the learning rate to 0.2 has kept everything normal so far. However, the "Weighted corpora loaded so far" messages mentioned earlier still appear. I will evaluate the model's performance on the test set after training is complete. Thanks again for your reply.
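For reference, a hedged summary of the fp16 optimization settings this thread ended up with; the values are taken from the config above, and the fp32 fallback is only a common workaround if the overflow / NaN behaviour returns, not something tested here:

model_dtype: "fp16"   # try "fp32" if grad overflow / NaN returns
optim: "fusedadam"
learning_rate: 0.2    # lowered, as mentioned above
warmup_steps: 6000
decay_method: "noam"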