kitkhai
(Kitkhai)
September 1, 2023, 8:39am
64
Hi
I realised that I don’t quite understand what token batching is, compared with the more conventional batching. Typically we have sequences of sentences batched together to form a batch. But for token batching, e.g. a batch size of 8 tokens, what is the length of the sequences of tokens that are batched together?
Also, why is token batching used/recommended? I don’t quite understand why, as I think that by defining a batch by tokens we could be splitting a sentence up, and hence the connections between words in a sentence may be broken, and as such the language model would not be able to learn optimally?
Token batching doesn’t split up the sentences themselves, I believe. It just tries to find the number of tokens closest to the token batch size you set that is also a multiple of 8 (traditionally).
Token batching is ‘better’ in this case because it keeps the size of batches more uniform. A batch size of 128 sentences would take 128 sentences regardless of their length, so the number of tokens in each batch can easily end up wildly different. Token batching fixes that.
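To make the contrast concrete, here is a minimal sketch of the two batching styles as OpenNMT-py YAML options (batch_type and batch_size are the same options used in the configs later in this thread; the two settings below are alternatives, and the values are only examples):
# Alternative 1 - sentence batching: every batch holds exactly 128 sentences,
# so the token count per batch varies with sentence length.
batch_type: "sents"
batch_size: 128

# Alternative 2 - token batching: whole sentences are grouped until the batch
# reaches roughly 1024 tokens, so every batch has a similar memory footprint.
# Sentences are never split across batches.
batch_type: "tokens"
batch_size: 1024
batch_size_multiple: 8   # optionally round batches to a multiple of 8 (useful for fp16)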
i_la_13
(Ivan)
September 18, 2023, 10:33am
67
Hi Vincent, thank you so much for the tutorial!
I’m trying to fine-tune NLLB-200 3.3B using LoRa, and the training works, but when I try to translate some simple sentences I get “” (empty output) for all the sentences.
These are my config files for training and inference:
training
# Vocab creation options
share_vocab: true
## Where the vocab(s) is
src_vocab: "dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 256206
tgt_vocab: "dictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256206
vocab_size_multiple: 1
decoder_start_token: '</s>'
### Transform related opts:
#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
#Corpus opts:
data:
    corpus_enes:
        path_src: "en-es.en"
        path_tgt: "en-es.es"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        src_prefix: "</s> eng_Latn"
        tgt_prefix: "spa_Latn"
        weight: 10
        src_suffix: ""
        tgt_suffix: ""
#### Filter
src_seq_length: 250
tgt_seq_length: 250
# General opts
update_vocab: true
train_from: "nllb-200-3.3B-onmt.pt"
reset_optim: all
save_data: "/nllb-200"
save_model: "trained_models_en_es/nllb-200-en_es"
log_file: "train.log"
keep_checkpoint: -1
save_checkpoint_steps: 5000
average_decay: 0.0005
seed: 1234
report_every: 1
train_steps: 100000
valid_steps: 5000
# Batching
bucket_size: 262144
num_workers: 1
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 1024
valid_batch_size: 1024
batch_size_multiple: 2
accum_count: [2]
accum_steps: [0]
# Optimization
model_dtype: "fp16"
optim: "fusedadam"
learning_rate: 0.1
warmup_steps: 30
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
dropout: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
#LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 8
lora_dropout: 0.05
lora_alpha: 1
lora_embedding: false
# Model
override_opts: true
# For LoRa to work
add_ffnbias: true
add_qkvbias: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
transformer_ff: 8192
hidden_size: 2048
word_vec_size: 2048
heads: 16
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'
inference:
transforms: [sentencepiece, prefix, suffix]
# nllb-200 specific prefixing and suffixing
src_prefix: "</s> eng_Latn"
tgt_prefix: "spa_Latn"
tgt_file_prefix: true
src_suffix: ""
tgt_suffix: ""
#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Model info
model: nllb-200-en_es_step_65000.pt
# Inference
max_length: 256
gpu: 0
batch_type: tokens
# for 3,3B model
batch_size: 1024
fp16:
beam_size: 5
report_time: true
log_file: "translate.log"
Do you see anything wrong?
vince62s
(Vincent Nguyen)
September 18, 2023, 11:08am
68
The config needs to be like:
src_prefix: "eng_Latn"
tgt_prefix: "deu_Latn"
src_suffix: "</s>"
tgt_suffix: ""
When you start logging the ACC/PPL (don’t wait 65000 steps) check that ACC is already very high and PPL low.
If you have a doubt, post the log of the first 2000 steps here.
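For readers skimming the thread, this is how the corrected prefix/suffix would sit inside the data section of the training config above (a sketch based on Vincent’s reply, with spa_Latn substituted for Ivan’s en→es pair; paths are taken from Ivan’s original config):
data:
    corpus_enes:
        path_src: "en-es.en"
        path_tgt: "en-es.es"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        src_prefix: "eng_Latn"    # language tag only, no leading </s>
        tgt_prefix: "spa_Latn"
        src_suffix: "</s>"        # the sentence terminator goes in the source suffix
        tgt_suffix: ""
        weight: 10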
i_la_13
(Ivan)
September 18, 2023, 12:44pm
69
Here is the log of the first 2k steps:
[2023-09-18 11:48:58,798 INFO] Step 100/100000; acc: 76.9; ppl: 13.3; xent: 2.6; lr: 0.00022; sents: 5024; bsz: 810/ 886/25; 838/918 tok/s; 193 sec;
[2023-09-18 11:50:33,613 INFO] Step 200/100000; acc: 77.9; ppl: 12.4; xent: 2.5; lr: 0.00016; sents: 4200; bsz: 785/ 880/21; 1655/1857 tok/s; 288 sec;
[2023-09-18 11:52:07,205 INFO] Step 300/100000; acc: 78.1; ppl: 12.4; xent: 2.5; lr: 0.00013; sents: 4166; bsz: 790/ 881/21; 1688/1884 tok/s; 382 sec;
[2023-09-18 11:53:40,853 INFO] Step 400/100000; acc: 78.3; ppl: 12.4; xent: 2.5; lr: 0.00011; sents: 4618; bsz: 783/ 884/23; 1673/1889 tok/s; 475 sec;
[2023-09-18 11:55:14,938 INFO] Step 500/100000; acc: 80.1; ppl: 11.1; xent: 2.4; lr: 0.00010; sents: 4122; bsz: 787/ 884/21; 1673/1879 tok/s; 569 sec;
[2023-09-18 11:56:48,616 INFO] Step 600/100000; acc: 79.7; ppl: 11.4; xent: 2.4; lr: 0.00009; sents: 4806; bsz: 798/ 891/24; 1704/1902 tok/s; 663 sec;
[2023-09-18 11:58:22,676 INFO] Step 700/100000; acc: 78.8; ppl: 11.9; xent: 2.5; lr: 0.00008; sents: 4456; bsz: 794/ 891/22; 1688/1895 tok/s; 757 sec;
[2023-09-18 11:59:59,265 INFO] Step 800/100000; acc: 79.6; ppl: 11.4; xent: 2.4; lr: 0.00008; sents: 4418; bsz: 792/ 889/22; 1639/1840 tok/s; 854 sec;
[2023-09-18 12:01:32,534 INFO] Step 900/100000; acc: 80.8; ppl: 10.8; xent: 2.4; lr: 0.00007; sents: 4166; bsz: 802/ 898/21; 1719/1926 tok/s; 947 sec;
[2023-09-18 12:03:07,545 INFO] Step 1000/100000; acc: 78.3; ppl: 12.3; xent: 2.5; lr: 0.00007; sents: 4736; bsz: 782/ 882/24; 1647/1856 tok/s; 1042 sec;
[2023-09-18 12:04:40,928 INFO] Step 1100/100000; acc: 78.8; ppl: 12.0; xent: 2.5; lr: 0.00007; sents: 4530; bsz: 801/ 880/23; 1716/1885 tok/s; 1135 sec;
[2023-09-18 12:06:16,862 INFO] Step 1200/100000; acc: 79.8; ppl: 11.3; xent: 2.4; lr: 0.00006; sents: 4004; bsz: 785/ 885/20; 1636/1844 tok/s; 1231 sec;
[2023-09-18 12:07:53,405 INFO] Step 1300/100000; acc: 80.0; ppl: 11.3; xent: 2.4; lr: 0.00006; sents: 4614; bsz: 805/ 896/23; 1669/1855 tok/s; 1328 sec;
[2023-09-18 12:09:30,113 INFO] Step 1400/100000; acc: 78.7; ppl: 11.9; xent: 2.5; lr: 0.00006; sents: 4580; bsz: 785/ 884/23; 1623/1828 tok/s; 1424 sec;
[2023-09-18 12:11:04,188 INFO] Step 1500/100000; acc: 78.4; ppl: 12.2; xent: 2.5; lr: 0.00006; sents: 5078; bsz: 793/ 876/25; 1685/1862 tok/s; 1519 sec;
[2023-09-18 12:12:38,809 INFO] Step 1600/100000; acc: 78.9; ppl: 11.9; xent: 2.5; lr: 0.00006; sents: 4746; bsz: 789/ 882/24; 1668/1864 tok/s; 1613 sec;
[2023-09-18 12:14:13,428 INFO] Step 1700/100000; acc: 79.7; ppl: 11.5; xent: 2.4; lr: 0.00005; sents: 4852; bsz: 801/ 889/24; 1694/1879 tok/s; 1708 sec;
[2023-09-18 12:15:46,100 INFO] Step 1800/100000; acc: 79.7; ppl: 11.5; xent: 2.4; lr: 0.00005; sents: 5330; bsz: 806/ 883/27; 1739/1906 tok/s; 1800 sec;
[2023-09-18 12:17:19,273 INFO] Step 1900/100000; acc: 79.4; ppl: 11.6; xent: 2.5; lr: 0.00005; sents: 4762; bsz: 799/ 884/24; 1714/1897 tok/s; 1894 sec;
[2023-09-18 12:18:53,349 INFO] Step 2000/100000; acc: 78.1; ppl: 12.3; xent: 2.5; lr: 0.00005; sents: 4376; bsz: 777/ 871/22; 1652/1852 tok/s; 1988 sec;
vince62s
(Vincent Nguyen)
September 18, 2023, 1:29pm
70
Looks great.
Then do you merge the saved checkpoint with the original? Does it work fine?
Run inference on the merged checkpoint.
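For context, the merge step Vincent refers to is done with the lora_weights.py tool shipped with OpenNMT-py. A minimal sketch, with file names as placeholders taken from the config above (check the exact flag names against your OpenNMT-py version):
python3 OpenNMT-py/tools/lora_weights.py \
    --action merge \
    --base_model nllb-200-3.3B-onmt.pt \
    --lora_weights trained_models_en_es/nllb-200-en_es_step_65000.pt \
    --output nllb-200-en_es-merged.pt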
i_la_13
(Ivan)
September 18, 2023, 4:20pm
71
OK, now it’s working. I think it was not working because of the prefix/suffix format. Thank you so much Vincent!!
zszsz
(Alex.Z)
September 19, 2023, 8:11am
72
Hi Vincent,
Thanks a lot for the great work. I need to fine-tune the model from EN to multiple languages (e.g. ZH, FR, DE and PT) in a specific domain. Can I list all the language pairs in the nllb-train.yaml file (see below)? Before training, do I need to do any pre-processing on the data (such as applying the sentencepiece model to the corpus)?
enzh:
    path_src: "/en-zh/train.en"
    path_tgt: "/en-zh/train.zh"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    weight: 10
    src_prefix: "eng_Latn"
    tgt_prefix: "zho_Hans"
    src_suffix: ""
    tgt_suffix: ""
enfr:
    path_src: "/en-fr/train.en"
    path_tgt: "/en-fr/train.fr"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    weight: 10
    src_prefix: "eng_Latn"
    tgt_prefix: "fra_Latn"
    src_suffix: ""
    tgt_suffix: ""
mick
October 26, 2023, 7:47pm
73
Hello!
I’m having trouble fine-tuning NLLB-200 1.3B on multiple GPUs, and I’d like to ask whether I am missing something.
I get AssertionError: An error in model’s partition and checkpoint’s slice was detected when training with 4 GPUs.
If I run the fine-tuning with only a single GPU, everything seems fine.
My train.yaml is the following:
# Vocab creation options
share_vocab: true
## Where the vocab(s) is
src_vocab: "pretrained_model/dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 256206
tgt_vocab: "pretrained_model/dictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256206
vocab_size_multiple: 1
decoder_start_token: '</s>'
### Transform related opts:
#### Subword
src_subword_model: "pretrained_model/flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "pretrained_model/flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
#Corpus opts:
data:
    corpus_enar:
        path_src: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/crp-biling.ar-en.en"
        path_tgt: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/crp-biling.ar-en.ar"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        src_prefix: "eng_Latn"
        tgt_prefix: "arb_Arab"
        weight: 1
        src_suffix: "</s>"
        tgt_suffix: ""
    valid:
        path_src: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/en.dev"
        path_tgt: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/ar.dev"
        transforms: [sentencepiece, prefix, suffix]
        src_prefix: "eng_Latn"
        tgt_prefix: "arb_Arab"
#### Filter
src_seq_length: 250
tgt_seq_length: 250
# General opts
update_vocab: true
train_from: "pretrained_model/nllb-200-1.3B-onmt.pt"
reset_optim: all
save_data: "nllb200-1.3_en_ar/data"
save_model: "nllb200-1.3_en_ar/nllb-200-en_ar"
log_file: "train.log"
keep_checkpoint: 50
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 2000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 4
gpu_ranks: [0,1,2,3]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 1e-4
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"
# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
add_qkvbias: true
add_ffnbias: true
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'
I am using OpenNMT-py 3.4.0 and PyTorch 2.0.1 with CUDA 11.4.
Any suggestions?
Thanks in advance!
kitkhai
(Kitkhai)
February 13, 2024, 10:05am
74
Hi @vince62s ,
I saw the tutorial on multi-way training (training in multiple language directions).
Can this be applied to NLLB? What do I need to change in the files etc. to do multi-way training for NLLB?
Hi,
Thanks for this tutorial. I need your help.
I downloaded the checkpoints and the spm model from the given links in this thread. I ran the same inference script given in this tutorial on English FLORES data. However, I get the same output (4096 characters) for all input sentences.
ho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hans.........zho_Hans
Following are the links to the spm model and the NLLB checkpoint.
spm: https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
checkpoint: https://s3.amazonaws.com/opennmt-models/nllb-200/nllb-200-1.3Bdst-onmt.pt
I tried using other NLLB variants as well, but the output from every model is garbage. Similarly, thinking my data had some issue, I tried different files, but the outcome is the same.
opennmt-py version: 3.3.0
What could be the reason?
Edit: I upgraded opennmt-py to the latest version (3.4.3) and this problem is gone.
However, now each output starts with “⁇”:
⁇ 们现在有4个月大的"以前患有糖尿病的非糖尿病小鼠",他补充说.
BLEU (computed using sacrebleu) on FLORES is 1.29.
sacrebleu zho_Hans.devtest -i flores_nllb_13bdst.enzh -m bleu -w 4
Is this normal behaviour?
vince62s
(Vincent Nguyen)
February 20, 2024, 11:28am
76
Please share your yaml file.
WeiRiWa
(wanma Weiriwanma)
August 6, 2024, 10:03pm
77
Hello, I am a newbie and I want to fine-tune the NLLB-200 model on a new language. I have looked at the OpenNMT-py project, but I am not sure how to use it. Also, is the extended vocabulary handled separately from the OpenNMT project? Looking forward to your reply!
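For what it’s worth, the vocabulary-extension part is driven by options that already appear in the training configs earlier in this thread. A minimal sketch, assuming the new language’s tokens have already been appended to a copy of dictionary.txt and that the SentencePiece model can segment the new language (the file name and vocab sizes below are placeholders):
share_vocab: true
src_vocab: "dictionary_extended.txt"   # placeholder: dictionary.txt plus the new tokens
tgt_vocab: "dictionary_extended.txt"
src_vocab_size: 260000                 # placeholder: must be at least the extended vocab size
tgt_vocab_size: 260000
update_vocab: true                     # keep pretrained embedding rows, add rows for new tokens
reset_optim: all                       # required together with update_vocab
train_from: "nllb-200-3.3B-onmt.pt"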