Finetuning and Curating NLLB-200 with OpenNMT-py

kitkhai · September 1, 2023, 8:39am

Hi

I realised that I don’t quite understand how token batching differs from the more conventional batching. Typically we have sentences batched together to form a batch. But for token batching, e.g. a token batch size of 8, what is the length of the token sequences that are batched together?

Also, why is token batching used/recommended? I think that by defining a batch by tokens, we could be splitting a sentence up, and hence the connections between words in a sentence may be broken. As such, wouldn’t the language model be unable to learn optimally?

ArtanisTheOne · September 1, 2023, 9:33am

I believe token batching doesn’t split up the sentences themselves. It just tries to find the number of tokens closest to the token batch size you set, that is also a multiple of 8 (traditionally).

Token batching is ‘better’ in this case because it keeps the size of batches more standard. A batch size of 128 sentences would take 128 sentences, regardless of their size. So what can easily happen is the number of tokens in each batch can be wildly different. Token batching fixes that.
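
To make the difference concrete, here is a minimal Python sketch. It is not OpenNMT-py's actual batching code; it assumes the cost of a batch is its padded size (number of sentences times the longest sentence), and the function names and toy sentence lengths are made up for illustration. Note that sentences are only ever added whole.

```python
def sentence_batches(lengths, batch_size):
    """Conventional batching: a fixed number of sentences per batch."""
    return [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]


def token_batches(lengths, max_tokens):
    """Token batching: add whole sentences until the padded size
    (sentences x longest sentence) would exceed max_tokens.
    A single sentence longer than max_tokens still gets its own batch."""
    batches, current = [], []
    for n in sorted(lengths):  # sort so similar lengths share a batch
        candidate = current + [n]
        if current and len(candidate) * max(candidate) > max_tokens:
            batches.append(current)
            current = [n]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


def padded_size(batch):
    """Tokens the batch occupies once every sentence is padded to the longest."""
    return len(batch) * max(batch)


lengths = [3, 4, 5, 6, 35, 38, 40, 7]  # tokens per sentence (toy data)

by_sentence = sentence_batches(lengths, 4)
by_token = token_batches(lengths, 48)

print([padded_size(b) for b in by_sentence])  # [24, 160]
print([padded_size(b) for b in by_token])     # [35, 35, 38, 40]
```

With 4 sentences per batch the two batches differ in size by a factor of more than six, while the token budget keeps every batch between 35 and 40 tokens.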

i_la_13 · September 18, 2023, 10:33am

Hi Vincent, thank you so much for the tutorial!
I’m trying to finetune NLLB-200 3.3B using LoRA. The training works, but when I try to translate some simple sentences I get “” for all of them.
These are my config files for training and inference:
training:

# Vocab creation options
share_vocab: true

## Where the vocab(s) is
src_vocab: "dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 256206

tgt_vocab: "dictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256206

vocab_size_multiple: 1

decoder_start_token: '</s>'


### Transform related opts:

#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0


#Corpus opts:
data:

  corpus_enes:
    path_src: "en-es.en"
    path_tgt: "en-es.es"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    src_prefix: "</s> eng_Latn"
    tgt_prefix: "spa_Latn"
    weight: 10
    src_suffix: "" 
    tgt_suffix: ""



#### Filter
src_seq_length: 250
tgt_seq_length: 250


# General opts
update_vocab: true 

train_from: "nllb-200-3.3B-onmt.pt"

reset_optim: all 
save_data: "/nllb-200"
save_model: "trained_models_en_es/nllb-200-en_es"
log_file: "train.log"

keep_checkpoint: -1

save_checkpoint_steps: 5000

average_decay: 0.0005
seed: 1234
report_every: 1
train_steps: 100000 
valid_steps: 5000 

# Batching
bucket_size: 262144
num_workers: 1
prefetch_factor:  400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"

batch_size: 1024                             
valid_batch_size: 1024                        
batch_size_multiple: 2                       
accum_count: [2]

accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "fusedadam" 
learning_rate: 0.1  
warmup_steps: 30 
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
dropout: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

#LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 8
lora_dropout: 0.05
lora_alpha: 1
lora_embedding: false

# Model
override_opts: true

# For LoRa to work
add_ffnbias: true
add_qkvbias: true

encoder_type: transformer
decoder_type: transformer

enc_layers: 24          
dec_layers: 24          
transformer_ff: 8192    

hidden_size: 2048
word_vec_size: 2048

heads: 16
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

inference:

transforms: [sentencepiece, prefix, suffix]
# nllb-200 specific prefixing and suffixing
src_prefix: "</s> eng_Latn"
tgt_prefix: "spa_Latn" 
tgt_file_prefix: true
src_suffix: ""
tgt_suffix: ""
#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Model info
model: nllb-200-en_es_step_65000.pt
# Inference
max_length: 256
gpu: 0
batch_type: tokens

# for 3,3B model
batch_size: 1024

fp16:
beam_size: 5
report_time: true
log_file: "translate.log"

Do you see anything wrong?

vince62s · September 18, 2023, 11:08am

config needs to be like:

        src_prefix: "eng_Latn"
        tgt_prefix: "deu_Latn"
        src_suffix: "</s>"
        tgt_suffix: ""

When you start logging the ACC/PPL (don’t wait 65000 steps), check that ACC is already very high and PPL low.
If in doubt, post the log of the first 2000 steps here.

i_la_13 · September 18, 2023, 12:44pm

here is the log of the first 2k steps:

[2023-09-18 11:48:58,798 INFO] Step 100/100000; acc: 76.9; ppl:  13.3; xent: 2.6; lr: 0.00022; sents:    5024; bsz:  810/ 886/25; 838/918 tok/s;    193 sec;
[2023-09-18 11:50:33,613 INFO] Step 200/100000; acc: 77.9; ppl:  12.4; xent: 2.5; lr: 0.00016; sents:    4200; bsz:  785/ 880/21; 1655/1857 tok/s;    288 sec;
[2023-09-18 11:52:07,205 INFO] Step 300/100000; acc: 78.1; ppl:  12.4; xent: 2.5; lr: 0.00013; sents:    4166; bsz:  790/ 881/21; 1688/1884 tok/s;    382 sec;
[2023-09-18 11:53:40,853 INFO] Step 400/100000; acc: 78.3; ppl:  12.4; xent: 2.5; lr: 0.00011; sents:    4618; bsz:  783/ 884/23; 1673/1889 tok/s;    475 sec;
[2023-09-18 11:55:14,938 INFO] Step 500/100000; acc: 80.1; ppl:  11.1; xent: 2.4; lr: 0.00010; sents:    4122; bsz:  787/ 884/21; 1673/1879 tok/s;    569 sec;
[2023-09-18 11:56:48,616 INFO] Step 600/100000; acc: 79.7; ppl:  11.4; xent: 2.4; lr: 0.00009; sents:    4806; bsz:  798/ 891/24; 1704/1902 tok/s;    663 sec;
[2023-09-18 11:58:22,676 INFO] Step 700/100000; acc: 78.8; ppl:  11.9; xent: 2.5; lr: 0.00008; sents:    4456; bsz:  794/ 891/22; 1688/1895 tok/s;    757 sec;
[2023-09-18 11:59:59,265 INFO] Step 800/100000; acc: 79.6; ppl:  11.4; xent: 2.4; lr: 0.00008; sents:    4418; bsz:  792/ 889/22; 1639/1840 tok/s;    854 sec;
[2023-09-18 12:01:32,534 INFO] Step 900/100000; acc: 80.8; ppl:  10.8; xent: 2.4; lr: 0.00007; sents:    4166; bsz:  802/ 898/21; 1719/1926 tok/s;    947 sec;
[2023-09-18 12:03:07,545 INFO] Step 1000/100000; acc: 78.3; ppl:  12.3; xent: 2.5; lr: 0.00007; sents:    4736; bsz:  782/ 882/24; 1647/1856 tok/s;   1042 sec;
[2023-09-18 12:04:40,928 INFO] Step 1100/100000; acc: 78.8; ppl:  12.0; xent: 2.5; lr: 0.00007; sents:    4530; bsz:  801/ 880/23; 1716/1885 tok/s;   1135 sec;
[2023-09-18 12:06:16,862 INFO] Step 1200/100000; acc: 79.8; ppl:  11.3; xent: 2.4; lr: 0.00006; sents:    4004; bsz:  785/ 885/20; 1636/1844 tok/s;   1231 sec;
[2023-09-18 12:07:53,405 INFO] Step 1300/100000; acc: 80.0; ppl:  11.3; xent: 2.4; lr: 0.00006; sents:    4614; bsz:  805/ 896/23; 1669/1855 tok/s;   1328 sec;
[2023-09-18 12:09:30,113 INFO] Step 1400/100000; acc: 78.7; ppl:  11.9; xent: 2.5; lr: 0.00006; sents:    4580; bsz:  785/ 884/23; 1623/1828 tok/s;   1424 sec;
[2023-09-18 12:11:04,188 INFO] Step 1500/100000; acc: 78.4; ppl:  12.2; xent: 2.5; lr: 0.00006; sents:    5078; bsz:  793/ 876/25; 1685/1862 tok/s;   1519 sec;
[2023-09-18 12:12:38,809 INFO] Step 1600/100000; acc: 78.9; ppl:  11.9; xent: 2.5; lr: 0.00006; sents:    4746; bsz:  789/ 882/24; 1668/1864 tok/s;   1613 sec;
[2023-09-18 12:14:13,428 INFO] Step 1700/100000; acc: 79.7; ppl:  11.5; xent: 2.4; lr: 0.00005; sents:    4852; bsz:  801/ 889/24; 1694/1879 tok/s;   1708 sec;
[2023-09-18 12:15:46,100 INFO] Step 1800/100000; acc: 79.7; ppl:  11.5; xent: 2.4; lr: 0.00005; sents:    5330; bsz:  806/ 883/27; 1739/1906 tok/s;   1800 sec;
[2023-09-18 12:17:19,273 INFO] Step 1900/100000; acc: 79.4; ppl:  11.6; xent: 2.5; lr: 0.00005; sents:    4762; bsz:  799/ 884/24; 1714/1897 tok/s;   1894 sec;
[2023-09-18 12:18:53,349 INFO] Step 2000/100000; acc: 78.1; ppl:  12.3; xent: 2.5; lr: 0.00005; sents:    4376; bsz:  777/ 871/22; 1652/1852 tok/s;   1988 sec;
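
For anyone who wants to run the ACC/PPL check without reading the log by eye, here is a small Python sketch. It assumes the line format shown in the log above; the regular expression and the `parse_train_log` helper are illustrative and not part of OpenNMT-py, and other versions may format these lines differently.

```python
import re

# Pattern for the training lines shown above (an assumption based on this log).
STEP_RE = re.compile(
    r"Step\s+(?P<step>\d+)/\s*\d+;\s*acc:\s*(?P<acc>[\d.]+);\s*ppl:\s*(?P<ppl>[\d.]+)"
)


def parse_train_log(lines):
    """Return a list of (step, acc, ppl) tuples, skipping non-matching lines."""
    stats = []
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            stats.append(
                (int(match["step"]), float(match["acc"]), float(match["ppl"]))
            )
    return stats


# Two lines copied from the log above.
sample = [
    "[2023-09-18 11:48:58,798 INFO] Step 100/100000; acc: 76.9; ppl:  13.3; xent: 2.6; lr: 0.00022; sents:    5024; bsz:  810/ 886/25; 838/918 tok/s;    193 sec;",
    "[2023-09-18 12:18:53,349 INFO] Step 2000/100000; acc: 78.1; ppl:  12.3; xent: 2.5; lr: 0.00005; sents:    4376; bsz:  777/ 871/22; 1652/1852 tok/s;   1988 sec;",
]

for step, acc, ppl in parse_train_log(sample):
    print(f"step {step}: acc {acc}, ppl {ppl}")
```

To use it on a real run, pass an open file handle for `train.log` in place of `sample`.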

vince62s · September 18, 2023, 1:29pm

Looks great.
Then, do you merge the saved checkpoint with the original? Does that work fine?
Run inference on the merged checkpoint.

i_la_13 · September 18, 2023, 4:20pm

OK, now it’s working. I think it was not working because of the prefix/suffix format. Thank you so much Vincent!!
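
For readers hitting the same symptom, the difference between the two settings can be sketched in plain Python. This is only a token-level illustration: it assumes the `prefix` and `suffix` transforms prepend and append their whitespace-separated tokens around the already tokenised source, and the `build_source` helper is hypothetical, not an OpenNMT-py function.

```python
def build_source(tokens, src_prefix, src_suffix):
    """Prepend/append the prefix and suffix tokens around a tokenised source."""
    return src_prefix.split() + tokens + src_suffix.split()


tokens = ["▁Hello", "▁world"]  # toy SentencePiece-style pieces

# Settings from the config posted with the empty translations.
before = build_source(tokens, src_prefix="</s> eng_Latn", src_suffix="")
# Settings suggested in the reply.
after = build_source(tokens, src_prefix="eng_Latn", src_suffix="</s>")

print(before)  # ['</s>', 'eng_Latn', '▁Hello', '▁world']
print(after)   # ['eng_Latn', '▁Hello', '▁world', '</s>']
```

In the first layout the source starts with `</s>` and has no end marker; in the second the language tag comes first and `</s>` closes the sentence.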

zszsz · September 19, 2023, 8:11am

Hi Vincent,
Thanks a lot for the great work. I need to finetune the model from EN to multiple languages, e.g. ZH, FR, DE and PT, in a specific domain. Can I list all the language pairs in the nllb-train.yaml file (see below)? Before the training, do I need to do any pre-processing on the data (such as applying the sentencepiece model to the corpus)?
  enzh:
    path_src: "/en-zh/train.en"
    path_tgt: "/en-zh/train.zh"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    weight: 10
    src_prefix: "eng_Latn"
    tgt_prefix: "zho_Hans"
    src_suffix: ""
    tgt_suffix: ""
  enfr:
    path_src: "/en-fr/train.en"
    path_tgt: "/en-fr/train.fr"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    weight: 10
    src_prefix: "eng_Latn"
    tgt_prefix: "fra_Latn"
    src_suffix: ""
    tgt_suffix: ""

mick · October 26, 2023, 7:47pm

Hello!
I’m having trouble finetuning NLLB-200 1.3B on multiple GPUs, and I’d like to ask whether I am missing something.

I get AssertionError: An error in model’s partition and checkpoint’s slice was detected when training with 4 GPUs.

If I run the finetuning with only a single GPU everything seems fine

my train.yaml is the following:

# Vocab creation options
share_vocab: true

## Where the vocab(s) is
src_vocab: "pretrained_model/dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 256206

tgt_vocab: "pretrained_model/dictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256206

vocab_size_multiple: 1

decoder_start_token: '</s>'


### Transform related opts:

#### Subword
src_subword_model: "pretrained_model/flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "pretrained_model/flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0


#Corpus opts:
data:

  corpus_enar:
    path_src: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/crp-biling.ar-en.en"
    path_tgt: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/crp-biling.ar-en.ar"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    src_prefix: "eng_Latn"
    tgt_prefix: "arb_Arab"
    weight: 1
    src_suffix: "</s>" 
    tgt_suffix: ""
  valid:
    path_src: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/en.dev" 
    path_tgt: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/ar.dev"
    transforms: [sentencepiece, prefix, suffix]
    src_prefix: "eng_Latn"
    tgt_prefix: "arb_Arab"




#### Filter
src_seq_length: 250
tgt_seq_length: 250


# General opts
update_vocab: true 

train_from: "pretrained_model/nllb-200-1.3B-onmt.pt"

reset_optim: all 
save_data: "nllb200-1.3_en_ar/data"
save_model: "nllb200-1.3_en_ar/nllb-200-en_ar"
log_file: "train.log"

keep_checkpoint: 50
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 2000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 4
gpu_ranks: [0,1,2,3]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 1e-4
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
add_qkvbias: true
add_ffnbias: true
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

I am using OpenNMT-py 3.4.0 and pytorch 2.0.1 with cuda 11.4.
Any suggestions?
Thanks in advance !

kitkhai · February 13, 2024, 10:05am

Hi @vince62s,

I saw the tutorial on multi-way training (training in multiple language directions).

Can this be applied to NLLB? What do I need to change in the files etc to do multi-way training for NLLB?

avibrantsoul · February 19, 2024, 10:15am

Hi,

Thanks for this tutorial. I need your help.

I downloaded the checkpoints and the spm model from the given links in this thread. I ran the same inference script given in this tutorial on English FLORES data. However, I get the same output (4096 characters) for all input sentences.

ho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hans.........zho_Hans

Following are the links to the spm model and the NLLB checkpoint.
spm: https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
checkpoint: https://s3.amazonaws.com/opennmt-models/nllb-200/nllb-200-1.3Bdst-onmt.pt

I tried using other nllb variants as well. But the output from every model is garbage. Similarly, thinking my data had some issue, I tried different files, but the outcome is the same.

opennmt-py version: 3.3.0

What could be the reason?

Edit: I upgraded opennmt-py to the latest version (3.4.3) and now this problem is not there anymore.

However, now each output starts with “??”

⁇ 们现在有4个月大的"以前患有糖尿病的非糖尿病小鼠",他补充说.

BLEU (computed using sacrebleu) on FLORES is 1.29.

sacrebleu zho_Hans.devtest -i flores_nllb_13bdst.enzh -m bleu -w 4

Is this normal behaviour?

vince62s · February 20, 2024, 11:28am

please share your yaml file.

WeiRiWa · August 6, 2024, 10:03pm

Hello, I am a newbie and I want to fine-tune the NLLB-200 model on a new language. I have seen the OpenNMT-py project, but I am not sure how to use it. Also, is the extended vocabulary separate from the OpenNMT project? Looking forward to your reply!

InitialState9 · April 3, 2025, 6:11am

Hi, thanks for sharing! Has anyone else run into the catastrophic forgetting problem?
As a first step, I fine-tuned en-zh with 1M sentences. The en-zh BLEU score improved a lot, but the other languages dropped a lot as well, and some of them output very strange words in other languages.

After that, I fine-tuned nllb-600M with multiple language pairs: en-zh, en-fr, en-ar …
The same problem still came up. Do you have any suggestions?
Thank you!

ArtanisTheOne · April 14, 2025, 1:55pm

Yeah, if you expect the model to retain its quality on certain pairs, you need to have them in your finetuning data. They don’t need to be present to the same extent as what you’re mainly training on, but keeping a small subset for the language pairs whose quality you want to preserve will help.

Unfortunately, improving one language pair usually comes at the cost of others unless you balance the data across all the ones you care about. Pairs which you don’t add data for will catastrophically forget.
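
As a rough illustration of such a mix, the Python sketch below builds a `data:` section with one dominant pair and two smaller "retention" pairs. It assumes the `weight` field controls the relative sampling rate of each corpus, as in the configs earlier in this thread; the paths, the 8:1:1 ratio and the `corpus_entry` helper are hypothetical and would need tuning on real data.

```python
def corpus_entry(src_lang, tgt_lang, path_src, path_tgt, weight):
    """One entry of the `data:` section, using the keys seen in this thread."""
    return {
        "path_src": path_src,
        "path_tgt": path_tgt,
        "transforms": ["sentencepiece", "prefix", "suffix", "filtertoolong"],
        "weight": weight,
        "src_prefix": src_lang,
        "tgt_prefix": tgt_lang,
        "src_suffix": "</s>",
        "tgt_suffix": "",
    }


# Hypothetical paths and weights: the main pair dominates, the others are
# sampled less often but stay in the mix.
data = {
    "corpus_enzh": corpus_entry("eng_Latn", "zho_Hans", "en-zh/train.en", "en-zh/train.zh", 8),
    "corpus_enfr": corpus_entry("eng_Latn", "fra_Latn", "en-fr/train.en", "en-fr/train.fr", 1),
    "corpus_enar": corpus_entry("eng_Latn", "arb_Arab", "en-ar/train.en", "en-ar/train.ar", 1),
}

# Expected share of sampled examples per corpus under the weight assumption.
total = sum(entry["weight"] for entry in data.values())
shares = {name: entry["weight"] / total for name, entry in data.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0%} of sampled examples")
```

Here en-zh would supply about 80% of the training examples and each of the other pairs about 10%.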

alexeir · April 23, 2025, 8:52am

We faced the same issues. That’s why we decided to build separate small models of 110 MB for each language pair, translating via English. By the way, we are happy with the results.
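
The routing idea behind translating via English can be sketched as follows. This is not the setup described above, only a minimal Python illustration: `pivot_translate` is a hypothetical helper, and the "models" are stand-in functions that tag the text so the order of the calls is visible.

```python
def pivot_translate(text, src_lang, tgt_lang, models, pivot="en"):
    """Translate via a pivot language using one model per direction.

    `models` maps (source, target) pairs to callables taking and returning text.
    """
    if src_lang == tgt_lang:
        return text
    if pivot in (src_lang, tgt_lang):
        # One side is already the pivot, so a single model is enough.
        return models[(src_lang, tgt_lang)](text)
    intermediate = models[(src_lang, pivot)](text)
    return models[(pivot, tgt_lang)](intermediate)


# Stand-in "models" that only tag the text, so the routing is visible.
models = {
    ("fr", "en"): lambda s: f"en({s})",
    ("en", "zh"): lambda s: f"zh({s})",
}

print(pivot_translate("bonjour", "fr", "zh", models))  # zh(en(bonjour))
```

Each pair-specific model only ever sees its own direction, so improving one pair cannot degrade another; the cost is two translation passes for non-English pairs.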

hui.li · June 20, 2025, 12:49pm

Hi @vince62s
Can I retrain an NLLB model without fine-tuning? Is this the same as what NLLB describes: add the language code before src, add </s> after it, and add the language code before tgt to train a multilingual model?