Finetuning and Curating NLLB-200 with OpenNMT-py

What checkpoint are you talking about?

I am using the nllb-200-600M-onmt.pt checkpoint from the s3 server.

then you need:
enc_layers: 12
dec_layers: 12
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 4096

Thank you so much! It worked! How may I retrieve such information about the model architecture if I have to finetune another model checkpoint in the future?
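One way to recover them is to load the checkpoint and print the training options stored inside it. A minimal sketch, assuming the checkpoint keeps those options under the 'opt' key, as OpenNMT-py checkpoints usually do:

python3 -c "import torch; print(torch.load('nllb-200-600M-onmt.pt', map_location='cpu')['opt'])"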

Now I get another error message

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ')

That seems to be because I was loading too much data at once on my Colab notebook, so I think I'll reduce the amount of data that I use (600K) to around 300K.

Hi again @vince62s

I saw that you used 341K lines, so I tried using 300K lines of training data and subsequently 10 lines of training data. However, I was still thrown the same error, even after reducing the batch size to 1. I am using Google Colab, which provides around 12GB of system RAM and 15GB of GPU RAM.

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ')

My Google search attributes the error message to running out of memory (my system RAM seems to be the problem, not the GPU RAM), but I'm not sure what else I can change. Currently my batching and optimisation configuration is as follows:

# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 1
valid_batch_size: 1
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]

# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

batch_size: 1 with batch_type: tokens means that you use batches of 1 token, which is nonsense.
Also, don't use sgd; use adam with a learning rate of 1e-4.
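Something along these lines (the 384-token batch size is just a placeholder to tune to your GPU memory):

# Batching
batch_type: "tokens"
batch_size: 384          # tokens per batch (placeholder, adjust to GPU memory)
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 1e-4
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
normalization: "tokens"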

Hi @vince62s

The NLLB-200 600M Transformer checkpoint that I downloaded from the OpenNMT-py models page gave me really weird results.

When I ran:

python3 ~/OpenNMT-py/translate.py --config nllb-inference.yaml -src /en-zh/testsets/newstest2019-enzh-src.en -output newstest2019-enzh-hyp.zh

My input English sentence:

My uncle saw that the eagle caught the chickens

My model output in Chinese (Simplified) was complete gibberish and repetitive:

现在,我知道这个问题是什么,我知道这个问题是什么,我认为这是什么.

I used the exact same inference YAML file and only changed the reference to the model checkpoint, hence I am really confused about what went wrong.


You must be using master and not the latest pip version. I will push a fix for this.

You can git pull and try again.
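That is, from the source checkout you are running, something like:

cd ~/OpenNMT-py
git pull
# reinstall only if you installed the package from this checkout:
# pip install -e .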


Hi

I realised that I don't quite understand what token batching is compared with the more conventional batching. Typically we have sequences of sentences batched together to form a batch. But for token batching, e.g. a batch size of 8 tokens, what is the length of the sequences of tokens that are batched together?

Also, why is token batching used/recommended? I don't quite understand, as I think that by defining a batch by tokens we could be splitting a sentence up, and hence the connections between words in a sentence may be broken, so the language model would not be able to learn optimally?

Token batching doesn't split up the sentences themselves, I believe. It just tries to find the number of tokens closest to the token batch size you set that is also a multiple of 8 (traditionally).

Token batching is 'better' in this case because it keeps the size of the batches more consistent. A batch size of 128 sentences would take 128 sentences regardless of their length, so the number of tokens in each batch can easily end up wildly different. Token batching fixes that.
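In config terms it looks roughly like this (values are illustrative; as noted above, the batch is traditionally kept a multiple of 8):

batch_type: "tokens"
batch_size: 4096           # approximate number of tokens per batch
batch_size_multiple: 8     # keep batches a multiple of 8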

Hi Vincent, thank you so much for the tutorial!
I'm trying to finetune NLLB-200 3.3B using LoRa; the training works, but when I try to translate some simple sentences I get “” for all the sentences.
These are my config files for training and inference:
training

# Vocab creation options
share_vocab: true

## Where the vocab(s) is
src_vocab: "dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 256206

tgt_vocab: "dictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256206

vocab_size_multiple: 1

decoder_start_token: '</s>'


### Transform related opts:

#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0


#Corpus opts:
data:

  corpus_enes:
    path_src: "en-es.en"
    path_tgt: "en-es.es"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    src_prefix: "</s> eng_Latn"
    tgt_prefix: "spa_Latn"
    weight: 10
    src_suffix: "" 
    tgt_suffix: ""



#### Filter
src_seq_length: 250
tgt_seq_length: 250


# General opts
update_vocab: true 

train_from: "nllb-200-3.3B-onmt.pt"

reset_optim: all 
save_data: "/nllb-200"
save_model: "trained_models_en_es/nllb-200-en_es"
log_file: "train.log"

keep_checkpoint: -1

save_checkpoint_steps: 5000

average_decay: 0.0005
seed: 1234
report_every: 1
train_steps: 100000 
valid_steps: 5000 

# Batching
bucket_size: 262144
num_workers: 1
prefetch_factor:  400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"

batch_size: 1024                             
valid_batch_size: 1024                        
batch_size_multiple: 2                       
accum_count: [2]

accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "fusedadam" 
learning_rate: 0.1  
warmup_steps: 30 
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
dropout: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

#LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 8
lora_dropout: 0.05
lora_alpha: 1
lora_embedding: false

# Model
override_opts: true

# For LoRa to work
add_ffnbias: true
add_qkvbias: true

encoder_type: transformer
decoder_type: transformer

enc_layers: 24          
dec_layers: 24          
transformer_ff: 8192    

hidden_size: 2048
word_vec_size: 2048

heads: 16
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

inference:

transforms: [sentencepiece, prefix, suffix]
# nllb-200 specific prefixing and suffixing
src_prefix: "</s> eng_Latn"
tgt_prefix: "spa_Latn" 
tgt_file_prefix: true
src_suffix: ""
tgt_suffix: ""
#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Model info
model: nllb-200-en_es_step_65000.pt
# Inference
max_length: 256
gpu: 0
batch_type: tokens

# for 3.3B model
batch_size: 1024

fp16: true
beam_size: 5
report_time: true
log_file: "translate.log"

Do you see anything wrong?

config needs to be like:

        src_prefix: "eng_Latn"
        tgt_prefix: "deu_Latn"
        src_suffix: "</s>"
        tgt_suffix: ""

When you start logging the ACC/PPL (don't wait 65000 steps), check that the ACC is already very high and the PPL low.
If you have a doubt, post the log of the first 2000 steps here.

here is the log of the first 2k steps:

[2023-09-18 11:48:58,798 INFO] Step 100/100000; acc: 76.9; ppl:  13.3; xent: 2.6; lr: 0.00022; sents:    5024; bsz:  810/ 886/25; 838/918 tok/s;    193 sec;
[2023-09-18 11:50:33,613 INFO] Step 200/100000; acc: 77.9; ppl:  12.4; xent: 2.5; lr: 0.00016; sents:    4200; bsz:  785/ 880/21; 1655/1857 tok/s;    288 sec;
[2023-09-18 11:52:07,205 INFO] Step 300/100000; acc: 78.1; ppl:  12.4; xent: 2.5; lr: 0.00013; sents:    4166; bsz:  790/ 881/21; 1688/1884 tok/s;    382 sec;
[2023-09-18 11:53:40,853 INFO] Step 400/100000; acc: 78.3; ppl:  12.4; xent: 2.5; lr: 0.00011; sents:    4618; bsz:  783/ 884/23; 1673/1889 tok/s;    475 sec;
[2023-09-18 11:55:14,938 INFO] Step 500/100000; acc: 80.1; ppl:  11.1; xent: 2.4; lr: 0.00010; sents:    4122; bsz:  787/ 884/21; 1673/1879 tok/s;    569 sec;
[2023-09-18 11:56:48,616 INFO] Step 600/100000; acc: 79.7; ppl:  11.4; xent: 2.4; lr: 0.00009; sents:    4806; bsz:  798/ 891/24; 1704/1902 tok/s;    663 sec;
[2023-09-18 11:58:22,676 INFO] Step 700/100000; acc: 78.8; ppl:  11.9; xent: 2.5; lr: 0.00008; sents:    4456; bsz:  794/ 891/22; 1688/1895 tok/s;    757 sec;
[2023-09-18 11:59:59,265 INFO] Step 800/100000; acc: 79.6; ppl:  11.4; xent: 2.4; lr: 0.00008; sents:    4418; bsz:  792/ 889/22; 1639/1840 tok/s;    854 sec;
[2023-09-18 12:01:32,534 INFO] Step 900/100000; acc: 80.8; ppl:  10.8; xent: 2.4; lr: 0.00007; sents:    4166; bsz:  802/ 898/21; 1719/1926 tok/s;    947 sec;
[2023-09-18 12:03:07,545 INFO] Step 1000/100000; acc: 78.3; ppl:  12.3; xent: 2.5; lr: 0.00007; sents:    4736; bsz:  782/ 882/24; 1647/1856 tok/s;   1042 sec;
[2023-09-18 12:04:40,928 INFO] Step 1100/100000; acc: 78.8; ppl:  12.0; xent: 2.5; lr: 0.00007; sents:    4530; bsz:  801/ 880/23; 1716/1885 tok/s;   1135 sec;
[2023-09-18 12:06:16,862 INFO] Step 1200/100000; acc: 79.8; ppl:  11.3; xent: 2.4; lr: 0.00006; sents:    4004; bsz:  785/ 885/20; 1636/1844 tok/s;   1231 sec;
[2023-09-18 12:07:53,405 INFO] Step 1300/100000; acc: 80.0; ppl:  11.3; xent: 2.4; lr: 0.00006; sents:    4614; bsz:  805/ 896/23; 1669/1855 tok/s;   1328 sec;
[2023-09-18 12:09:30,113 INFO] Step 1400/100000; acc: 78.7; ppl:  11.9; xent: 2.5; lr: 0.00006; sents:    4580; bsz:  785/ 884/23; 1623/1828 tok/s;   1424 sec;
[2023-09-18 12:11:04,188 INFO] Step 1500/100000; acc: 78.4; ppl:  12.2; xent: 2.5; lr: 0.00006; sents:    5078; bsz:  793/ 876/25; 1685/1862 tok/s;   1519 sec;
[2023-09-18 12:12:38,809 INFO] Step 1600/100000; acc: 78.9; ppl:  11.9; xent: 2.5; lr: 0.00006; sents:    4746; bsz:  789/ 882/24; 1668/1864 tok/s;   1613 sec;
[2023-09-18 12:14:13,428 INFO] Step 1700/100000; acc: 79.7; ppl:  11.5; xent: 2.4; lr: 0.00005; sents:    4852; bsz:  801/ 889/24; 1694/1879 tok/s;   1708 sec;
[2023-09-18 12:15:46,100 INFO] Step 1800/100000; acc: 79.7; ppl:  11.5; xent: 2.4; lr: 0.00005; sents:    5330; bsz:  806/ 883/27; 1739/1906 tok/s;   1800 sec;
[2023-09-18 12:17:19,273 INFO] Step 1900/100000; acc: 79.4; ppl:  11.6; xent: 2.5; lr: 0.00005; sents:    4762; bsz:  799/ 884/24; 1714/1897 tok/s;   1894 sec;
[2023-09-18 12:18:53,349 INFO] Step 2000/100000; acc: 78.1; ppl:  12.3; xent: 2.5; lr: 0.00005; sents:    4376; bsz:  777/ 871/22; 1652/1852 tok/s;   1988 sec;

Looks great.
Then do you merge the saved checkpoint with the original? Does it work fine?
Run inference on the merged checkpoint.
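The merge itself can be done with the lora_weights.py tool in OpenNMT-py; a sketch (the output file name is hypothetical and the flag names may differ between versions, so check tools/lora_weights.py --help):

python3 OpenNMT-py/tools/lora_weights.py --action merge \
    --base_model nllb-200-3.3B-onmt.pt \
    --lora_weights trained_models_en_es/nllb-200-en_es_step_65000.pt \
    --output nllb-200-en_es_merged.pt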

OK, now it's working. I think it was not working because of the prefix/suffix format. Thank you so much, Vincent!!


Hi Vincent,
Thanks a lot for the great work. I need to finetune the model from EN to multiple languages (e.g. ZH, FR, DE and PT) in a specific domain. Can I list all the language pairs in the nllb-train.yaml file (see below)? Before training, do I need to do any pre-processing of the data, such as applying the SentencePiece model to the corpus?
enzh:
  path_src: "/en-zh/train.en"
  path_tgt: "/en-zh/train.zh"
  transforms: [sentencepiece, prefix, suffix, filtertoolong]
  weight: 10
  src_prefix: "eng_Latn"
  tgt_prefix: "zho_Hans"
  src_suffix: ""
  tgt_suffix: ""
enfr:
  path_src: "/en-fr/train.en"
  path_tgt: "/en-fr/train.fr"
  transforms: [sentencepiece, prefix, suffix, filtertoolong]
  weight: 10
  src_prefix: "eng_Latn"
  tgt_prefix: "fra_Latn"
  src_suffix: ""
  tgt_suffix: ""

Hello!
I'm having trouble finetuning NLLB-200 1.3B on multiple GPUs, and I'd like to ask whether I am missing something.

I get "AssertionError: An error in model's partition and checkpoint's slice was detected" when training with 4 GPUs.

If I run the finetuning with only a single GPU, everything seems fine.

my train.yaml is the following:

# Vocab creation options
share_vocab: true

## Where the vocab(s) is
src_vocab: "pretrained_model/dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 256206

tgt_vocab: "pretrained_model/dictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256206

vocab_size_multiple: 1

decoder_start_token: '</s>'


### Transform related opts:

#### Subword
src_subword_model: "pretrained_model/flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "pretrained_model/flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0


#Corpus opts:
data:

  corpus_enar:
    path_src: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/crp-biling.ar-en.en"
    path_tgt: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/crp-biling.ar-en.ar"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    src_prefix: "eng_Latn"
    tgt_prefix: "arb_Arab"
    weight: 1
    src_suffix: "</s>" 
    tgt_suffix: ""
  valid:
    path_src: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/en.dev" 
    path_tgt: "/home/m.resta/eurovox_train/train_corpora/multi4-v2/finetune/ar.dev"
    transforms: [sentencepiece, prefix, suffix]
    src_prefix: "eng_Latn"
    tgt_prefix: "arb_Arab"




#### Filter
src_seq_length: 250
tgt_seq_length: 250


# General opts
update_vocab: true 

train_from: "pretrained_model/nllb-200-1.3B-onmt.pt"

reset_optim: all 
save_data: "nllb200-1.3_en_ar/data"
save_model: "nllb200-1.3_en_ar/nllb-200-en_ar"
log_file: "train.log"

keep_checkpoint: 50
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 2000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 4
gpu_ranks: [0,1,2,3]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "adam"
learning_rate: 1e-4
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
add_qkvbias: true
add_ffnbias: true
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

I am using OpenNMT-py 3.4.0 and PyTorch 2.0.1 with CUDA 11.4.
Any suggestions?
Thanks in advance!

Hi @vince62s,

I saw the tutorial on multi-way training (training in multiple language directions).

Can this be applied to NLLB? What do I need to change in the files, etc., to do multi-way training for NLLB?
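(For what it's worth, the corpus-block pattern shown earlier in this thread extends directly to multiple directions: list one corpus per target language under data:, each with its own tgt_prefix. A sketch with hypothetical paths:)

data:
  corpus_enfr:
    path_src: "en-fr/train.en"
    path_tgt: "en-fr/train.fr"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    src_prefix: "eng_Latn"
    tgt_prefix: "fra_Latn"
    src_suffix: "</s>"
    tgt_suffix: ""
    weight: 1
  corpus_ende:
    path_src: "en-de/train.en"
    path_tgt: "en-de/train.de"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    src_prefix: "eng_Latn"
    tgt_prefix: "deu_Latn"
    src_suffix: "</s>"
    tgt_suffix: ""
    weight: 1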

Hi,

Thanks for this tutorial. I need your help.

I downloaded the checkpoints and the spm model from the given links in this thread. I ran the same inference script given in this tutorial on English FLORES data. However, I get the same output (4096 characters) for all input sentences.

ho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hanszho_Hans.........zho_Hans

Following are the links to the spm model and the NLLB checkpoint.
spm: https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
checkpoint: https://s3.amazonaws.com/opennmt-models/nllb-200/nllb-200-1.3Bdst-onmt.pt

I tried other NLLB variants as well, but the output from every model is garbage. Similarly, thinking my data had some issue, I tried different files, but the outcome was the same.

OpenNMT-py version: 3.3.0

What could be the reason?

Edit: I upgraded OpenNMT-py to the latest version (3.4.3), and this problem is not there anymore.

However, now each output starts with "⁇":

⁇ 们现在有4个月大的"以前患有糖尿病的非糖尿病小鼠",他补充说.

BLEU (computed using sacrebleu) on FLORES is 1.29.

sacrebleu zho_Hans.devtest -i flores_nllb_13bdst.enzh -m bleu -w 4

Is this normal behaviour?

Please share your YAML file.