Finetuning and Curating NLLB-200 with OpenNMT-py

As mentioned in other posts, NLLB-200 is great for language coverage: close to SOTA for some pairs but quite poor for some others.

One very specific issue is that the dictionary is incomplete for Chinese (Han characters): at least 26 very common characters are missing.

In this tutorial, we will explain how to fine-tune the model and even update its vocabulary.

First, let’s look at the issue.

As a reminder, we need a specific config file to run inference in OpenNMT-py. Let’s name this config file nllb-inference.yaml.

transforms: [sentencepiece, prefix, suffix]
# nllb-200 specific prefixing and suffixing
src_prefix: "eng_Latn"
tgt_prefix: "zho_Hans"
tgt_file_prefix: true
src_suffix: "</s>"
tgt_suffix: ""

#### Subword
src_subword_model: "/nllb-200/flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "/nllb-200/flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Model info
model: "/nllb-200/nllb-200-1.3Bdst-onmt.pt"
# Inference
max_length: 512
gpu: 0
batch_type: tokens
batch_size: 2048
fp16:
beam_size: 5
report_time: true

So one prerequisite is that you download the SentencePiece model and the converted checkpoint from our S3 server.

Then you can run:

python3 ~/OpenNMT-py/translate.py --config nllb-inference.yaml -src /en-zh/testsets/newstest2019-enzh-src.en -output newstest2019-enzh-hyp.zh

Next we score:

sacrebleu /en-zh/testsets/newstest2019-enzh-ref.zh -m bleu -l en-zh -i newstest2019-enzh-hyp.zh

BLEU: 23

This is quite poor, for several reasons:

  1. As said before, some characters are missing from the vocabulary and the SP model
  2. We used the 1.3B distilled model, which is not as good as the 3.3B or 54B models

As a reference, SOTA is 42, and online tools were (at WMT19 time) in the range of 30-32.

Let’s curate this!

Step 1: We need to adapt the dictionary and the SentencePiece model.
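
Before touching anything, it helps to confirm which characters are actually missing. Here is a minimal sketch that scans the target side of the test set and collects every character the current SentencePiece model can only encode through the unknown id (paths are the ones used above; adjust to your setup):

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load('/nllb-200/flores200_sacrebleu_tokenizer_spm.model')

missing = set()
with open('/en-zh/testsets/newstest2019-enzh-ref.zh', 'r', encoding='utf-8') as f:
    for line in f:
        for ch in set(line.strip()):
            if ch.isspace():
                continue
            # a character is "missing" if its encoding falls back to the unknown id
            if sp.unk_id() in sp.encode_as_ids(ch):
                missing.add(ch)

print(len(missing), "characters missing:", "".join(sorted(missing)))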

When training or fine-tuning we need a vocab file; in the case of NLLB-200 we adapted the dictionary.txt file available on our S3 server. We added the first 4 tokens, which are in a different order compared to the OpenNMT-py default. The beginning of the file looks like:

<s> 1
<blank> 1
</s> 1
<unk> 1
an 1
▁n 1
▁m 1

and the end of the file looks like:

ydd_Hebr 1
yor_Latn 1
yue_Hant 1
zho_Hans 1
zho_Hant 1
zul_Latn 1
<pad1> 1
<pad2> 1
<pad3> 1

There are 256206 lines in total.

So now we need to add the 26 missing Chinese characters: we simply modify the end of the vocab file as follows (a small script to automate this is sketched after the excerpt):

ydd_Hebr 1
yor_Latn 1
yue_Hant 1
zho_Hans 1
zho_Hant 1
zul_Latn 1
饱 1
畅 1
湍 1
滩 1
岭 1
舱 1
诩 1
阔 1
荫 1
鸽 1
勋 1
鸡 1
鹰 1
裙 1
艳 1
哦 1
毋庸 1
稻 1
蔗 1
熔 1
亥 1
裤 1
氢 1
《 1
》 1
… 1
<pad1> 1
<pad2> 1
<pad3> 1
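
If you prefer to script this edit instead of doing it by hand, here is a minimal sketch that inserts the new tokens just before the <pad1> entry and writes the result to dictionary2.txt (the new_tokens list below is truncated; paths are examples):

new_tokens = ["饱", "畅", "湍", "滩", "岭"]  # ...extend with the rest of the missing tokens listed above

with open('/nllb-200/dictionary.txt', 'r', encoding='utf-8') as f:
    lines = f.readlines()

# locate the padding tokens at the end of the file
pad_index = next(i for i, l in enumerate(lines) if l.startswith('<pad1>'))

existing = {l.split()[0] for l in lines if l.strip()}
additions = [tok + " 1\n" for tok in new_tokens if tok not in existing]

with open('/nllb-200/dictionary2.txt', 'w', encoding='utf-8') as f:
    f.writelines(lines[:pad_index] + additions + lines[pad_index:])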

But one big issue is that the SentencePiece model does NOT contain those characters, and it is not very straightforward to modify a SentencePiece model in place without retraining.

Here is the magic:

from tqdm import tqdm
import sentencepiece_model_pb2 as model


tok_exclusion = ['<s>', '<blank>', '</s>', '<unk>', 'ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 'afr_Latn', 'ajp_Arab', 'aka_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'ars_Arab', 'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'pes_Arab', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 'fuv_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab', 'kas_Deva', 'kat_Geor', 'knc_Arab', 'knc_Latn', 'kaz_Cyrl', 'kbp_Latn', 'kea_Latn', 'khm_Khmr', 'kik_Latn', 'kin_Latn', 'kir_Cyrl', 'kmb_Latn', 'kon_Latn', 'kor_Hang', 'kmr_Latn', 'lao_Laoo', 'lvs_Latn', 'lij_Latn', 'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn', 'lus_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Latn', 'mkd_Cyrl', 'plt_Latn', 'mlt_Latn', 'mni_Beng', 'khk_Cyrl', 'mos_Latn', 'mri_Latn', 'zsm_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'gaz_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru', 'pap_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'pbt_Arab', 'quy_Latn', 'ron_Latn', 'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Beng', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn', 'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'als_Latn', 'srd_Latn', 'srp_Cyrl', 'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'tat_Cyrl', 'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'taq_Latn', 'taq_Tfng', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn', 'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn', 'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans', 'zho_Hant', 'zul_Latn', '<pad1>', '<pad2>', '<pad3>', '<inv>']


newdict2 = []
with open('/nllb-200/dictionary2.txt', 'r', encoding='utf-8') as f:
    for line in f:
        token = line.strip().split()[0]
        newdict2.append(token)


serializedStr=open('/nllb-200/flores200_sacrebleu_tokenizer_spm.model', 'rb').read()
m=model.ModelProto()
m.ParseFromString(serializedStr)
curdict = []
for i in tqdm(range(len(m.pieces) - 1, 2, -1)):
    curdict.append(m.pieces[i].piece)
    if m.pieces[i].piece not in newdict2:
        hex_string = "".join("{:02x}".format(ord(c)) for c in m.pieces[i].piece)
        print("Removing: ", hex_string, " from spm model, not in dict. Index: ", i)
        m.pieces.pop(i)

for tok in tqdm(newdict2):
    if (tok not in curdict) and (tok not in tok_exclusion):
        print("Adding: ", tok, " to spm model")
        newtoken = m.SentencePiece()
        newtoken.piece = tok
        newtoken.score = 0
        m.pieces.append(newtoken)
        
print(len(m.pieces))
        
with open('/nllb-200/flores200_sacrebleu_tokenizer_spm2.model', 'wb') as f:
    f.write(m.SerializeToString())

Without going into too much detail: the first tqdm loop removes from the SPM model any token that is not in the dictionary.txt file (this step is not strictly necessary, but it was a sanity check), and the second tqdm loop adds to the SPM model the tokens that are in the dictionary.txt file, skipping the language tokens and special tokens, which we do not want in the SPM model.
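
A quick sanity check on the rebuilt model: the previously missing characters should now get their own piece instead of falling back to <unk>, and the piece count should have grown accordingly. A minimal sketch, using the paths above:

import sentencepiece as spm

sp_old = spm.SentencePieceProcessor()
sp_old.load('/nllb-200/flores200_sacrebleu_tokenizer_spm.model')
sp_new = spm.SentencePieceProcessor()
sp_new.load('/nllb-200/flores200_sacrebleu_tokenizer_spm2.model')

for ch in ["饱", "畅", "《", "》"]:
    # before: the character falls back to <unk>; after: it gets its own piece
    print(ch, "old:", sp_old.encode_as_pieces(ch), "new:", sp_new.encode_as_pieces(ch))

print("old pieces:", sp_old.get_piece_size(), "new pieces:", sp_new.get_piece_size())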

Now let’s fine-tune!

To fine-tune NLLB-200, we need a YAML config file that requires these sections:

share_vocab: true
src_vocab: "/nllb-200/dictionary2.txt"
src_words_min_frequency: 1
src_vocab_size: 256232
tgt_vocab: "/nllb-200/dictionary2.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256232
vocab_size_multiple: 1
decoder_start_token: '</s>'
#### Subword
src_subword_model: "/nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "/nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    cc-matrix-enzh:
        path_src: "/en-zh/cc-matrix-enzh-0to30M.en"
        path_tgt: "/en-zh/cc-matrix-enzh-0to30M.zh"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "</s> eng_Latn"
        tgt_prefix: "zho_Hans"
        src_suffix: ""
        tgt_suffix: ""
update_vocab: true
train_from: "/nllb-200/nllb-200-1.3Bdst-onmt.pt"
reset_optim: all
save_data: "/nllb-200"
save_model: "/nllb-200/nllb-200-1.3B-onmt"
log_file: "/nllb-200/nllb-200-1.3B-onmt.log"
keep_checkpoint: 50
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 2000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
add_qkvbias: true
add_ffnbias: true
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

Add as many datasets as you want (I used cc-matrix, paracrawl and news-commentary for this test),
then run:

python3 train.py --config /nllb-200/nllb-train.yaml

If your training accuracy / ppl is off even for the first steps, then something is wrong with your config.
We use SGD because, on an RTX 4090 (24GB), Adam would not fit with this 1.3B model.
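
As a rough back-of-the-envelope (assuming fp16 weights and gradients plus fp32 Adam moments, which is only an approximation of what the trainer actually allocates), you can see why Adam is much heavier than SGD for 1.3B parameters:

params = 1.3e9
weights_fp16 = 2 * params            # ~2.6 GB
grads_fp16 = 2 * params              # ~2.6 GB
adam_moments_fp32 = 2 * 4 * params   # exp_avg + exp_avg_sq, ~10.4 GB
print("Adam: ~%.1f GB before activations and buffers"
      % ((weights_fp16 + grads_fp16 + adam_moments_fp32) / 1e9))
print("SGD:  ~%.1f GB before activations and buffers"
      % ((weights_fp16 + grads_fp16) / 1e9))
# add activations, gradient accumulation and a possible fp32 master copy of the
# weights on top, and Adam no longer fits in 24GB while SGD still does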

Then we score after 2000 steps:

 sacrebleu /en-zh/testsets/newstest2019-enzh-ref.zh -m bleu -l en-zh -i newstest2019-enzh-hyp.zh
{
 "name": "BLEU",
 "score": 29.2,
 "signature": "nrefs:1|case:mixed|eff:no|tok:zh|smooth:exp|version:2.0.0",
 "verbose_score": "63.8/40.2/26.0/17.6 (BP = 0.886 ratio = 0.892 hyp_len = 71982 ref_len = 80666)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "zh",
 "smooth": "exp",
 "version": "2.0.0"
}

Not bad !

I did the same with EN-DE and was able to improve the BLEU score on newstest2019 from 41.3 to 42.7, for a SOTA of about 45. I checked that the model did not lose performance elsewhere (it actually gained on the EN-FR test set…).
Bear in mind that with a 24GB RTX card you cannot fit the 3.3B model until we implement some kind of trick like LoRA or FSDP.

This tutorial can be used to fine-tune any language pair, add a new language, or curate missing characters.

Enjoy!


Well done, it’s a very good tutorial. It is now working as I expected. I have had some results with Fairseq when fine-tuning the smallest NLLB model (600M).

The first 2 images are a direction test, to check whether performance was affected for some random language pairs (I used the flores200 devtest set). Here, I only included some fine-tuned models:

The next image is the in-domain test. For fine-tuning I used 2 private datasets: the first is 300K segments of medical data, the second is 50K segments of marketing data.

As you can see, the model improves a lot if you use the correct learning rate. And general performance is barely affected.

After some days, I have not been able to replicate those high BLEU jumps in this framework:

For ES → EN the starting point seems to be worse than with Fairseq, and I do not know why. I did not get the very good results that I got with Fairseq, but it still improves a lot from the starting point. I will keep trying to get something similar to what I got with Fairseq.

I am not sure I understand exactly what you are comparing.

Are you comparing the 3.3B CT2 converted model vs the 600M finetuned with OpenNMT-py ?

EDIT: OK, I think I see. Are you saying Fairseq 600M is 46.2 vs OpenNMT-py 600M 44.6?
Can you show me the inference.yaml file you used?

Yes, exactly.

I am using the same config that you suggest:

batch_size: 2048
batch_type: tokens
beam_size: 5
fp16: null
gpu: 0
log_file: translate.log
max_length: 512
model: nllb-200-600M-onmt.pt
report_time: true
src_prefix: </s> spa_Latn
src_subword_alpha: 0.0
src_subword_model: flores200_sacrebleu_tokenizer_spm.model
src_subword_nbest: 1
src_suffix: ''
tgt_file_prefix: true
tgt_prefix: eng_Latn
tgt_subword_alpha: 0.0
tgt_subword_model: flores200_sacrebleu_tokenizer_spm.model
tgt_subword_nbest: 1
tgt_suffix: ''
transforms:
- sentencepiece
- prefix
- suffix

As you can see in the previous images, I get almost the same results in EN → ES with Fairseq and OpenNMT. It is in ES → EN that I see a small drop. I also experienced a drop of 1 BLEU in the EN → FR marketing domain (not in the shared images). In the 7-direction test that I did, I got the same results with both Fairseq and OpenNMT (also not in the shared images), so it seems OK to me.

Can you also post the command line you used for Fairseq ? (same beam size ?)

Did you try to compare the two outputs? Are they very different, or is maybe one line completely off in the OpenNMT output?
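
One quick way to find such a line is to score each sentence separately with sacrebleu's Python API and sort by the gap between the two systems (a rough sketch; the file names are placeholders):

from sacrebleu.metrics import BLEU

bleu = BLEU(effective_order=True)

refs = [l.strip() for l in open('ref.txt', encoding='utf-8')]
hyp_fairseq = [l.strip() for l in open('hyp_fairseq.txt', encoding='utf-8')]
hyp_onmt = [l.strip() for l in open('hyp_onmt.txt', encoding='utf-8')]

# rank lines by how much worse the onmt hypothesis scores than the fairseq one
gaps = []
for i, (ref, a, b) in enumerate(zip(refs, hyp_fairseq, hyp_onmt)):
    gap = bleu.sentence_score(a, [ref]).score - bleu.sentence_score(b, [ref]).score
    gaps.append((gap, i))

for gap, i in sorted(gaps, reverse=True)[:10]:
    print("line %d: fairseq - onmt = %.1f" % (i, gap))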

I did not check the outputs, but yes, I was using different decoding parameters (beam size of 4 in Fairseq against 5 in OpenNMT).

bash $root/preprocess/normalize_punctuation.sh $slang < /dev/stdin | \
        spm_encode --model $root/preprocess/flores200_sacrebleu_tokenizer_spm.model | \
        fairseq-interactive $root --input - -s $slang -t $tlang \
            --path $ckp --batch-size 1024 --max-tokens 8192 --buffer-size 100000 \
            --beam 4 --lenpen 1.0 \
            --fp16 \
            --fixed-dictionary $root/dictionary.txt \
            --task translation_multi_simple_epoch \
            --decoder-langtok --encoder-langtok src \
            --langs $(cat $root/langs.txt) \
            --lang-pairs $slang-$tlang \
            --add-data-source-prefix-tags 2>&1 

Martin,

I realized there is another change compared to Fairseq.

You may try to use:

src_prefix: 'spa_Latn'
src_suffix: ''

tgt_prefix: 'eng_Latn'

instead of putting both tokens in the source prefix.

Let me know if it makes scores closer.

I don’t know how to do this in the nllb branch of Fairseq. I couldn’t find documentation for that branch and I don’t know the Fairseq code in depth. I am not sure if this will work at all. (Changing the input to the model will probably make it give random results).

I have changed the beam size to 5 and the batch size and max tokens to match the OpenNMT configuration. I got 46 BLEU, which is similar to the previous setup (46.2) and still about 2 BLEU higher than OpenNMT.

It might be the seed or something related to randomness. I only get this drop in 1 direction and dataset out of the 10 that I tried.

I was suggesting changing settings in the OpenNMT-py inference to match what is done natively in Fairseq. But it should not change much.

Can anyone advise on how I can download this dataset? Specifically the one used in this tutorial.

The models are here: OpenNMT-py models - OpenNMT
If you’re looking for cc-matrix, it’s here: Index of /cc-matrix


Nice Tutorial, thank you so much!

I have questions about adding a new language. I have data in a language not covered by NLLB-200 (Monegasque), and the objective is to fine-tune NLLB-200 to create a French-Monegasque translator. I have about 5000 parallel French/Monegasque sentences. I also have a full French-Monegasque dictionary of words and their translations, as parallel data.

As Monegasque uses the Latin alphabet, there is no need to add characters to the SPM dictionary.

I wonder if the thing to do is to add a “mco_Latn” language token to dictionary.txt (as in Step 1: “We need to adapt the dictionary and the SentencePiece model.”)

As a second step, is raw data sufficient, or should it be preprocessed in one way or another (using fairseq, or stopes as described here)?

For the third step, does the fine-tuning follow the same procedure in this case (with the right training datasets)?

Thanks for your precious help !

Hi,
Procedure is the same as adding a new vocab token.

The thing is that it may be very difficult to finetune with only 5K sentences.

Give it a try.

It also depends on what you are trying to achieve.

Monegasque to French or French to Monegasque ?

I am asking because you can do a first pass and then perform some back-translation of monolingual data to add more data.

I have tried to do the same fine-tuning for NLLB-3.3B, but it seems like it’s just not working.
I also used the same dataset for fine-tuning and got the same 29.2 BLEU after 4000 steps. But in the test output newstest2019-enzh-hyp.zh I didn’t find any of the new characters. Also, it looks like more characters are missing, such as ” and “. Instead of the missing characters, I see many ⁇ characters in the test translation. It seems like the increase in BLEU score is just due to a better dataset, not because of the added missing characters.


Please post your config file and also your training log.

config file:

share_vocab: true
src_vocab: "dictionary2.txt"
src_words_min_frequency: 1
src_vocab_size: 256232
tgt_vocab: "dictionary2.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256232
vocab_size_multiple: 1
decoder_start_token: '</s>'
#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    ccmatrix-enzh:
        path_src: "en-zh/filtered.en"
        path_tgt: "en-zh/filtered.zh"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "</s> eng_Latn"
        tgt_prefix: "zho_Hans"
        src_suffix: ""
        tgt_suffix: ""
update_vocab: true
train_from: "nllb-200-3.3B-onmt.pt"
reset_optim: all
save_data: "nllb-200"
save_model: "nllb-200/nllb-200-3.3B-onmt.pt"
log_file: "nllb-200/nllb-200-3.3B-onmt.log"
keep_checkpoint: 2
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 4000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 512
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 2048
word_vec_size: 2048
transformer_ff: 8192
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

Logs:

[2023-04-29 17:30:26,438 INFO] encoder: 1733167104
[2023-04-29 17:30:26,438 INFO] decoder: 1611413735
[2023-04-29 17:30:26,438 INFO] * number of parameters: 3344580839
[2023-04-29 17:30:26,438 INFO]  * src vocab size = 256231
[2023-04-29 17:30:26,438 INFO]  * tgt vocab size = 256231
[2023-04-29 17:30:26,445 INFO] Get prefix for ccmatrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'zho_Hans'}
[2023-04-29 17:30:26,445 INFO] Get prefix for src infer: 
[2023-04-29 17:30:26,446 INFO] Get prefix for tgt infer: 
[2023-04-29 17:30:26,524 INFO] Get suffix for ccmatrix-enzh: {'src': '', 'tgt': ''}
[2023-04-29 17:30:26,525 INFO] Get suffix for src infer: 
[2023-04-29 17:30:26,525 INFO] Get suffix for tgt infer: 
[2023-04-29 17:30:26,525 INFO] Get prefix for ccmatrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'zho_Hans'}
[2023-04-29 17:30:26,525 INFO] Get prefix for src infer: 
[2023-04-29 17:30:26,525 INFO] Get prefix for tgt infer: 
[2023-04-29 17:30:26,590 INFO] Get suffix for ccmatrix-enzh: {'src': '', 'tgt': ''}
[2023-04-29 17:30:26,590 INFO] Get suffix for src infer: 
[2023-04-29 17:30:26,590 INFO] Get suffix for tgt infer: 
[2023-04-29 17:30:26,628 INFO] Starting training on GPU: [0]
[2023-04-29 17:30:26,628 INFO] Start training loop without validation...
[2023-04-29 17:30:26,628 INFO] Scoring with: TransformPipe()
[2023-04-29 17:32:37,397 INFO] Step 10/ 4000; acc: 51.2; ppl:  82.5; xent: 4.4; lr: 0.00729; sents:    7188; bsz:  454/ 464/22; 1110/1136 tok/s;    131 sec;
[2023-04-29 17:33:50,157 INFO] Step 20/ 4000; acc: 57.9; ppl:  48.3; xent: 3.9; lr: 0.01392; sents:    6560; bsz:  453/ 463/20; 1993/2038 tok/s;    204 sec;
[2023-04-29 17:35:03,762 INFO] Step 30/ 4000; acc: 61.2; ppl:  36.4; xent: 3.6; lr: 0.02055; sents:    6443; bsz:  450/ 462/20; 1957/2007 tok/s;    277 sec;
[2023-04-29 17:36:17,691 INFO] Step 40/ 4000; acc: 66.7; ppl:  27.4; xent: 3.3; lr: 0.02718; sents:    6274; bsz:  455/ 465/20; 1968/2011 tok/s;    351 sec;
[2023-04-29 17:37:31,037 INFO] Step 50/ 4000; acc: 68.0; ppl:  24.5; xent: 3.2; lr: 0.03381; sents:    6395; bsz:  454/ 464/20; 1981/2025 tok/s;    424 sec;
[2023-04-29 17:38:44,817 INFO] Step 60/ 4000; acc: 67.9; ppl:  23.1; xent: 3.1; lr: 0.04044; sents:    6228; bsz:  443/ 461/19; 1922/2000 tok/s;    498 sec;
[2023-04-29 17:39:58,430 INFO] Step 70/ 4000; acc: 69.4; ppl:  20.4; xent: 3.0; lr: 0.04707; sents:    5959; bsz:  453/ 464/19; 1970/2017 tok/s;    572 sec;
[2023-04-29 17:41:12,605 INFO] Step 80/ 4000; acc: 69.6; ppl:  19.2; xent: 3.0; lr: 0.05370; sents:    6307; bsz:  452/ 464/20; 1952/2000 tok/s;    646 sec;
[2023-04-29 17:42:26,114 INFO] Step 90/ 4000; acc: 69.6; ppl:  18.2; xent: 2.9; lr: 0.06033; sents:    6740; bsz:  453/ 463/21; 1970/2017 tok/s;    719 sec;
[2023-04-29 17:43:38,550 INFO] Step 100/ 4000; acc: 70.1; ppl:  16.8; xent: 2.8; lr: 0.06596; sents:    6632; bsz:  453/ 464/21; 2002/2050 tok/s;    792 sec;
[2023-04-29 17:43:38,550 INFO] Train perplexity: 27.7445
[2023-04-29 17:43:38,550 INFO] Train accuracy: 65.1595
[2023-04-29 17:43:38,551 INFO] Sentences processed: 64726
[2023-04-29 17:43:38,551 INFO] Average bsz:  452/ 463/20
[2023-04-29 17:43:38,678 INFO] Saving checkpoint nllb-200/nllb-200-3.3B-onmt.pt_step_100.pt

...............................................................

[2023-04-30 01:20:03,059 INFO] Train perplexity: 13.5463
[2023-04-30 01:20:03,059 INFO] Train accuracy: 74.7917
[2023-04-30 01:20:03,060 INFO] Sentences processed: 2.31756e+06
[2023-04-30 01:20:03,060 INFO] Average bsz:  453/ 463/19
[2023-04-30 01:20:03,175 INFO] Saving checkpoint nllb-200/nllb-200-3.3B-onmt.pt_step_3800.pt
[2023-04-30 01:21:32,005 INFO] Step 3810/ 4000; acc: 74.0; ppl:  13.8; xent: 2.6; lr: 0.01074; sents:    5324; bsz:  454/ 460/17; 1632/1657 tok/s;  28265 sec;
[2023-04-30 01:22:43,397 INFO] Step 3820/ 4000; acc: 74.1; ppl:  13.9; xent: 2.6; lr: 0.01072; sents:    5468; bsz:  454/ 462/17; 2036/2072 tok/s;  28337 sec;
[2023-04-30 01:23:56,030 INFO] Step 3830/ 4000; acc: 74.1; ppl:  13.9; xent: 2.6; lr: 0.01071; sents:    5553; bsz:  452/ 462/17; 1993/2036 tok/s;  28409 sec;
[2023-04-30 01:25:07,607 INFO] Step 3840/ 4000; acc: 74.3; ppl:  13.7; xent: 2.6; lr: 0.01070; sents:    5364; bsz:  451/ 461/17; 2019/2060 tok/s;  28481 sec;
[2023-04-30 01:26:19,346 INFO] Step 3850/ 4000; acc: 74.3; ppl:  13.6; xent: 2.6; lr: 0.01068; sents:    5552; bsz:  454/ 463/17; 2026/2064 tok/s;  28553 sec;
[2023-04-30 01:27:31,117 INFO] Step 3860/ 4000; acc: 74.3; ppl:  13.7; xent: 2.6; lr: 0.01067; sents:    5909; bsz:  455/ 462/18; 2026/2060 tok/s;  28624 sec;
[2023-04-30 01:28:42,762 INFO] Step 3870/ 4000; acc: 74.3; ppl:  13.8; xent: 2.6; lr: 0.01065; sents:    5816; bsz:  453/ 461/18; 2025/2060 tok/s;  28696 sec;
[2023-04-30 01:29:54,056 INFO] Step 3880/ 4000; acc: 74.0; ppl:  13.8; xent: 2.6; lr: 0.01064; sents:    5442; bsz:  454/ 462/17; 2036/2072 tok/s;  28767 sec;
[2023-04-30 01:31:05,780 INFO] Step 3890/ 4000; acc: 74.2; ppl:  13.8; xent: 2.6; lr: 0.01063; sents:    5990; bsz:  455/ 464/19; 2028/2068 tok/s;  28839 sec;
[2023-04-30 01:32:18,066 INFO] Step 3900/ 4000; acc: 74.2; ppl:  13.7; xent: 2.6; lr: 0.01061; sents:    5238; bsz:  455/ 462/16; 2015/2046 tok/s;  28911 sec;
[2023-04-30 01:32:18,067 INFO] Train perplexity: 13.5518
[2023-04-30 01:32:18,067 INFO] Train accuracy: 74.7758
[2023-04-30 01:32:18,067 INFO] Sentences processed: 2.37322e+06
[2023-04-30 01:32:18,067 INFO] Average bsz:  453/ 463/19
[2023-04-30 01:32:18,178 INFO] Saving checkpoint nllb-200/nllb-200-3.3B-onmt.pt_step_3900.pt
[2023-04-30 01:33:47,324 INFO] Step 3910/ 4000; acc: 74.4; ppl:  13.6; xent: 2.6; lr: 0.01060; sents:    5329; bsz:  453/ 461/17; 1625/1652 tok/s;  29001 sec;
[2023-04-30 01:34:59,774 INFO] Step 3920/ 4000; acc: 74.1; ppl:  13.8; xent: 2.6; lr: 0.01059; sents:    5439; bsz:  453/ 461/17; 2001/2038 tok/s;  29073 sec;
[2023-04-30 01:36:12,627 INFO] Step 3930/ 4000; acc: 74.0; ppl:  13.8; xent: 2.6; lr: 0.01057; sents:    5723; bsz:  452/ 462/18; 1986/2028 tok/s;  29146 sec;
[2023-04-30 01:37:25,952 INFO] Step 3940/ 4000; acc: 74.0; ppl:  14.0; xent: 2.6; lr: 0.01056; sents:    5668; bsz:  454/ 463/18; 1982/2021 tok/s;  29219 sec;
[2023-04-30 01:38:38,441 INFO] Step 3950/ 4000; acc: 74.1; ppl:  13.7; xent: 2.6; lr: 0.01055; sents:    5301; bsz:  450/ 460/17; 1985/2031 tok/s;  29292 sec;
[2023-04-30 01:39:50,606 INFO] Step 3960/ 4000; acc: 73.9; ppl:  13.9; xent: 2.6; lr: 0.01053; sents:    5732; bsz:  450/ 460/18; 1994/2041 tok/s;  29364 sec;
[2023-04-30 01:41:02,304 INFO] Step 3970/ 4000; acc: 74.3; ppl:  13.6; xent: 2.6; lr: 0.01052; sents:    5621; bsz:  454/ 462/18; 2025/2061 tok/s;  29436 sec;
[2023-04-30 01:42:13,855 INFO] Step 3980/ 4000; acc: 74.1; ppl:  13.6; xent: 2.6; lr: 0.01051; sents:    5607; bsz:  452/ 459/18; 2023/2051 tok/s;  29507 sec;
[2023-04-30 01:43:25,287 INFO] Step 3990/ 4000; acc: 74.5; ppl:  13.5; xent: 2.6; lr: 0.01049; sents:    5374; bsz:  453/ 462/17; 2029/2069 tok/s;  29579 sec;
[2023-04-30 01:44:36,800 INFO] Step 4000/ 4000; acc: 74.3; ppl:  13.6; xent: 2.6; lr: 0.01048; sents:    5261; bsz:  453/ 463/16; 2026/2072 tok/s;  29650 sec;
[2023-04-30 01:44:36,800 INFO] Train perplexity: 13.5562
[2023-04-30 01:44:36,800 INFO] Train accuracy: 74.7606
[2023-04-30 01:44:36,800 INFO] Sentences processed: 2.42828e+06
[2023-04-30 01:44:36,800 INFO] Average bsz:  453/ 463/19
[2023-04-30 01:44:36,912 INFO] Saving checkpoint nllb-200/nllb-200-3.3B-onmt.pt_step_4000.pt

It should show 256232. Below those lines, did you see “update vocab” and the 26 added tokens?

Can you also show the config file for inference ?

Yes, not below, but above:

[2023-04-29 17:28:55,376 INFO] Parsed 1 corpora from -data.
[2023-04-29 17:28:55,376 INFO] Loading checkpoint from nllb-200-3.3B-onmt.pt
[2023-04-29 17:28:59,467 WARNING] configured transforms is different from checkpoint: +{'sentencepiece', 'suffix', 'prefix'}
[2023-04-29 17:28:59,467 INFO] Get prefix for ccmatrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'zho_Hans'}
[2023-04-29 17:28:59,467 INFO] Get prefix for src infer: 
[2023-04-29 17:28:59,467 INFO] Get prefix for tgt infer: 
[2023-04-29 17:28:59,467 INFO] Get suffix for ccmatrix-enzh: {'src': '', 'tgt': ''}
[2023-04-29 17:28:59,468 INFO] Get suffix for src infer: 
[2023-04-29 17:28:59,468 INFO] Get suffix for tgt infer: 
[2023-04-29 17:28:59,468 INFO] Get special vocabs from Transforms: {'src': ['eng_Latn', '</s>'], 'tgt': ['zho_Hans']}.
[2023-04-29 17:29:00,243 INFO] Updating checkpoint vocabulary with new vocabulary
[2023-04-29 17:29:00,245 INFO] Get prefix for ccmatrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'zho_Hans'}
[2023-04-29 17:29:00,246 INFO] Get prefix for src infer: 
[2023-04-29 17:29:00,247 INFO] Get prefix for tgt infer: 
[2023-04-29 17:29:00,249 INFO] Get suffix for ccmatrix-enzh: {'src': '', 'tgt': ''}
[2023-04-29 17:29:00,250 INFO] Get suffix for src infer: 
[2023-04-29 17:29:00,253 INFO] Get suffix for tgt infer: 
[2023-04-29 17:29:00,255 INFO] Get special vocabs from Transforms: {'src': ['eng_Latn', '</s>'], 'tgt': ['zho_Hans']}.
[2023-04-29 17:29:01,140 INFO] Building model...
[2023-04-29 17:30:06,334 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-04-29 17:30:07,773 INFO] src: 26 new tokens
[2023-04-29 17:30:10,594 INFO] tgt: 26 new tokens
[2023-04-29 17:30:26,426 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(256231, 2048, padding_idx=2)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
.....

config for inference:

src_prefix: "eng_Latn"
tgt_prefix: "zho_Hans"
tgt_file_prefix: true
src_suffix: "</s>"
tgt_suffix: ""

#### Subword
src_subword_model: "/workspace/my/flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "/workspace/my/flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Model info
model: "/workspace/my/nllb-200/nllb-200-3.3B-onmt.pt_step_4000.pt"
# Inference
max_length: 512
gpu: 0
batch_type: tokens
batch_size: 2048
fp16:
beam_size: 5
report_time: true

I used the same newstest2019 test set.

Something must be wrong with your vocab file: 256206 + 26 = 256232, but your log shows 256231.
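
A quick way to track down the off-by-one is to count the tokens in dictionary2.txt and check for duplicates or empty lines; a small sketch:

from collections import Counter

with open('dictionary2.txt', encoding='utf-8') as f:
    tokens = [l.split()[0] for l in f if l.strip()]

print("tokens in file:", len(tokens))   # should be 256206 + 26 = 256232
dupes = [t for t, c in Counter(tokens).items() if c > 1]
print("duplicated tokens:", dupes)      # a duplicate or a missing line would explain 256231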