Finetuning and Curating NLLB-200 with OpenNMT-py

root@6add97d7bc22:/workspace/my# wc -l dictionary.txt 
256206 dictionary.txt
root@6add97d7bc22:/workspace/my# wc -l dictionary2.txt 
256232 dictionary2.txt
root@6add97d7bc22:/workspace/my# tail -n 30 dictionary2.txt 
zul_Latn 1
饱 1
畅 1
湍 1
滩 1
岭 1
舱 1
诩 1
阔 1
荫 1
鸽 1
勋 1
鸡 1
鹰 1
裙 1
艳 1
哦 1
毋庸 1
稻 1
蔗 1
熔 1
亥 1
裤 1
氢 1
《 1
》 1
... 1
… 1
<pad1> 1
<pad2> 1
root@6add97d7bc22:/workspace/my# 

Strange, I will try to fix the spm model again.

It might be that two lines of the vocab file contain the same token (it might not come from the spm file).

Yes, the token “…” is already in the dictionary.
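
To double-check, here is a minimal sketch (not part of the tutorial; the file name is the one from above) that lists any token appearing on more than one line of the vocab file:

from collections import Counter

# Each line of the vocab file is "token count"; keep only the token part.
tokens = []
with open("dictionary2.txt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if line:
            tokens.append(line.rsplit(" ", 1)[0])

# Any token seen more than once is a duplicated entry.
duplicates = [tok for tok, count in Counter(tokens).items() if count > 1]
print(duplicates)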

Do you think this may cause the problem with the characters not being added?

No, I don’t think so, and as a matter of fact I need to modify the tutorial.
I will re-run it on my side and check what is going on.

EDIT:

So I did some tests. It just happens that those 26 tokens (besides the …) are not so common in the datasets I checked (cc-matrix, paracrawl, news-commentary). I was relying on a comment someone made on the fairseq repo, but those tokens do not seem to be so necessary after all.

However, I did the following: I put the first 25 new tokens into a newtok.txt file (without the frequency), then ran:

paste cc-matrix.enzh.en cc-matrix.enzh.zh > cc-matrix.enzh.tsv
grep -af newtok.txt cc-matrix.enzh.tsv > cc-matrix.enzh.newtok.tsv
cut -f1 cc-matrix.enzh.newtok.tsv > cc-matrix.enzh.newtok.en
cut -f2 cc-matrix.enzh.newtok.tsv > cc-matrix.enzh.newtok.zh

This gives 341K lines out of 21M. I did the same for paracrawl.

I finetuned on the restricted data.

It does learn a few tokens (滩, 鸡, 《, 》); maybe it requires training longer. BLEU is 30.

But there is still a lot of “??”, which is the unknown (<unk>) token for sentencepiece. I don’t speak Chinese, but if you can identify some missing characters, it may help further.

The whole procedure is fine, but learning new embeddings and impacting the model obviously takes more training. Maybe the BLEU increase comes from the 《, 》 characters; there seem to be a lot of them.


Thank you! Looks promising. We just need to determine the missed characters somehow.
I don’t speak Chinese either, but I asked ChatGPT to find the missing characters. For the first lines in newstest it gives:

I think it is possible to identify them all, just by asking ChatGPT for every line containing ??. But how do we filter out the ones already in the sentencepiece model?

Just by comparing the reference translations from newstest2019 with the ones generated by NLLB, I got 446 missed characters. After filtering out the characters already in dictionary.txt, I got 245 missed characters. I guess there are more, but it’s a good starting point.
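
For reference, a rough sketch of that filtering step, under assumed file names (missing_chars.txt holds one candidate character per line; the spm model path is illustrative): keep only characters that are neither in dictionary.txt nor known to the sentencepiece model. The characters that survived are listed below.

import sentencepiece as spm

# Tokens already present in the NLLB vocab file ("token count" per line).
vocab = set()
with open("dictionary.txt", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if line:
            vocab.add(line.rsplit(" ", 1)[0])

# Candidate characters collected by comparing references with NLLB output.
with open("missing_chars.txt", encoding="utf-8") as f:
    candidates = [c.strip() for c in f if c.strip()]

sp = spm.SentencePieceProcessor(model_file="flores200_sacrebleu_tokenizer_spm.model")

# A character the spm model does not know as a piece maps to the unknown id.
truly_missing = [c for c in candidates
                 if c not in vocab and sp.piece_to_id(c) == sp.unk_id()]
print(len(truly_missing))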

惫
衅
劾
舵
鲍
蛾
颇
苹
隘
…
脾
赌
邂
蜡
栋
窍
匪
;
甩
伞
妈
)
?
“
盈
瞩
崛
哗
哮
夹
缄
帜
吨
黛
抨
谐
垄
瀚
鄙
涡
:
哽
蛊
蹂
帘
呐
蝇
浩
噬
瞒
镑
缮
聆
汤
谩
熬
阁
撑
淇
崔
诙
寐
烛
爹
搁
铀
芒
搂
痪
幢
&
弩
贺
妆
脍
坠
炬
宠
谊
瞪
雳
袂
蚁
踢
萝
岗
菠
‘
崽
攀
瞄
绅
觅
衔
茵
闹
侈
茁
栏
魁
椰
瘫
顽
坪
哑
蓬
辉
笨
袁
闲
郑
巅
茫
飓
沮
掷
躏
吴
”
牵
碧
搏
挡
颈
’
坍
屡
(
镍
橙
哇
霹
碾
俑
逅
砾
栖
坞
腥
奄
咽
溃
悄
赁
凰
晤
匮
铲
扑
炒
袱
铝
遏
汀
½
诀
—
呃
鸵
蚂
钦
虾
滨
唏
渗
瑰
韵
刘
涨
凛
烬
乍
赈
硅
啸
挣
僵
翩
缠
诽
尴
肇
肮
!
腾
凤
颖
浓
讽
煌
泪
砥
飘
骼
鲨
厦
蹚
咆
榈
琼
斩
锅
喂
檐
擂
挠
肃
垫
5
陋
滕
舆
澜
寅
豫
顷
%
铃
弯
岌
嘻
榄
,
凿
劲
耸
炙
彭
谍
辙
缤
锋
扳
贼
豹
溅
锐
叮
篷
钞
凑
憨
纱
蝎
润

I want to finetune the 600M NLLB model for a language not in NLLB. What do I need to change in the config? I tried the 1.3B but it failed with an OOM issue. What GPU memory size do I need to finetune the 1.3B?

Read the other thread about LoRa; you will be able to finetune bigger models than the 600M (this one is not very good).

Thank you! The 1.3B NLLB is working using LoRa. But I had to reduce the batch_size to 256, and there was still an OOM at some steps. You used a batch_size of 384. What is special about 384? It doesn’t look like a random number.

[2023-05-18 08:51:11,186 INFO] Get prefix for cc-matrix-enzh: {'src': ' eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-18 08:51:11,186 INFO] Get prefix for src infer:
[2023-05-18 08:51:11,186 INFO] Get prefix for tgt infer:
[2023-05-18 08:51:11,186 INFO] Get suffix for cc-matrix-enzh: {'src': '', 'tgt': ''}
[2023-05-18 08:51:11,186 INFO] Get suffix for src infer:
[2023-05-18 08:51:11,186 INFO] Get suffix for tgt infer:
[2023-05-18 08:51:11,266 INFO] Get prefix for cc-matrix-enzh: {'src': ' eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-18 08:51:11,266 INFO] Get prefix for src infer:
[2023-05-18 08:51:11,266 INFO] Get prefix for tgt infer:
[2023-05-18 08:51:11,309 INFO] Starting training on GPU: [0]
[2023-05-18 08:51:11,309 INFO] Start training loop without validation…
[2023-05-18 08:51:11,309 INFO] Scoring with: TransformPipe()
[2023-05-18 08:52:43,343 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,394 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,436 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,479 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,522 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,564 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,603 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,646 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,690 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,735 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,777 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,821 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,863 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,906 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,947 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:43,987 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:52:44,027 INFO] Step 3, cuda OOM - batch removed
[2023-05-18 08:53:40,481 INFO] Step 10/20000; acc: 87.1; ppl: 41.1; xent: 3.7; lr: 0.01031; sents: 2059; bsz: 242/ 173/ 7; 491/350 tok/s; 149 sec;
[2023-05-18 08:55:01,678 INFO] Step 20/20000; acc: 88.6; ppl: 34.7; xent: 3.5; lr: 0.01969; sents: 2012; bsz: 228/ 171/ 6; 901/673 tok/s; 230 sec;

Post your config in the other thread; it’s better to track the issues/questions w.r.t. LoRa there.


Hi Vince, I am trying to fine-tune this model, but I can’t find “dictionary2.txt”. Where is it? Can you show me a demo?

Read the tutorial again: you need to create dictionary2.txt yourself based on dictionary.txt.
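
If it helps, here is a minimal sketch of that step (newtok.txt is an assumed file holding the new tokens, one per line): copy dictionary.txt and append each new token with a dummy frequency of 1, skipping anything already present so you do not end up with duplicates like the “…” case above.

# Build dictionary2.txt from dictionary.txt plus the new tokens.
with open("dictionary.txt", encoding="utf-8") as f:
    base = f.read()
existing = {line.rsplit(" ", 1)[0] for line in base.splitlines() if line}

with open("newtok.txt", encoding="utf-8") as f:
    new_tokens = [t.strip() for t in f if t.strip() and t.strip() not in existing]

with open("dictionary2.txt", "w", encoding="utf-8") as out:
    out.write(base if base.endswith("\n") else base + "\n")
    for tok in new_tokens:
        out.write(f"{tok} 1\n")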

Yeah, but where is dictionary.txt? I can’t find it anywhere open source. Can you show me a link? Thank you!


Ok, thanks :tulip:

Hello, I ran into a problem with the “magic” script. Here is the error:

python magic.py
Traceback (most recent call last):
  File "magic.py", line 5, in <module>
    import sentencepiece_model_pb2 as model
ModuleNotFoundError: No module named 'sentencepiece_model_pb2'

Hello, you can try replacing

import sentencepiece_model_pb2 as model

with

import sentencepiece.sentencepiece_model_pb2 as model
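
For context, a rough sketch of what such a script typically does with that import: load the sentencepiece model proto, append the new pieces, and write a new model file. The paths, the score value, and the output name are assumptions, not the tutorial’s exact magic.py.

import sentencepiece.sentencepiece_model_pb2 as model

# Load the existing SentencePiece model proto.
m = model.ModelProto()
with open("flores200_sacrebleu_tokenizer_spm.model", "rb") as f:
    m.ParseFromString(f.read())

# Append each new token as a piece, skipping ones the model already has.
existing = {p.piece for p in m.pieces}
with open("newtok.txt", encoding="utf-8") as f:
    for tok in (line.strip() for line in f):
        if tok and tok not in existing:
            m.pieces.append(model.ModelProto.SentencePiece(piece=tok, score=0.0))

# Write the modified model under a new name.
with open("flores200_spm_newtok.model", "wb") as f:
    f.write(m.SerializeToString())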

Hey Vincent,

Thanks for the extensive tutorial. I’m getting this error in Colab
“RuntimeError: The expanded size of the tensor (1024) must match the existing size (2048) at non-singleton dimension 0. Target sizes: [1024]. Tensor sizes: [2048]”

Here’s my training config.



share_vocab: true
src_vocab: "/content/drive/MyDrive/MT5/NLLB_CT2/nllb-200-3.3B-onmt/dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 256254
tgt_vocab: "/content/drive/MyDrive/MT5/NLLB_CT2/nllb-200-3.3B-onmt/dictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256254
vocab_size_multiple: 1
decoder_start_token: '</s>'
#### Subword
src_subword_model: "/content/drive/MyDrive/MT5/NLLB_CT2/nllb-200-3.3B-onmt/flores200_sacrebleu_tokenizer_spm_fp16.model"
tgt_subword_model: "/content/drive/MyDrive/MT5/NLLB_CT2/nllb-200-3.3B-onmt/flores200_sacrebleu_tokenizer_spm_fp16.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    flores:
        path_src: "/content/drive/MyDrive/MT5/NLLB_CT2/flores200_dataset/devtest/vie_Latn.devtest"
        path_tgt: "/content/drive/MyDrive/MT5/NLLB_CT2/flores200_dataset/devtest/tha_Thai.devtest"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "vie_Latn"
        tgt_prefix: "tha_Thai"
        src_suffix: ""
        tgt_suffix: ""
update_vocab: true
train_from: "/content/drive/MyDrive/MT5/NLLB_CT2/nllb-200-3.3B-onmt/nllb-200-3.3B-onmt.pt"
reset_optim: all
save_data: "/content/drive/MyDrive/MT5/NLLB_CT2/trained_28_6/"
save_model: "/content/drive/MyDrive/MT5/NLLB_CT2/trained_28_6/nllb-200-3.3B-onmt"
log_file: "/content/drive/MyDrive/MT5/NLLB_CT2/trained_28_6/nllb-200-3.3B-onmt.log"
keep_checkpoint: 50
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 100
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
add_qkvbias: true
add_ffnbias: true
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'



By the way, my inference config runs well.

Let me know if I can provide extra logs. Thanks in advance

You need to match the config of the model you’re trying to finetune:

hidden_size: 1024
word_vec_size: 1024

need to be replaced by 2048 for the 3.3B.

But I doubt you’ll be able to do this on Colab. You may need to use LoRA (read the other threads).
