Finetuning and Curating NLLB-200 with OpenNMT-py

Thank you! The 1.3B NLLB is working with LoRa, but I had to reduce batch_size to 256 and there was still an OOM at some step. You used a batch_size of 384. What is special about 384? It doesn't look like a random number.

[2023-05-18 08:51:11,186 INFO] Get prefix for cc-matrix-enzh: {'src': ' eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-18 08:51:11,186 INFO] Get prefix for src infer:
[2023-05-18 08:51:11,186 INFO] Get prefix for tgt infer:
[2023-05-18 08:51:11,186 INFO] Get suffix for cc-matrix-enzh: {'src': '', 'tgt': ''}
[2023-05-18 08:51:11,186 INFO] Get suffix for src infer:
[2023-05-18 08:51:11,186 INFO] Get suffix for tgt infer:
[2023-05-18 08:51:11,266 INFO] Get prefix for cc-matrix-enzh: {'src': ' eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-18 08:51:11,266 INFO] Get prefix for src infer:
[2023-05-18 08:51:11,266 INFO] Get prefix for tgt infer:
[2023-05-18 08:51:11,309 INFO] Starting training on GPU: [0]
[2023-05-18 08:51:11,309 INFO] Start training loop without validation…
[2023-05-18 08:51:11,309 INFO] Scoring with: TransformPipe()
[2023-05-18 08:52:43,343 INFO] Step 3, cuda OOM - batch removed
[... 16 more identical "Step 3, cuda OOM - batch removed" messages ...]
[2023-05-18 08:53:40,481 INFO] Step 10/20000; acc: 87.1; ppl: 41.1; xent: 3.7; lr: 0.01031; sents: 2059; bsz: 242/ 173/ 7; 491/350 tok/s; 149 sec;
[2023-05-18 08:55:01,678 INFO] Step 20/20000; acc: 88.6; ppl: 34.7; xent: 3.5; lr: 0.01969; sents: 2012; bsz: 228/ 171/ 6; 901/673 tok/s; 230 sec;
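
To make the numbers concrete, the batching options involved are roughly these (the tutorial's settings, with batch_size lowered to the 256 I'm now using; the accumulation values are simplified):

batch_type: "tokens"   # batch_size counts tokens, not sentences
batch_size: 256        # the tutorial uses 384 tokens per batch
valid_batch_size: 256
batch_size_multiple: 1
accum_count: [32]      # effective batch = batch_size x accum_count tokens per update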

Post your config in the other thread; it's better to track the LoRa-related issues / questions there.


Hi Vincent, I am trying to fine-tune this model, but I can't find the "dictionary2.txt". Where is it? Can you show me a demo?

Read the tutorial again: you need to create dictionary2.txt yourself, based on dictionary.txt.

Yeah, but where is "dictionary.txt"? I can't find it in the open-source release. Can you show me a link? Thank you!


OK, thanks :tulip:

Hello, I ran into a problem with the "magic" script. Here is the error:

python magic.py
Traceback (most recent call last):
  File "magic.py", line 5, in <module>
    import sentencepiece_model_pb2 as model
ModuleNotFoundError: No module named 'sentencepiece_model_pb2'

Hello, you can try replacing

import sentencepiece_model_pb2 as model

with

import sentencepiece.sentencepiece_model_pb2 as model

Hey Vincent,

Thanks for the extensive tutorial. I'm getting this error in Colab:

RuntimeError: The expanded size of the tensor (1024) must match the existing size (2048) at non-singleton dimension 0. Target sizes: [1024]. Tensor sizes: [2048]

Here’s my training config.



share_vocab: true
src_vocab: "/content/drive/MyDrive/MT5/NLLB_CT2/nllb-200-3.3B-onmt/dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 256254
tgt_vocab: "/content/drive/MyDrive/MT5/NLLB_CT2/nllb-200-3.3B-onmt/dictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256254
vocab_size_multiple: 1
decoder_start_token: '</s>'
#### Subword
src_subword_model: "/content/drive/MyDrive/MT5/NLLB_CT2/nllb-200-3.3B-onmt/flores200_sacrebleu_tokenizer_spm_fp16.model"
tgt_subword_model: "/content/drive/MyDrive/MT5/NLLB_CT2/nllb-200-3.3B-onmt/flores200_sacrebleu_tokenizer_spm_fp16.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    flores:
        path_src: "/content/drive/MyDrive/MT5/NLLB_CT2/flores200_dataset/devtest/vie_Latn.devtest"
        path_tgt: "/content/drive/MyDrive/MT5/NLLB_CT2/flores200_dataset/devtest/tha_Thai.devtest"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "vie_Latn"
        tgt_prefix: "tha_Thai"
        src_suffix: ""
        tgt_suffix: ""
update_vocab: true
train_from: "/content/drive/MyDrive/MT5/NLLB_CT2/nllb-200-3.3B-onmt/nllb-200-3.3B-onmt.pt"
reset_optim: all
save_data: "/content/drive/MyDrive/MT5/NLLB_CT2/trained_28_6/"
save_model: "/content/drive/MyDrive/MT5/NLLB_CT2/trained_28_6/nllb-200-3.3B-onmt"
log_file: "/content/drive/MyDrive/MT5/NLLB_CT2/trained_28_6/nllb-200-3.3B-onmt.log"
keep_checkpoint: 50
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 100
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
add_qkvbias: true
add_ffnbias: true
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'



For what it's worth, inference with the same converted model and vocab/subword settings runs well; the error only appears when training from this config.


Let me know if I can provide extra logs. Thanks in advance

You need to match the config of the model you're trying to fine-tune:

hidden_size: 1024
word_vec_size: 1024

need to be set to 2048 for the 3.3B.
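
In other words, the model section for the 3.3B should look like this (the other model options in your config already match the 3.3B dimensions, if I recall them correctly):

enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 2048
word_vec_size: 2048
transformer_ff: 8192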

But I doubt you'll be able to do this on Colab; you may need to use LoRA (see the other threads).
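
For reference, the LoRa-related options in recent OpenNMT-py versions look roughly like this (illustrative values only; check the LoRa threads for what people actually use):

# train low-rank adapters on the attention projections instead of updating all weights
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 2
lora_dropout: 0.0
lora_alpha: 8
lora_embedding: false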


So will you submit a pull request to these two projects, OpenNMT and SentencePiece?

Hello, thanks for your wonderful work. But I still have some questions:

  1. If I have trained English-Chinese, should I also train Chinese-English?
  2. How do I train multiple language pairs? Do I use multiple entries in the data section, like this?
   data:
        -custom-tlzh:
            path_src: "./nllb-200/dataset.tl"
            path_tgt: "./nllb-200/dataset.zh"
            transforms: [sentencepiece, prefix, suffix, filtertoolong]
            weight: 10
            src_prefix: "</s> tgl_Latn"
            tgt_prefix: "zho_Hans"
            src_suffix: ""
            tgt_suffix: ""
        -custom-enzh:
            path_src: "./nllb-200/dataset.en"
            path_tgt: "./nllb-200/dataset.zh"
            transforms: [sentencepiece, prefix, suffix, filtertoolong]
            weight: 10
            src_prefix: "</s> eng_Latn"
            tgt_prefix: "zho_Hans"
            src_suffix: ""
            tgt_suffix: ""

You need this:

src_prefix: "tgl_Latn"
tgt_prefix: "zho_Hans"
src_suffix: "</s> "

You don't necessarily need the other direction.
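
Putting it together, your data section would look roughly like this (your corpora and weights with the prefixes/suffixes fixed; note that the corpus names are plain mapping keys, without a leading dash):

data:
    custom-tlzh:
        path_src: "./nllb-200/dataset.tl"
        path_tgt: "./nllb-200/dataset.zh"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "tgl_Latn"
        tgt_prefix: "zho_Hans"
        src_suffix: "</s> "
        tgt_suffix: ""
    custom-enzh:
        path_src: "./nllb-200/dataset.en"
        path_tgt: "./nllb-200/dataset.zh"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "eng_Latn"
        tgt_prefix: "zho_Hans"
        src_suffix: "</s> "
        tgt_suffix: ""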

The best approach is to use the 1.3B or 3.3B with LoRa and to make sure the accuracy of the first training steps is not too low.


Thanks for your reply. Can you give me the complete config? I want to train English to Chinese and Tagalog to Chinese at the same time. I will convert the fine-tuned PyTorch model to CTranslate2; can I still use LoRa in that case?

When I fine-tune the 3.3B or 1.3B (in a notebook on a cloud GPU), it gives the error below:

  File "/workspace/OpenNMT-py/onmt/train_single.py", line 165, in main
    model = build_model(model_opt, opt, vocabs, checkpoint)
  File "/workspace/OpenNMT-py/onmt/model_builder.py", line 412, in build_model
    model.load_state_dict(
  File "/workspace/OpenNMT-py/onmt/models/model.py", line 142, in load_state_dict
    raise ValueError(
ValueError: Extra keys in model state_dict do not match the model config dict_keys

Only the 1.3B can be fine-tuned.

Hello, which GPU do you use to fine-tune the 3.3B? How much memory is needed to fine-tune it?

I used an A100 GPU with 80 GB of memory. It should now be possible to fine-tune with 24 GB of memory using LoRa (not tried).


Thank you, I will give it a try. The 3.3B may need more than 60 GB, am I right?

It depends on the batch size.
