Finetuning and Curating NLLB-200 with OpenNMT-py

ILG2021 · July 29, 2023, 6:57am

So will you submit a pull request to these two projects, OpenNMT and SentencePiece?

ILG2021 · July 30, 2023, 5:54am

Hello, thanks for your wonderful work. I still have some questions:

If I have trained English-Chinese, should I also train Chinese-English?
How do I train on multiple languages? Do I use an array in the data section, like this?

   data:
        -custom-tlzh:
            path_src: "./nllb-200/dataset.tl"
            path_tgt: "./nllb-200/dataset.zh"
            transforms: [sentencepiece, prefix, suffix, filtertoolong]
            weight: 10
            src_prefix: "</s> tgl_Latn"
            tgt_prefix: "zho_Hans"
            src_suffix: ""
            tgt_suffix: ""
        -custom-enzh:
            path_src: "./nllb-200/dataset.en"
            path_tgt: "./nllb-200/dataset.zh"
            transforms: [sentencepiece, prefix, suffix, filtertoolong]
            weight: 10
            src_prefix: "</s> eng_Latn"
            tgt_prefix: "zho_Hans"
            src_suffix: ""
            tgt_suffix: ""

vince62s · July 30, 2023, 10:43am

You need this:

            src_prefix: "tgl_Latn"
            tgt_prefix: "zho_Hans"
            src_suffix: "</s> "

You don’t necessarily need the other direction.

It is best to use the 1.3B or 3.3B model with LoRA, and to make sure the accuracy in the first training steps is not too low.
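For the multi-corpus question, here is a minimal sketch of how the data section could be assembled with the corrected prefix and suffix. It assumes each corpus is a named key under data (a mapping, not an array). The helper name corpus_entry is hypothetical, the paths and weights are the ones from the question, and tgt_suffix is simply carried over unchanged from the question.

```python
import json

def corpus_entry(src_path, tgt_path, src_lang, tgt_lang, weight=10):
    """Build one corpus entry; the language code is the prefix, </s> the source suffix."""
    return {
        "path_src": src_path,
        "path_tgt": tgt_path,
        "transforms": ["sentencepiece", "prefix", "suffix", "filtertoolong"],
        "weight": weight,
        "src_prefix": src_lang,   # e.g. "tgl_Latn", no leading </s>
        "tgt_prefix": tgt_lang,
        "src_suffix": "</s>",     # </s> moves from the prefix to the suffix
        "tgt_suffix": "",         # unchanged from the question (assumption)
    }

# One key per corpus; both corpora are sampled during the same training run.
data = {
    "custom-tlzh": corpus_entry("./nllb-200/dataset.tl", "./nllb-200/dataset.zh",
                                "tgl_Latn", "zho_Hans"),
    "custom-enzh": corpus_entry("./nllb-200/dataset.en", "./nllb-200/dataset.zh",
                                "eng_Latn", "zho_Hans"),
}

# JSON output of this simple structure can be pasted into a YAML config.
print(json.dumps({"data": data}, indent=4))
```

Note that the two entries point at the same target file, as in the question; in practice each source file needs a target file aligned to it line by line.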

ILG2021 · July 30, 2023, 1:34pm

Thanks for your reply. Can you give me the complete config? I want to train English to Chinese and Tagalog to Chinese at the same time. I will convert the fine-tuned PyTorch model to CTranslate2. Can I still use LoRA in that case?

ILG2021 · July 31, 2023, 3:10am

When I fine-tune the 3.3B or 1.3B (in a notebook on a cloud GPU), I get the error below:

  File "/workspace/OpenNMT-py/onmt/train_single.py", line 165, in main
    model = build_model(model_opt, opt, vocabs, checkpoint)
  File "/workspace/OpenNMT-py/onmt/model_builder.py", line 412, in build_model
    model.load_state_dict(
  File "/workspace/OpenNMT-py/onmt/models/model.py", line 142, in load_state_dict
    raise ValueError(
ValueError: Extra keys in model state_dict do not match the model config dict_keys

Only the 1.3B can be fine-tuned.

ILG2021 · August 1, 2023, 5:32am

Hello, which GPU did you use to fine-tune the 3.3B? How much memory is needed to fine-tune it?

sersh · August 1, 2023, 6:10am

I used an A100 GPU with 80 GB of memory. It is now possible to fine-tune with 24 GB of memory using LoRA (I have not tried it).

ILG2021 · August 1, 2023, 10:35am

Thank you, I will give it a try. The 3.3B probably needs more than 60 GB, am I right?

sersh · August 1, 2023, 1:04pm

It depends on the batch size.
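A rough back-of-the-envelope sketch of the memory figures discussed here. The bytes-per-parameter values are common rules of thumb, not measurements from this thread: 2 bytes per fp16 weight, and about 16 bytes per parameter for full mixed-precision Adam fine-tuning (fp16 weights and gradients plus fp32 master weights and two moment buffers). Activations are excluded, and they are the part that grows with batch size.

```python
NOMINAL_PARAMS = 3.3e9  # "3.3B" taken at face value; the exact count may differ

def gigabytes(params, bytes_per_param):
    """Memory in GB (10**9 bytes) for `params` values stored at `bytes_per_param` each."""
    return params * bytes_per_param / 1e9

# Frozen fp16 weights only, roughly what a LoRA run keeps for the base model.
frozen_fp16 = gigabytes(NOMINAL_PARAMS, 2)

# Full fine-tuning with Adam in mixed precision, before any activations.
full_adam = gigabytes(NOMINAL_PARAMS, 16)

print(f"fp16 weights only: {frozen_fp16:.1f} GB")
print(f"full Adam fine-tuning, no activations: {full_adam:.1f} GB")
```

Under these assumptions the base weights alone are a small fraction of the full fine-tuning footprint, which is consistent with an 80 GB card for full fine-tuning and a 24 GB card being plausible with LoRA.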

kitkhai · August 21, 2023, 10:13am

Hey Vincent,

Thank you for the tutorial. I am using the nllb-200-600M-onmt.pt checkpoint and have followed every single detail in your tutorial, but when I try to fine-tune the model, I’m getting this error in Colab: AssertionError: An error in model’s partition and checkpoint’s slice was detected

Is there something I need to change in my train.yml file?

Thanks!

vince62s · August 22, 2023, 8:04am

can you post your yml file?

kitkhai · August 22, 2023, 10:44am

train.yml:

share_vocab: true
src_vocab: "/content/drive/MyDrive/OpenNMT-py/nllb-200/dictionary2.txt"
src_words_min_frequency: 1
src_vocab_size: 256025
tgt_vocab: "/content/drive/MyDrive/OpenNMT-py/nllb-200/dictionary2.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256025
vocab_size_multiple: 1
decoder_start_token: '</s>'
#### Subword
src_subword_model: "/content/drive/MyDrive/OpenNMT-py/nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "/content/drive/MyDrive/OpenNMT-py/nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    cc-matrix-enzh:
        path_src: "/content/drive/MyDrive/OpenNMT-py/en-zh/cc-matrix-enzh-0to30M.en"
        path_tgt: "/content/drive/MyDrive/OpenNMT-py/en-zh/cc-matrix-enzh-0to30M.zh"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "</s> eng_Latn"
        tgt_prefix: "zho_Hans"
        src_suffix: ""
        tgt_suffix: ""
update_vocab: true
train_from: "/content/drive/MyDrive/OpenNMT-py/nllb-200/nllb-200-600M-onmt.pt"
reset_optim: all
save_data: "/content/drive/MyDrive/OpenNMT-py/nllb-200"
save_model: "/content/drive/MyDrive/OpenNMT-py/nllb-200/nllb-200-600M-onmt2.pt"
log_file: "/content/drive/MyDrive/OpenNMT-py/nllb-200/nllb-200-600M-onmt.log"
keep_checkpoint: 50
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 2000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
add_qkvbias: true
add_ffnbias: true
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

Does it have to do with the fact that I used a different checkpoint compared to your tutorial? I’m also not sure if my vocab size is correct but this was the output from the code to modify the SentencePiece model.

vince62s · August 22, 2023, 11:58am

What checkpoint are you talking about?

kitkhai · August 22, 2023, 2:45pm

I am using the nllb-200-600M-onmt.pt checkpoint from the s3 server.

vince62s · August 22, 2023, 4:44pm

Then you need:
enc_layers: 12
dec_layers: 12
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 4096

kitkhai · August 23, 2023, 2:05am

Thank you so much! It worked! How can I retrieve this information about the model architecture if I have to fine-tune another model checkpoint in the future?
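One possible way to look this up, sketched under an assumption the thread does not confirm: that the .pt checkpoint is a dictionary which stores the training options under an "opt" key, with attributes named like the config options. The function names and the ARCH_FIELDS tuple are hypothetical helpers for illustration.

```python
from types import SimpleNamespace

# Option names taken from the config in this thread.
ARCH_FIELDS = ("enc_layers", "dec_layers", "heads",
               "hidden_size", "word_vec_size", "transformer_ff")

def load_saved_options(checkpoint_path):
    """Assumption: the checkpoint is a dict holding the saved options under "opt"."""
    import torch  # imported lazily so the rest of the sketch runs without PyTorch
    # Newer PyTorch releases may need weights_only=False to unpickle the options object.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    return checkpoint["opt"]

def architecture_summary(opt):
    """Collect the architecture-related options; missing ones are reported as None."""
    return {name: getattr(opt, name, None) for name in ARCH_FIELDS}

# Stand-in object using the 600M values given above, so the sketch runs as-is.
example_opt = SimpleNamespace(enc_layers=12, dec_layers=12, heads=16,
                              hidden_size=1024, word_vec_size=1024,
                              transformer_ff=4096)
print(architecture_summary(example_opt))
```

With a real checkpoint, the call would be architecture_summary(load_saved_options("path/to/checkpoint.pt")), and the printed values can be copied into the model section of the config.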

Now I get another error message:

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

That seems to be because I was loading too much data at once in my Colab notebook, so I think I’ll reduce the amount of data that I use (600K lines) to around 300K.

kitkhai · August 24, 2023, 8:21am

Hi again @vince62s

I saw that you used 341K lines, so I tried using 300K lines of training data and then just 10 lines. However, I still got the same error, even after reducing the batch size to 1. I am using Google Colab, which provides around 12 GB of system RAM and 15 GB of GPU RAM.

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

My Google search attributes the error message to running out of memory (my system RAM seems to be the problem, not the GPU RAM), but I’m not sure what else I can change. Currently my batching and optimisation configuration is as follows:

# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 1
valid_batch_size: 1
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]

# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

vince62s · August 25, 2023, 7:43am

batch_size: 1 with batch_type: tokens means that you use batches of 1 token, which makes no sense.
Also, don’t use sgd; use adam with a learning rate of 1e-4.
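To make the point about token batches concrete, here is a small arithmetic sketch. It treats batch_size as an upper bound on tokens per batch and multiplies by accum_count and world_size to get tokens per optimizer update. The helper name tokens_per_update is hypothetical, and the optimizer values in the dictionary are only the ones suggested above.

```python
def tokens_per_update(batch_size, accum_count, world_size=1):
    """Upper bound on tokens seen per optimizer step when batch_type is "tokens"."""
    return batch_size * accum_count * world_size

# Original config: 384-token batches accumulated over 32 steps on one GPU.
original = tokens_per_update(batch_size=384, accum_count=32)

# batch_size: 1 leaves at most 32 tokens per update, less than one typical sentence.
reduced = tokens_per_update(batch_size=1, accum_count=32)

print(original)  # 12288
print(reduced)   # 32

# Optimizer settings as suggested in the reply above.
suggested_optim = {"optim": "adam", "learning_rate": 1e-4}
print(suggested_optim)
```

Shrinking batch_size does not address a shortage of system RAM in any case, since the number of tokens per batch mainly affects GPU memory.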

kitkhai · August 28, 2023, 1:22am

Hi @vince62s

The NLLB 200 600M - Transformer checkpoint that I downloaded from the OpenNMT-py models page gave me really weird results.

When I ran:

python3 ~/OpenNMT-py/translate.py --config nllb-inference.yaml -src /en-zh/testsets/newstest2019-enzh-src.en -output newstest2019-enzh-hyp.zh

My input English sentence:

My uncle saw that the eagle caught the chickens

My model output in Chinese (Simplified) was complete gibberish and repetitive:

现在,我知道这个问题是什么,我知道这个问题是什么,我认为这是什么.

I used the exact same inference yaml file and only changed the reference to the model checkpoint, so I am really confused about what went wrong.

vince62s · August 29, 2023, 12:57pm

You must be using master and not the latest pip version. I will push a fix for this.

You can git pull and try again.
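A quick way to see which release is installed in the current environment, as a minimal sketch using only the standard library. It assumes the distribution is registered under the name "OpenNMT-py"; a checkout of master installed from source may report the same version number as a release, so this is only a first check. The helper name installed_version is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

# Prints the version string, or None when the package is not installed here.
print(installed_version("OpenNMT-py"))
```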