Finetuning and Curating NLLB-200 with OpenNMT-py

ILG2021 · July 31, 2023, 3:10am

When I fine-tune 3.3B or 1.3B (notebook on cloud GPU), it gives the error below:

  File "/workspace/OpenNMT-py/onmt/train_single.py", line 165, in main
    model = build_model(model_opt, opt, vocabs, checkpoint)
  File "/workspace/OpenNMT-py/onmt/model_builder.py", line 412, in build_model
    model.load_state_dict(
  File "/workspace/OpenNMT-py/onmt/models/model.py", line 142, in load_state_dict
    raise ValueError(
ValueError: Extra keys in model state_dict do not match the model config dict_keys

Only the 1.3B can be fine-tuned.

ILG2021 · August 1, 2023, 5:32am

Hello, which GPU do you use to fine-tune 3.3B? How much memory is needed to fine-tune 3.3B?

sersh · August 1, 2023, 6:10am

I used an A100 GPU with 80 GB of memory. It is now possible to fine-tune with 24 GB of memory using LoRA (I have not tried it).

ILG2021 · August 1, 2023, 10:35am

Thank you, I will give it a try. 3.3B probably needs more than 60 GB, am I right?

sersh · August 1, 2023, 1:04pm

It depends on the batch size.
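
A back-of-the-envelope estimate can make the memory question above more concrete. The sketch below is only a rule of thumb: it assumes mixed-precision training with an Adam-style optimizer and the nominal parameter counts (1.3e9 and 3.3e9), none of which were measured on OpenNMT-py. The name `static_memory_gb` and the byte counts in `BYTES_PER_PARAM` are illustrative assumptions. Activations, which grow with the batch size, are deliberately left out.

```python
# Assumed bytes per parameter for mixed-precision training with an Adam-style
# optimizer (a rule of thumb, not measured on OpenNMT-py).
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment (fp32)": 4,
    "Adam second moment (fp32)": 4,
}


def static_memory_gb(n_params, bytes_per_param=BYTES_PER_PARAM):
    """Memory for weights, gradients and optimizer state, excluding activations."""
    return n_params * sum(bytes_per_param.values()) / 1e9


# Nominal parameter counts taken from the model names (assumption).
for name, n_params in [("1.3B", 1.3e9), ("3.3B", 3.3e9)]:
    print(f"{name}: about {static_memory_gb(n_params):.1f} GB before activations")
```

Under these assumptions the 3.3B model already needs roughly 53 GB before any activations are stored, and the batch size decides how much comes on top of that.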

kitkhai · August 21, 2023, 10:13am

Hey Vincent,

Thank you for the tutorial. I am using the nllb-200-600M-onmt.pt checkpoint and have followed every single detail in your tutorial, but while trying to fine-tune the model, I’m getting this error in Colab: AssertionError: An error in model’s partition and checkpoint’s slice was detected

Is there something I need to change in my train.yml file?

Thanks!

vince62s · August 22, 2023, 8:04am

Can you post your yml file?

kitkhai · August 22, 2023, 10:44am

train.yml:

share_vocab: true
src_vocab: "/content/drive/MyDrive/OpenNMT-py/nllb-200/dictionary2.txt"
src_words_min_frequency: 1
src_vocab_size: 256025
tgt_vocab: "/content/drive/MyDrive/OpenNMT-py/nllb-200/dictionary2.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256025
vocab_size_multiple: 1
decoder_start_token: '</s>'
#### Subword
src_subword_model: "/content/drive/MyDrive/OpenNMT-py/nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "/content/drive/MyDrive/OpenNMT-py/nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    cc-matrix-enzh:
        path_src: "/content/drive/MyDrive/OpenNMT-py/en-zh/cc-matrix-enzh-0to30M.en"
        path_tgt: "/content/drive/MyDrive/OpenNMT-py/en-zh/cc-matrix-enzh-0to30M.zh"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "</s> eng_Latn"
        tgt_prefix: "zho_Hans"
        src_suffix: ""
        tgt_suffix: ""
update_vocab: true
train_from: "/content/drive/MyDrive/OpenNMT-py/nllb-200/nllb-200-600M-onmt.pt"
reset_optim: all
save_data: "/content/drive/MyDrive/OpenNMT-py/nllb-200"
save_model: "/content/drive/MyDrive/OpenNMT-py/nllb-200/nllb-200-600M-onmt2.pt"
log_file: "/content/drive/MyDrive/OpenNMT-py/nllb-200/nllb-200-600M-onmt.log"
keep_checkpoint: 50
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 2000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
add_qkvbias: true
add_ffnbias: true
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

Does it have to do with the fact that I used a different checkpoint compared to your tutorial? I’m also not sure if my vocab size is correct but this was the output from the code to modify the SentencePiece model.

vince62s · August 22, 2023, 11:58am

What checkpoint are you talking about?

kitkhai · August 22, 2023, 2:45pm

I am using the nllb-200-600M-onmt.pt checkpoint from the s3 server.

vince62s · August 22, 2023, 4:44pm

then you need:
enc_layers: 12
dec_layers: 12
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 4096

kitkhai · August 23, 2023, 2:05am

Thank you so much! It worked! How may I retrieve such information about the model architecture if I have to finetune another model checkpoint in the future?

Now I get another error message

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

That seems to be because I was loading too much data at once in my Colab notebook, so I think I’ll reduce the amount of data that I use from 600K to around 300K.
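
On the question of how to recover the architecture of a checkpoint: one possible approach, sketched below, is to read the options saved inside the checkpoint and compare them with the values in train.yml. This assumes the checkpoint is a dictionary with an `opt` entry holding an argparse-style namespace, which should be verified for the file at hand. The helper names `architecture_from_opt` and `config_mismatches` are hypothetical, and the demo uses a stand-in namespace filled with the values given in this thread instead of loading a real file.

```python
from argparse import Namespace

# Architecture options discussed in this thread.
ARCH_KEYS = ["enc_layers", "dec_layers", "heads",
             "hidden_size", "word_vec_size", "transformer_ff"]


def architecture_from_opt(opt, keys=ARCH_KEYS):
    """Return the architecture-related options found on an options namespace."""
    return {key: getattr(opt, key, None) for key in keys}


def config_mismatches(opt, config):
    """Map each differing key to (value in config, value stored in checkpoint)."""
    stored = architecture_from_opt(opt)
    return {key: (config[key], stored[key])
            for key in stored
            if key in config and config[key] != stored[key]}


# With a real checkpoint (assumption: torch is installed and the file stores "opt"):
#   import torch
#   checkpoint = torch.load("nllb-200-600M-onmt.pt", map_location="cpu")
#   opt = checkpoint["opt"]
# Stand-in using the values given in this thread for the 600M checkpoint:
opt = Namespace(enc_layers=12, dec_layers=12, heads=16,
                hidden_size=1024, word_vec_size=1024, transformer_ff=4096)

# Architecture values from the train.yml posted above.
train_yml = {"enc_layers": 24, "dec_layers": 24, "heads": 16,
             "hidden_size": 1024, "word_vec_size": 1024, "transformer_ff": 8192}

print(config_mismatches(opt, train_yml))
```

For the config posted above, this flags `enc_layers`, `dec_layers` and `transformer_ff` as the settings that differ from the 600M checkpoint.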

kitkhai · August 24, 2023, 8:21am

Hi again @vince62s

I saw that you used 341K lines, so I tried using 300K lines of training data and subsequently 10 lines of training data. However, I still got the same error, even after reducing the batch size to 1. I am using Google Colab, which provides around 12 GB of system RAM and 15 GB of GPU RAM.

/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 20 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

My Google search attributes the error message to running out of memory (my system RAM seems to be the problem, not the GPU RAM), but I’m not sure what else I can change. Currently my batching and optimisation configuration is as follows:

# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 1
valid_batch_size: 1
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]

# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

vince62s · August 25, 2023, 7:43am

batch_size: 1 with batch_type: tokens means that you use batches of 1 token, which makes no sense.
Also, don’t use sgd; use adam with a learning rate of 1e-4.
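
To see why a token batch size of 1 cannot work, it helps to estimate how many tokens contribute to each optimizer update. The sketch below assumes the upper bound is simply batch_size × accum_count × world_size when batch_type is tokens. `max_tokens_per_update` is a hypothetical helper, not an OpenNMT-py function.

```python
def max_tokens_per_update(batch_size, accum_count, world_size=1):
    """Upper bound on tokens seen per optimizer step with token batching."""
    return batch_size * accum_count * world_size


# Values from the two configs posted above (accum_count: 32, world_size: 1).
first_config = max_tokens_per_update(batch_size=384, accum_count=32)
second_config = max_tokens_per_update(batch_size=1, accum_count=32)

print(f"batch_size 384: up to {first_config} tokens per update")
print(f"batch_size 1:   up to {second_config} tokens per update")
```

With batch_size: 1 the budget per batch is a single token, which is shorter than any prefixed sentence pair, so no real training example fits.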

kitkhai · August 28, 2023, 1:22am

Hi @vince62s

The NLLB 200 600M - Transformer checkpoint that I downloaded from OpenNMT-py models - OpenNMT gave me really weird results.

When I ran:

python3 ~/OpenNMT-py/translate.py --config nllb-inference.yaml -src /en-zh/testsets/newstest2019-enzh-src.en -output newstest2019-enzh-hyp.zh

My input English sentence:

My uncle saw that the eagle caught the chickens

My model output in Chinese (Simplified) was complete gibberish and repetitive:

现在,我知道这个问题是什么,我知道这个问题是什么,我认为这是什么.

I used the exact same inference yaml file and only changed the reference to the model checkpoint, so I am really confused about what went wrong.

vince62s · August 29, 2023, 12:57pm

You must be using master and not the latest pip version. I will push a fix for this.

You can git pull and try again.

kitkhai · September 1, 2023, 8:39am

Hi

I realised that I don’t quite understand how token batching differs from the more conventional batching. Typically we have whole sentences grouped together to form a batch. But with token batching, e.g. a batch size of 8 tokens, what is the length of the token sequences that are batched together?

Also, why is token batching used/recommended? I don’t quite understand, because I think that by defining a batch by tokens we could be splitting a sentence up, so the connections between words in a sentence may be broken. Wouldn’t that prevent the language model from learning optimally?

ArtanisTheOne · September 1, 2023, 9:33am

I believe token batching doesn’t split up the sentences themselves. It just tries to find the number of tokens closest to the token batch size you set that is also a multiple of 8 (traditionally).

Token batching is ‘better’ in this case because it keeps batch sizes more uniform. A batch size of 128 sentences takes 128 sentences regardless of their length, so the number of tokens in each batch can vary wildly. Token batching fixes that.
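
The idea can be illustrated with a simplified sketch. This is not OpenNMT-py's actual batching code: it assumes the cost of a batch is its padded size (longest sentence × number of sentences) and greedily fills batches with whole sentences after sorting by length. The function name `token_batches` is hypothetical.

```python
def token_batches(sentences, max_tokens):
    """Greedily group whole sentences so each batch's padded size fits max_tokens.

    Sentences are never split. In this sketch, a sentence longer than
    max_tokens still gets a batch of its own.
    """
    batches, current, longest = [], [], 0
    for sentence in sorted(sentences, key=len):
        new_longest = max(longest, len(sentence))
        # Padded size if this sentence joined the current batch.
        if current and new_longest * (len(current) + 1) > max_tokens:
            batches.append(current)
            current, new_longest = [], len(sentence)
        current.append(sentence)
        longest = new_longest
    if current:
        batches.append(current)
    return batches


corpus = [
    "my uncle saw that the eagle caught the chickens".split(),
    "thank you".split(),
    "it worked".split(),
    "hello there friend".split(),
]

for batch in token_batches(corpus, max_tokens=9):
    padded = max(len(s) for s in batch) * len(batch)
    print(f"{len(batch)} sentence(s), padded size {padded}")
```

Each batch holds a different number of sentences, but the padded token count stays close to the limit, and every sentence remains intact.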

i_la_13 · September 18, 2023, 10:33am

Hi Vincent, thank you so much for the tutorial!
I’m trying to fine-tune NLLB-200 3.3B using LoRA. The training works, but when I try to translate some simple sentences I get “” for all of them.
These are my config files for training and inference:
training:

# Vocab creation options
share_vocab: true

## Where the vocab(s) is
src_vocab: "dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 256206

tgt_vocab: "dictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256206

vocab_size_multiple: 1

decoder_start_token: '</s>'


### Transform related opts:

#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0


#Corpus opts:
data:

  corpus_enes:
    path_src: "en-es.en"
    path_tgt: "en-es.es"
    transforms: [sentencepiece, prefix, suffix, filtertoolong]
    src_prefix: "</s> eng_Latn"
    tgt_prefix: "spa_Latn"
    weight: 10
    src_suffix: "" 
    tgt_suffix: ""



#### Filter
src_seq_length: 250
tgt_seq_length: 250


# General opts
update_vocab: true 

train_from: "nllb-200-3.3B-onmt.pt"

reset_optim: all 
save_data: "/nllb-200"
save_model: "trained_models_en_es/nllb-200-en_es"
log_file: "train.log"

keep_checkpoint: -1

save_checkpoint_steps: 5000

average_decay: 0.0005
seed: 1234
report_every: 1
train_steps: 100000 
valid_steps: 5000 

# Batching
bucket_size: 262144
num_workers: 1
prefetch_factor:  400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"

batch_size: 1024                             
valid_batch_size: 1024                        
batch_size_multiple: 2                       
accum_count: [2]

accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "fusedadam" 
learning_rate: 0.1  
warmup_steps: 30 
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
dropout: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

#LoRa
lora_layers: ['linear_values', 'linear_query', 'linnear_keys', 'final_linear']
lora_rank: 8
lora_dropout: 0.05
lora_alpha: 1
lora_embedding: false

# Model
override_opts: true

# For LoRa to work
add_ffnbias: true
add_qkvbias: true

encoder_type: transformer
decoder_type: transformer

enc_layers: 24          
dec_layers: 24          
transformer_ff: 8192    

hidden_size: 2048
word_vec_size: 2048

heads: 16
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

inference:

transforms: [sentencepiece, prefix, suffix]
# nllb-200 specific prefixing and suffixing
src_prefix: "</s> eng_Latn"
tgt_prefix: "spa_Latn" 
tgt_file_prefix: true
src_suffix: ""
tgt_suffix: ""
#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Model info
model: nllb-200-en_es_step_65000.pt
# Inference
max_length: 256
gpu: 0
batch_type: tokens

# for 3,3B model
batch_size: 1024

fp16:
beam_size: 5
report_time: true
log_file: "translate.log"

Do you see anything wrong?

vince62s · September 18, 2023, 11:08am

config needs to be like:

        src_prefix: "eng_Latn"
        tgt_prefix: "deu_Latn"
        src_suffix: "</s>"
        tgt_suffix: ""

When you start logging the ACC/PPL (don’t wait 65000 steps), check that ACC is already very high and PPL is low.
If in doubt, post the log of the first 2000 steps here.
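
To make the difference between the two prefix/suffix layouts concrete, the sketch below shows the source token sequence each one produces. It assumes that the prefix and suffix are split on whitespace and attached to the already tokenized sentence, as suggested by the transform order `[sentencepiece, prefix, suffix]`. `apply_prefix_suffix` is a hypothetical helper, not the OpenNMT-py transform itself, and the subword pieces are made up for illustration.

```python
def apply_prefix_suffix(tokens, prefix="", suffix=""):
    """Attach whitespace-split prefix and suffix tokens around a token list."""
    return prefix.split() + list(tokens) + suffix.split()


# Made-up subword pieces standing in for real SentencePiece output.
pieces = ["▁Hello", "▁world"]

# Layout used in the training and inference configs posted above.
posted = apply_prefix_suffix(pieces, prefix="</s> eng_Latn", suffix="")

# Layout recommended in the reply.
recommended = apply_prefix_suffix(pieces, prefix="eng_Latn", suffix="</s>")

print("posted:     ", " ".join(posted))
print("recommended:", " ".join(recommended))
```

The recommended layout puts the language tag first and `</s>` at the end of the source sentence, while the posted configs put `</s>` at the very start.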