Fine-tuning the nllb-200-distilled-600M model

Hello,

I was trying to fine-tune NLLB models with OpenNMT-py by following the documentation, and I successfully ran the script for the 1.3B model. However, I get an error while fine-tuning nllb-200-600M-onmt.pt. I am attaching the screenshot here. Please help me resolve this.

Regards,
Anjaly

You need to change the dimensions of the model accordingly in the YAML file.

That worked! Thanks a lot, @vince62s. I changed transformer_ff to 4096 in the YAML file.
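
One way to find the right values for the YAML file is to read them off the parameter shapes of the checkpoint. The following is a minimal sketch, not OpenNMT-py code: the helper `infer_dims` is hypothetical, and the parameter names are an assumption derived from the module names that OpenNMT-py prints when it builds the model (`encoder.transformer.N.feed_forward.w_1`). The shapes are passed in as a plain mapping so the sketch runs without loading a checkpoint.

```python
import re

def infer_dims(shapes):
    """Infer model dimensions from a mapping of parameter name -> shape tuple.

    Assumption: names follow the module tree printed by OpenNMT-py,
    e.g. "encoder.transformer.0.feed_forward.w_1.weight".
    A Linear weight has shape (out_features, in_features).
    """
    ff_size, hidden = shapes["encoder.transformer.0.feed_forward.w_1.weight"]
    layer_ids = set()
    for name in shapes:
        match = re.match(r"encoder\.transformer\.(\d+)\.", name)
        if match:
            layer_ids.add(int(match.group(1)))
    return {
        "hidden_size": hidden,
        "transformer_ff": ff_size,
        "enc_layers": len(layer_ids),
    }

# Toy shapes for illustration only (two encoder layers).
example_shapes = {
    "encoder.transformer.0.feed_forward.w_1.weight": (4096, 1024),
    "encoder.transformer.0.feed_forward.w_2.weight": (1024, 4096),
    "encoder.transformer.1.feed_forward.w_1.weight": (4096, 1024),
    "encoder.transformer.1.feed_forward.w_2.weight": (1024, 4096),
}
dims = infer_dims(example_shapes)
print(dims)
```

With a real checkpoint, the mapping would have to be built from the saved tensors first; where exactly they live inside the file is not covered here.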

@vince62s ,

I am fine-tuning models for Indian languages. All the predictions start with ‘⁇’. Here is a sample prediction:

Source text : "हे फरीसियों, तुम पर हाय ! तुम आराधनालयों में मुख्य-मुख्य आसन और बाजारों में नमस्कार चाहते हो।

Pred_1.3B : ⁇ "हे फरीसियाँ, थारै पै धिक्कार सै! थम आराधनालयां म्ह प्रधान-मुख्य आसन अर बाज़ार म्ह नमस्कार चाहो सों।

Is this expected, or am I missing something here?

You need to give more info on the source language, the target language, what you did for fine-tuning (config file), and your inference file + command line.

@vince62s,

My source language is Hindi and the target language is Haryanvi, a minority language of Hindi. Six characters were missing from the dictionary, so I added them and rebuilt the SPM model following the article I mentioned above. The commands I used are:
onmt_train --config nllb-train.yaml
onmt_translate --config nllb-inference.yaml -src src-test.hin -output tgt-test-hyp.har

nllb-train.yaml:

share_vocab: true
src_vocab: "nllb-200/dictionary2.txt"
src_words_min_frequency: 1
src_vocab_size: 256212
tgt_vocab: "nllb-200/dictionary2.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256212
vocab_size_multiple: 1
decoder_start_token: '</s>'
#### Subword
src_subword_model: "nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    bible_data:
        path_src: "bible_data/hin-har/src-train.hin"
        path_tgt: "bible_data/hin-har/tgt-train.har"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "</s> hin_Deva"
        tgt_prefix: "har_Deva"
        src_suffix: ""
        tgt_suffix: ""
update_vocab: true
train_from: "nllb-200/nllb-200-600M-onmt.pt"
reset_optim: all
save_data: "nllb-200"
save_model: "nllb-200/nllb-200-600M-onmt_2000_steps"
log_file: "nllb-200/nllb-200-600M-onmt.log"
keep_checkpoint: 50
save_checkpoint_steps: 2000
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 2000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 4096
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

nllb-inference.yaml:

transforms: [sentencepiece, prefix, suffix]
# nllb-200 specific prefixing and suffixing
src_prefix: "hin_Deva"
tgt_prefix: "har_Deva"
tgt_file_prefix: true
src_suffix: "</s>"
tgt_suffix: ""

#### Subword
src_subword_model: "nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Model info
model: "nllb-200/nllb-200-600M-onmt_2000_steps_step_2000.pt"
# Inference
max_length: 512
gpu: 0
batch_type: tokens
batch_size: 2048
fp16:
beam_size: 5
report_time: true

Thanks,
Anjaly

Can you post the training log?

You need to train with the same settings as in the inference config:
src_prefix: "hin_Deva"
src_suffix: "</s>"
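
A quick way to spot such differences is to compare the prefix and suffix options of the two configs side by side. This is only a sketch: the helper `prefix_suffix_mismatches` is hypothetical, and the two dictionaries are typed in by hand from the nllb-train.yaml and nllb-inference.yaml posted above rather than parsed from the files.

```python
def prefix_suffix_mismatches(train_cfg, infer_cfg):
    """Return the prefix/suffix options whose values differ between two configs."""
    keys = ("src_prefix", "src_suffix", "tgt_prefix", "tgt_suffix")
    return {
        key: (train_cfg.get(key, ""), infer_cfg.get(key, ""))
        for key in keys
        if train_cfg.get(key, "") != infer_cfg.get(key, "")
    }

# Values copied from the two configs posted earlier in the thread.
train_cfg = {
    "src_prefix": "</s> hin_Deva",
    "tgt_prefix": "har_Deva",
    "src_suffix": "",
    "tgt_suffix": "",
}
infer_cfg = {
    "src_prefix": "hin_Deva",
    "tgt_prefix": "har_Deva",
    "src_suffix": "</s>",
    "tgt_suffix": "",
}

mismatches = prefix_suffix_mismatches(train_cfg, infer_cfg)
for key, (train_value, infer_value) in mismatches.items():
    print(f"{key}: train={train_value!r} inference={infer_value!r}")
```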

Hi @vince62s ,

Sorry for the late reply.

I tried what you suggested, keeping the same settings in the train and inference configs. I am attaching the config files and log here.

nllb-train.yaml:

share_vocab: true
src_vocab: "nllb-200/dictionary2.txt"
src_words_min_frequency: 1
src_vocab_size: 256212
tgt_vocab: "nllb-200/dictionary2.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256212
vocab_size_multiple: 1
decoder_start_token: '</s>'
#### Subword
src_subword_model: "nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    bible_data:
        path_src: "bible_data/hin-har/src-train.hin"
        path_tgt: "bible_data/hin-har/tgt-train.har"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "hin_Deva"
        tgt_prefix: "har_Deva"
        src_suffix: ""
        tgt_suffix: ""
update_vocab: true
train_from: "nllb-200/nllb-200-600M-onmt.pt"
reset_optim: all
save_data: "nllb-200"
save_model: "nllb-200/2nllb-200-600M-onmt_2000_steps"
log_file: "nllb-200/nllb-200-600M-onmt.log"
keep_checkpoint: 50
save_checkpoint_steps: 2000
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 2000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 4096
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

nllb-inference.yaml:

transforms: [sentencepiece, prefix, suffix]
# nllb-200 specific prefixing and suffixing
src_prefix: "hin_Deva"
tgt_prefix: "har_Deva"
tgt_file_prefix: true
src_suffix: ""
tgt_suffix: ""

#### Subword
src_subword_model: "nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Model info
model: "nllb-200/2nllb-200-600M-onmt_2000_steps_step_2000.pt"
# Inference
max_length: 512
gpu: 0
batch_type: tokens
batch_size: 2048
fp16:
beam_size: 5
report_time: true

I am not able to post the full training log because it exceeds the allowed number of lines.
Training log:

[2023-06-21 16:46:48,871 INFO] Parsed 1 corpora from -data.
[2023-06-21 16:46:48,888 INFO] Loading checkpoint from nllb-200/nllb-200-600M-onmt.pt
[2023-06-21 16:47:06,693 WARNING] configured transforms is different from checkpoint: +{'sentencepiece', 'prefix', 'suffix'}
[2023-06-21 16:47:06,693 INFO] Get prefix for bible_data: {'src': 'hin_Deva', 'tgt': 'har_Deva'}
[2023-06-21 16:47:06,693 INFO] Get prefix for src infer: 
[2023-06-21 16:47:06,693 INFO] Get prefix for tgt infer: 
[2023-06-21 16:47:06,693 INFO] Get suffix for bible_data: {'src': '', 'tgt': ''}
[2023-06-21 16:47:06,693 INFO] Get suffix for src infer: 
[2023-06-21 16:47:06,693 INFO] Get suffix for tgt infer: 
[2023-06-21 16:47:06,693 INFO] Get special vocabs from Transforms: {'src': ['hin_Deva'], 'tgt': ['har_Deva']}.
[2023-06-21 16:47:07,110 INFO] Updating checkpoint vocabulary with new vocabulary
[2023-06-21 16:47:07,112 INFO] Get prefix for bible_data: {'src': 'hin_Deva', 'tgt': 'har_Deva'}
[2023-06-21 16:47:07,113 INFO] Get prefix for src infer: 
[2023-06-21 16:47:07,114 INFO] Get prefix for tgt infer: 
[2023-06-21 16:47:07,115 INFO] Get suffix for bible_data: {'src': '', 'tgt': ''}
[2023-06-21 16:47:07,117 INFO] Get suffix for src infer: 
[2023-06-21 16:47:07,118 INFO] Get suffix for tgt infer: 
[2023-06-21 16:47:07,120 INFO] Get special vocabs from Transforms: {'src': ['hin_Deva'], 'tgt': ['har_Deva']}.
[2023-06-21 16:47:07,575 INFO] Building model...
[2023-06-21 16:47:16,739 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-06-21 16:47:17,551 INFO] src: 6 new tokens
[2023-06-21 16:47:19,140 INFO] tgt: 6 new tokens
[2023-06-21 16:47:26,729 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(256212, 1024, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=4096, bias=True)
          (w_2): Linear(in_features=4096, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (1): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=4096, bias=True)
          (w_2): Linear(in_features=4096, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      .
      .
      .
      (23): TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=4096, bias=True)
          (w_2): Linear(in_features=4096, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(256212, 1024, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
    (transformer_layers): ModuleList(
      (0): TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=4096, bias=True)
          (w_2): Linear(in_features=4096, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (layer_norm_2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      )
      (1): TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=4096, bias=True)
          (w_2): Linear(in_features=4096, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (layer_norm_2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      )
      .
      .
      .
      
      (23): TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=4096, bias=True)
          (w_2): Linear(in_features=4096, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (layer_norm_2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (generator): Linear(in_features=1024, out_features=256212, bias=True)
)
[2023-06-21 16:47:26,736 INFO] encoder: 564574208
[2023-06-21 16:47:26,736 INFO] decoder: 403181780
[2023-06-21 16:47:26,736 INFO] * number of parameters: 967755988
[2023-06-21 16:47:26,736 INFO]  * src vocab size = 256212
[2023-06-21 16:47:26,736 INFO]  * tgt vocab size = 256212
[2023-06-21 16:47:26,823 INFO] Get prefix for bible_data: {'src': 'hin_Deva', 'tgt': 'har_Deva'}
[2023-06-21 16:47:26,823 INFO] Get prefix for src infer: 
[2023-06-21 16:47:26,823 INFO] Get prefix for tgt infer: 
[2023-06-21 16:47:26,823 INFO] Get suffix for bible_data: {'src': '', 'tgt': ''}
[2023-06-21 16:47:26,823 INFO] Get suffix for src infer: 
[2023-06-21 16:47:26,823 INFO] Get suffix for tgt infer: 
[2023-06-21 16:47:26,866 INFO] Get prefix for bible_data: {'src': 'hin_Deva', 'tgt': 'har_Deva'}
[2023-06-21 16:47:26,866 INFO] Get prefix for src infer: 
[2023-06-21 16:47:26,866 INFO] Get prefix for tgt infer: 
[2023-06-21 16:47:26,866 INFO] Get suffix for bible_data: {'src': '', 'tgt': ''}
[2023-06-21 16:47:26,866 INFO] Get suffix for src infer: 
[2023-06-21 16:47:26,866 INFO] Get suffix for tgt infer: 
[2023-06-21 16:47:26,893 INFO] Starting training on GPU: [0]
[2023-06-21 16:47:26,893 INFO] Start training loop without validation...
[2023-06-21 16:47:26,893 INFO] Scoring with: TransformPipe()
[2023-06-21 16:49:08,263 INFO] Step 10/ 2000; acc: 3.1; ppl: 45189.8; xent: 10.7; lr: 0.01031; sents:    2484; bsz:  244/ 348/ 8; 770/1098 tok/s;    101 sec;
[2023-06-21 16:49:49,802 INFO] Step 20/ 2000; acc: 9.7; ppl: 1709.6; xent: 7.4; lr: 0.01969; sents:    2300; bsz:  242/ 348/ 7; 1868/2683 tok/s;    143 sec;
[2023-06-21 16:50:31,358 INFO] Step 30/ 2000; acc: 15.2; ppl: 722.0; xent: 6.6; lr: 0.02906; sents:    2189; bsz:  239/ 346/ 7; 1844/2662 tok/s;    184 sec;
[2023-06-21 16:51:13,050 INFO] Step 40/ 2000; acc: 19.3; ppl: 463.1; xent: 6.1; lr: 0.03844; sents:    2295; bsz:  242/ 348/ 7; 1859/2674 tok/s;    226 sec;
[2023-06-21 16:51:54,659 INFO] Step 50/ 2000; acc: 22.0; ppl: 337.4; xent: 5.8; lr: 0.04781; sents:    2148; bsz:  239/ 348/ 7; 1841/2675 tok/s;    268 sec;
[2023-06-21 16:52:36,455 INFO] Step 60/ 2000; acc: 24.5; ppl: 263.4; xent: 5.6; lr: 0.05719; sents:    2297; bsz:  243/ 348/ 7; 1863/2668 tok/s;    310 sec;
[2023-06-21 16:53:18,253 INFO] Step 70/ 2000; acc: 27.8; ppl: 217.8; xent: 5.4; lr: 0.06656; sents:    2343; bsz:  244/ 348/ 7; 1865/2664 tok/s;    351 sec;
[2023-06-21 16:54:00,225 INFO] Step 80/ 2000; acc: 29.9; ppl: 188.5; xent: 5.2; lr: 0.07594; sents:    2361; bsz:  245/ 351/ 7; 1867/2677 tok/s;    393 sec;
[2023-06-21 16:54:42,041 INFO] Step 90/ 2000; acc: 31.1; ppl: 169.5; xent: 5.1; lr: 0.08531; sents:    2132; bsz:  240/ 348/ 7; 1840/2663 tok/s;    435 sec;
[2023-06-21 16:55:23,932 INFO] Step 100/ 2000; acc: 33.6; ppl: 145.6; xent: 5.0; lr: 0.09328; sents:    2342; bsz:  244/ 347/ 7; 1862/2652 tok/s;    477 sec;
[2023-06-21 16:55:23,932 INFO] Train perplexity: 544.799
[2023-06-21 16:55:23,932 INFO] Train accuracy: 21.6312
[2023-06-21 16:55:23,932 INFO] Sentences processed: 22891
[2023-06-21 16:55:23,932 INFO] Average bsz:  242/ 348/ 7
.
.
.
[2023-06-21 19:01:25,673 INFO] Step 1910/ 2000; acc: 57.2; ppl:  36.1; xent: 3.6; lr: 0.02145; sents:    2147; bsz:  239/ 347/ 7; 1849/2683 tok/s;   8039 sec;
[2023-06-21 19:02:07,077 INFO] Step 1920/ 2000; acc: 57.6; ppl:  35.6; xent: 3.6; lr: 0.02139; sents:    2325; bsz:  243/ 348/ 7; 1881/2693 tok/s;   8080 sec;
[2023-06-21 19:02:48,975 INFO] Step 1930/ 2000; acc: 57.9; ppl:  35.1; xent: 3.6; lr: 0.02133; sents:    2345; bsz:  244/ 349/ 7; 1863/2666 tok/s;   8122 sec;
[2023-06-21 19:03:30,707 INFO] Step 1940/ 2000; acc: 57.5; ppl:  35.7; xent: 3.6; lr: 0.02128; sents:    2259; bsz:  242/ 347/ 7; 1855/2664 tok/s;   8164 sec;
[2023-06-21 19:04:12,552 INFO] Step 1950/ 2000; acc: 57.4; ppl:  36.1; xent: 3.6; lr: 0.02122; sents:    2257; bsz:  241/ 346/ 7; 1841/2648 tok/s;   8206 sec;
[2023-06-21 19:04:54,544 INFO] Step 1960/ 2000; acc: 56.9; ppl:  36.6; xent: 3.6; lr: 0.02117; sents:    2339; bsz:  244/ 349/ 7; 1859/2657 tok/s;   8248 sec;
[2023-06-21 19:05:36,495 INFO] Step 1970/ 2000; acc: 57.8; ppl:  35.1; xent: 3.6; lr: 0.02112; sents:    2364; bsz:  246/ 350/ 7; 1874/2672 tok/s;   8290 sec;
[2023-06-21 19:06:18,303 INFO] Step 1980/ 2000; acc: 57.4; ppl:  35.8; xent: 3.6; lr: 0.02106; sents:    2236; bsz:  241/ 348/ 7; 1842/2660 tok/s;   8331 sec;
[2023-06-21 19:07:00,243 INFO] Step 1990/ 2000; acc: 58.1; ppl:  34.6; xent: 3.5; lr: 0.02101; sents:    2503; bsz:  246/ 350/ 8; 1880/2671 tok/s;   8373 sec;
[2023-06-21 19:07:42,084 INFO] Step 2000/ 2000; acc: 57.6; ppl:  35.3; xent: 3.6; lr: 0.02096; sents:    2268; bsz:  242/ 349/ 7; 1852/2667 tok/s;   8415 sec;
[2023-06-21 19:07:42,084 INFO] Train perplexity: 53.5222
[2023-06-21 19:07:42,085 INFO] Train accuracy: 50.8492
[2023-06-21 19:07:42,085 INFO] Sentences processed: 463366
[2023-06-21 19:07:42,085 INFO] Average bsz:  243/ 349/ 7
[2023-06-21 19:07:42,194 INFO] Saving checkpoint nllb-200/2nllb-200-600M-onmt_2000_steps_step_2000.pt
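
As a side check on the log above, the lr column appears consistent with the noam schedule configured in nllb-train.yaml (learning_rate: 30, warmup_steps: 100, hidden_size: 1024). The sketch below uses the standard noam formula. Evaluating it at step + 1 is an assumption made only because it reproduces the logged values; it is not a statement about OpenNMT-py internals.

```python
def noam_lr(step, learning_rate=30.0, warmup_steps=100, hidden_size=1024):
    """Standard noam schedule: linear warm-up, then inverse square-root decay."""
    scale = learning_rate * hidden_size ** -0.5
    return scale * min(step ** -0.5, step * warmup_steps ** -1.5)

# (reported step, lr printed in the log above)
logged = [(10, 0.01031), (20, 0.01969), (100, 0.09328), (2000, 0.02096)]

for step, logged_lr in logged:
    # Assumption: the printed lr corresponds to step + 1.
    computed = noam_lr(step + 1)
    print(step, computed, logged_lr, abs(computed - logged_lr) < 1e-5)
```

So the learning rate peaks at roughly 0.09 around the end of warm-up and decays to roughly 0.02 by step 2000.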

Thanks for your time!

Sorry, my previous reply did not display the </s> correctly; I fixed it.
But in any case, I don’t think this is the issue.
The first steps start with a very low accuracy. The most common cause of this is a mismatch in the vocab indices. Maybe check the training without the vocab change to see if you observe a much higher accuracy.
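
A vocab index mismatch can be checked by comparing two vocabulary files position by position. The sketch below is an illustration under assumptions: the helpers `read_tokens` and `first_index_mismatch` are hypothetical, and the file format (one token per line, optionally followed by a space and a count) is assumed rather than taken from the OpenNMT-py documentation.

```python
def read_tokens(path):
    """Read one token per line; anything after the first space is ignored.

    Assumption: each line is "token" or "token count".
    """
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n").split(" ")[0] for line in handle]

def first_index_mismatch(vocab_a, vocab_b):
    """Return the first index where the two token lists differ, or None.

    If one list is a strict prefix of the other, the returned index is the
    length of the shorter list, meaning all shared indices are unchanged.
    """
    for index, (token_a, token_b) in enumerate(zip(vocab_a, vocab_b)):
        if token_a != token_b:
            return index
    if len(vocab_a) != len(vocab_b):
        return min(len(vocab_a), len(vocab_b))
    return None

# Toy vocabularies for illustration only.
original = ["<s>", "<pad>", "</s>", "<unk>", "a", "b"]
appended = original + ["new1", "new2"]   # new tokens added at the end
shifted = ["<s>", "<pad>", "new1", "</s>", "<unk>", "a", "b"]  # token inserted in the middle

print(first_index_mismatch(original, appended))
print(first_index_mismatch(original, shifted))
```

A result equal to the length of the original vocabulary suggests that tokens were only appended, while a smaller index points to a shift that would misalign the pretrained embeddings.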

Hi Vincent, I am also trying to fine-tune the nllb-200-distilled-600M model on the EN-DE language pair and I am experiencing the same issue: the first steps start with a very low accuracy / high perplexity, as if training were restarting from scratch.
I tried fine-tuning both with the original vocabulary (https://opennmt-models.s3.amazonaws.com/nllb-200/dictionary.txt) and spm model (https://opennmt-models.s3.amazonaws.com/nllb-200/flores200_sacrebleu_tokenizer_spm.model), and with a modified vocabulary and spm model following the tutorial. I experienced the same issue in both cases.

Here is the config file I used for fine-tuning:

data:
     corpus:
          path_src: "train.eng"
          path_tgt: "train.deu"
     valid:
          path_src: "valid.eng"
          path_tgt: "valid.deu"      
transforms: [sentencepiece, prefix, suffix, filtertoolong]
train_from: "nllb-200/nllb-200-600M-onmt.pt"
save_model: "nllb-200/runs/"   
#subwords
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
src_subword_model: "nllb-200/flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "nllb-200/flores200_sacrebleu_tokenizer_spm.model"
src_prefix: "</s> eng_Latn"
tgt_prefix: "deu_Latn"
src_suffix: ""
tgt_suffix: ""
decoder_start_token: "</s>"
#vocab
share_vocab: true
update_vocab: true
vocab_size_multiple: 1
src_vocab: "nllb-200/dictionary.txt"
tgt_vocab: "nllb-200/dictionary.txt"
src_vocab_size: 256206
tgt_vocab_size: 256206
src_words_min_frequency: 1
tgt_words_min_frequency: 1
#batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
batch_type: "tokens"
batch_size: 256
valid_batch_size: 256
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
world_size: 2
gpu_ranks: [0,1]
#general
train_steps: 2000
valid_steps: 100
train_eval_steps: 100
report_every: 10
seed: 1234
keep_checkpoint: -1
save_checkpoint_steps: 1000
#optimization
model_dtype: "fp16"
fp16:
reset_optim: "all"
optim: "fusedadam"
learning_rate: 2.0
adam_beta1: 0.9
adam_beta2: 0.998
max_grad_norm: 0
decay_method: "noam"
warmup_steps: 100
normalization: "tokens"
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
skip_empty_level: silent
#model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 12
dec_layers: 12
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 4096
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

I tried removing “</s>” from the src_prefix and putting it in src_suffix instead. I also tried different optimizers, learning rates, and batch sizes, but I always encounter the same issue. I understand it probably comes from a mismatch in the vocab indices. I compared “dictionary.txt” with the vocab obtained by applying extract_vocabulary.py to the pretrained model, and they seem to be identical. When using the option “update_vocab: true”, no new tokens are found.

Would you have any idea regarding the origin of the issue? Thanks a lot!

I just did this with no modification to the vocab / spm model:

share_vocab: true
src_vocab: "/media/vincent/Crucial X6/dataAI/nllb-200/dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 256206
tgt_vocab: "/media/vincent/Crucial X6/dataAI/nllb-200/dictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 256206
vocab_size_multiple: 1

# Corpus opts:
data:
    cc-matrix-0to50M-20-13:
        path_src: "/media/vincent/Crucial X6/dataAI/en-de/cc-matrix-ende-0to50M.scored.20-13.filtered.lsh.en"
        path_tgt: "/media/vincent/Crucial X6/dataAI/en-de/cc-matrix-ende-0to50M.scored.20-13.filtered.lsh.de"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "eng_Latn"
        tgt_prefix: "deu_Latn"
        src_suffix: "</s>"
        tgt_suffix: ""

    valid:
        path_src: "/media/vincent/Crucial X6/dataAI/en-de/testsets/newstest2021-src.en"
        path_tgt: "/media/vincent/Crucial X6/dataAI/en-de/testsets/newstest2021-refC.de"
        transforms: [sentencepiece, prefix, suffix]
        src_prefix: "eng_Latn"
        tgt_prefix: "</s> deu_Latn"
        src_suffix: "</s> "
        tgt_suffix: ""        

decoder_start_token: '</s>'
#### Subword
src_subword_model: "/media/vincent/Crucial\ X6/dataAI/nllb-200/flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "/media/vincent/Crucial\ X6/dataAI/nllb-200/flores200_sacrebleu_tokenizer_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
#### Filter
src_seq_length: 96
tgt_seq_length: 96

# silently ignore empty lines in the data
skip_empty_level: silent

# General opts
update_vocab: true
train_from: "/media/vincent/Crucial X6/dataAI/nllb-200/nllb-200-600M-onmt.pt"
reset_optim: all
save_data: "/media/vincent/Crucial X6/dataAI/nllb-200"
save_model: "/media/vincent/Crucial X6/dataAI/nllb-200/nllb-200-600M-onmt"
log_file: "/media/vincent/Crucial X6/dataAI/nllb-200/nllb-200-600M-onmt.log"
keep_checkpoint: 10
save_checkpoint_steps: 1000
seed: 1234
report_every: 100
train_steps: 10000
valid_steps: 1000

# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 384
valid_batch_size: 128
batch_size_multiple: 1
accum_count: [32]
accum_steps: [0]

# Optimization
model_dtype: "fp16"
optim: "fusedadam"
learning_rate: 0.0001
warmup_steps: 100
decay_method: "none"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 12
dec_layers: 12
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 4096
add_qkvbias: true
add_ffnbias: true
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

[2023-06-26 17:56:00,435 INFO] Starting training on GPU: [0]
[2023-06-26 17:56:00,436 INFO] Start training loop and validate every 1000 steps...
[2023-06-26 17:56:00,436 INFO] Scoring with: TransformPipe(SentencePieceTransform(share_vocab=True, src_subword_model=/media/vincent/Crucial X6/dataAI/nllb-200/flores200_sacrebleu_tokenizer_spm.model, tgt_subword_model=/media/vincent/Crucial X6/dataAI/nllb-200/flores200_sacrebleu_tokenizer_spm.model, src_subword_alpha=0.0, tgt_subword_alpha=0.0, src_subword_vocab=, tgt_subword_vocab=, src_vocab_threshold=0, tgt_vocab_threshold=0, src_subword_nbest=1, tgt_subword_nbest=1), PrefixTransform(prefix_dict={'cc-matrix-0to50M-20-13': {'src': 'eng_Latn', 'tgt': 'deu_Latn'}, 'valid': {'src': 'eng_Latn', 'tgt': '</s> deu_Latn'}, 'infer': {'src': 'eng_Latn', 'tgt': '</s> deu_Latn'}}), SuffixTransform(suffix_dict={'cc-matrix-0to50M-20-13': {'src': '</s>', 'tgt': ''}, 'valid': {'src': '</s> ', 'tgt': ''}, 'infer': {'src': '</s> ', 'tgt': ''}}))
[2023-06-26 17:56:01,280 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 1
[2023-06-26 17:56:01,998 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 1
[2023-06-26 17:56:02,766 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 1
[2023-06-26 17:56:03,535 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 1
[2023-06-26 17:56:06,640 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 2
[2023-06-26 17:56:06,701 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 2
[2023-06-26 17:56:06,726 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 2
[2023-06-26 17:56:06,777 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 2
[2023-06-26 17:56:10,752 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 3
[2023-06-26 17:56:10,846 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 3
[2023-06-26 17:56:10,872 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 3
[2023-06-26 17:56:10,912 INFO] Weighted corpora loaded so far:
			* cc-matrix-0to50M-20-13: 3
[2023-06-26 18:00:30,270 INFO] Step 100/10000; acc: 82.4; ppl:  10.3; xent: 2.3; lr: 0.00010; sents:   34348; bsz:  291/ 354/11; 3451/4194 tok/s;    270 sec;

See the accuracy.

It is difficult for me to help further.
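
The contrast between the runs can be read directly from the first "Step" line of each log. Below is a minimal sketch of pulling the accuracy out of such lines; the helper `parse_step` and the regular expression are written for the log lines quoted in this thread and are not part of OpenNMT-py.

```python
import re

STEP_RE = re.compile(
    r"Step\s+(\d+)/\s*(\d+); acc:\s*([\d.]+); ppl:\s*([\d.]+); xent:\s*([\d.]+)"
)

def parse_step(line):
    """Extract step, accuracy, perplexity and cross-entropy from a log line."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    return {
        "step": int(match.group(1)),
        "acc": float(match.group(3)),
        "ppl": float(match.group(4)),
        "xent": float(match.group(5)),
    }

# First reported steps of the two runs, copied from the logs in this thread.
hin_har = "[2023-06-21 16:49:08,263 INFO] Step 10/ 2000; acc: 3.1; ppl: 45189.8; xent: 10.7; lr: 0.01031;"
eng_deu = "[2023-06-26 18:00:30,270 INFO] Step 100/10000; acc: 82.4; ppl:  10.3; xent: 2.3; lr: 0.00010;"

for line in (hin_har, eng_deu):
    print(parse_step(line))
```

Note that the two lines report different steps (10 and 100), so they are not a like-for-like comparison, but the Hindi-Haryanvi run is still only at 33.6 accuracy by step 100.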

Hi @vince62s ,

Is there any difference between installing OpenNMT-py through pip and installing it from source?
I was doing my experiments with the pip installation, and used onmt_train and onmt_translate for training and inference respectively.

No, they should be the same. The only difference is that a build from source will be more up to date with the changes made to the master/main branch, but the package on pip is updated every 1 or 2 weeks.
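
To see which release is actually installed in an environment, the package metadata can be queried from Python. This is a small sketch using only the standard library; the helper `installed_version` is hypothetical, and "OpenNMT-py" is the distribution name used on pip.

```python
from importlib import metadata

def installed_version(dist_name="OpenNMT-py"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

print(installed_version())
```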