Finetuning bigger models with LoRa (Low-Rank Adaptation) in OpenNMT-py

After merging the LoRa weights, translation raises the following issue:

  File "/home/aman/Documents/geeztranslation/OpenNMT-py/onmt/bin/translate.py", line 60, in main
    translate(opt)
  File "/home/aman/Documents/geeztranslation/OpenNMT-py/onmt/bin/translate.py", line 23, in translate
    translator = build_translator(opt, logger=logger,
  File "/home/aman/Documents/geeztranslation/OpenNMT-py/onmt/translate/translator.py", line 33, in build_translator
    vocabs, model, model_opt = load_test_model(opt)
  File "/home/aman/Documents/geeztranslation/OpenNMT-py/onmt/model_builder.py", line 171, in load_test_model
    model = build_base_model(model_opt, vocabs, checkpoint)
  File "/home/aman/Documents/geeztranslation/OpenNMT-py/onmt/model_builder.py", line 402, in build_base_model
    model.load_state_dict(checkpoint['model'],
  File "/home/aman/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for NMTModel:
	Missing key(s) in state_dict: "encoder.transformer.0.self_attn.linear_keys.bias", "encoder.transformer.0.self_attn.linear_values.bias", "encoder.transformer.0.self_attn.linear_query.bias", "encoder.transformer.0.self_attn.final_linear.bias", "encoder.transformer.1.self_attn.linear_keys.bias", "encoder.transformer.1.self_attn.linear_values.bias",

I did the following trick and the inference ran successfully:

import torch

# Load the merged checkpoint, force the saved add_qkvbias option off, and save it back.
m = torch.load("test_3_3B/nllb-200-lora-3_3B_step_10200.pt")
m['opt'].add_qkvbias = False
torch.save(m, "test_3_3B/nllb-200-lora-3_3B_step_10200.pt")
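
Before flipping flags like this, it can help to inspect what the checkpoint actually stores. A minimal sketch, assuming the options object behaves like an argparse Namespace (as the snippet above implies); add_ffnbias is another bias option that comes up later in this thread:

import torch

# Print the bias-related options saved in the checkpoint before overriding them.
ckpt = torch.load("test_3_3B/nllb-200-lora-3_3B_step_10200.pt", map_location="cpu")
opt = ckpt["opt"]
for name in ("add_qkvbias", "add_ffnbias"):
    print(name, getattr(opt, name, "<not set>"))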

But the prediction was weird…

 ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ 
 ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ 

I also had to reduce the batch size to 32 for inference because of an OOM issue.
I finetuned NLLB for a new language that is not in NLLB, but I also checked the finetuned model on a language originally in NLLB and the result is the same weird prediction.
There is no problem with the spm model; I have checked it separately. Here is my inference config.

config = '''transforms: [sentencepiece, prefix, suffix]
# nllb-200 specific prefixing and suffixing
src_prefix: "eng_Latn"
tgt_prefix: "gez_Ethi"
tgt_file_prefix: true
src_suffix: "</s>"
tgt_suffix: ""


#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Model info
model: "geez_nllb_finetuned_1.pt"
# Inference
max_length: 512
gpu: 0
batch_type: tokens
batch_size: 32
fp16:
beam_size: 5
report_time: true'''
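
For completeness, a sketch of how this config string can be used, assuming OpenNMT-py's standard onmt_translate entry point; the source and output file names are placeholders:

import subprocess

# Write the YAML string above to a file and run translation with it.
with open("inference.yaml", "w") as f:
    f.write(config)

subprocess.run(
    ["onmt_translate", "-config", "inference.yaml",
     "-src", "en_test.txt", "-output", "gez_pred.txt"],
    check=True,
)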

Here are the sizes of my models before and after finetuning…

-rw-rw-r-- 1 aman aman  33M May 21 04:24 finetuned/gez_nllb_step_20000.pt
-rw-rw-r-- 1 aman aman 4.8M May 16 08:31 flores200_sacrebleu_tokenizer_spm2.model
-rw-rw-r-- 1 aman aman 2.6G May 23 08:53 geez_nllb_finetuned.pt
-rw-rw-r-- 1 aman aman 4.7M May 11 12:48 nllb-200/flores200_sacrebleu_tokenizer_spm.model
-rw-rw-r-- 1 aman aman 3.6G May 11 12:48 nllb-200/nllb-200-1.3Bdst-onmt.pt 

First:
Try to translate with the original nllb-200-1.3Bdst-onmt.pt model with existing languages, to make sure you follow what needs to be done. You should not get an OOM with a 32-token batch size.

Second:
Post your finetuning log to make sure things were OK during the finetuning. An earlier post (done with SGD) looked strange, with very, very high accuracies. Post your last training run.

I did test the original NLLB model with an existing language and got 13.4 BLEU.
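
For reference, a minimal sketch of scoring such predictions with the sacrebleu Python API; the file names are placeholders and this is not necessarily how the number above was obtained:

import sacrebleu

# Detokenized hypotheses and references, one sentence per line (placeholder files).
with open("pred.txt") as f:
    hyps = [line.strip() for line in f]
with open("ref.txt") as f:
    refs = [line.strip() for line in f]

print(sacrebleu.corpus_bleu(hyps, [refs]))  # prints something like "BLEU = 13.4 ..."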

Here is the finetuning log. I used fusedadam, but the accuracies are similar to the SGD run.

[2023-05-19 12:57:38,706 INFO] Loading checkpoint from nllb-200/nllb-200-1.3Bdst-onmt.pt.1
[2023-05-19 12:57:40,337 WARNING] configured transforms is different from checkpoint: +{'sentencepiece', 'suffix', 'prefix'}
[2023-05-19 12:57:40,337 INFO] Get suffix for cc-matrix-enzh: {'src': '</s>', 'tgt': ''}
[2023-05-19 12:57:40,337 INFO] Get suffix for src infer: 
[2023-05-19 12:57:40,337 INFO] Get suffix for tgt infer: 
[2023-05-19 12:57:40,337 INFO] Get prefix for cc-matrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-19 12:57:40,337 INFO] Get prefix for src infer: 
[2023-05-19 12:57:40,337 INFO] Get prefix for tgt infer: 
[2023-05-19 12:57:40,337 INFO] Get special vocabs from Transforms: {'src': ['</s>', '</s>', 'eng_Latn'], 'tgt': ['gez_Ethi']}.
[2023-05-19 12:57:40,902 INFO] Updating checkpoint vocabulary with new vocabulary
[2023-05-19 12:57:40,903 INFO] Get suffix for cc-matrix-enzh: {'src': '</s>', 'tgt': ''}
[2023-05-19 12:57:40,904 INFO] Get suffix for src infer: 
[2023-05-19 12:57:40,905 INFO] Get suffix for tgt infer: 
[2023-05-19 12:57:40,906 INFO] Get prefix for cc-matrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-19 12:57:40,908 INFO] Get prefix for src infer: 
[2023-05-19 12:57:40,909 INFO] Get prefix for tgt infer: 
[2023-05-19 12:57:40,911 INFO] Get special vocabs from Transforms: {'src': ['</s>', '</s>', 'eng_Latn'], 'tgt': ['gez_Ethi']}.
[2023-05-19 12:57:41,534 INFO] Over-ride model option set to true - use with care
[2023-05-19 12:57:41,534 INFO] Option: config , value: finetuned/finetune.yaml overiding model: 
[2023-05-19 12:57:41,534 INFO] Option: data , value: {'cc-matrix-enzh': {'path_src': 'gmmt/en_train.txt', 'path_tgt': 'gmmt/gez_train.txt', 'transforms': ['sentencepiece', 'prefix', 'suffix', 'filtertoolong'], 'weight': 10, 'src_prefix': '</s> eng_Latn', 'tgt_prefix': 'gez_Ethi', 'src_suffix': '</s>', 'tgt_suffix': '', 'path_align': None}} overiding model: {}
[2023-05-19 12:57:41,534 INFO] Option: skip_empty_level , value: warning overiding model: silent
[2023-05-19 12:57:41,534 INFO] Option: save_data , value: finetuned overiding model: 
[2023-05-19 12:57:41,534 INFO] Option: src_vocab , value: newdictionary.txt overiding model: 
[2023-05-19 12:57:41,534 INFO] Option: tgt_vocab , value: newdictionary.txt overiding model: 
[2023-05-19 12:57:41,534 INFO] Option: src_vocab_size , value: 260926 overiding model: 256206
[2023-05-19 12:57:41,534 INFO] Option: tgt_vocab_size , value: 260926 overiding model: 256206
[2023-05-19 12:57:41,534 INFO] Option: src_subword_model , value: flores200_sacrebleu_tokenizer_spm2.model overiding model: 
[2023-05-19 12:57:41,534 INFO] Option: tgt_subword_model , value: flores200_sacrebleu_tokenizer_spm2.model overiding model: 
[2023-05-19 12:57:41,535 INFO] Option: src_seq_length , value: 192 overiding model: 150
[2023-05-19 12:57:41,535 INFO] Option: tgt_seq_length , value: 192 overiding model: 150
[2023-05-19 12:57:41,535 INFO] Option: update_vocab , value: True overiding model: False
[2023-05-19 12:57:41,535 INFO] Option: add_qkvbias , value: False overiding model: True
[2023-05-19 12:57:41,535 INFO] Option: save_model , value: finetuned/gez_nllb overiding model: nllb
[2023-05-19 12:57:41,535 INFO] Option: save_checkpoint_steps , value: 100 overiding model: 5000
[2023-05-19 12:57:41,535 INFO] Option: train_from , value: nllb-200/nllb-200-1.3Bdst-onmt.pt.1 overiding model: 
[2023-05-19 12:57:41,535 INFO] Option: reset_optim , value: all overiding model: none
[2023-05-19 12:57:41,535 INFO] Option: num_workers , value: 2 overiding model: 4
[2023-05-19 12:57:41,535 INFO] Option: batch_size , value: 256 overiding model: 8192
[2023-05-19 12:57:41,535 INFO] Option: accum_count , value: [32, 32, 32] overiding model: [4]
[2023-05-19 12:57:41,535 INFO] Option: accum_steps , value: [0, 15000, 30000] overiding model: [0]
[2023-05-19 12:57:41,535 INFO] Option: valid_steps , value: 100 overiding model: 5000
[2023-05-19 12:57:41,535 INFO] Option: valid_batch_size , value: 256 overiding model: 4096
[2023-05-19 12:57:41,535 INFO] Option: train_steps , value: 20000 overiding model: 100000
[2023-05-19 12:57:41,535 INFO] Option: optim , value: fusedadam overiding model: 
[2023-05-19 12:57:41,535 INFO] Option: dropout , value: [0.1, 0.1, 0.1] overiding model: [0.1]
[2023-05-19 12:57:41,536 INFO] Option: attention_dropout , value: [0.1, 0.1, 0.1] overiding model: [0.1]
[2023-05-19 12:57:41,536 INFO] Option: dropout_steps , value: [0, 15000, 30000] overiding model: [0]
[2023-05-19 12:57:41,536 INFO] Option: average_decay , value: 0.0005 overiding model: 0.0
[2023-05-19 12:57:41,536 INFO] Option: learning_rate , value: 0.1 overiding model: 5e-05
[2023-05-19 12:57:41,536 INFO] Option: decay_method , value: noam overiding model: none
[2023-05-19 12:57:41,536 INFO] Option: warmup_steps , value: 50 overiding model: 4000
[2023-05-19 12:57:41,536 INFO] Option: log_file , value: finetuned/finetuned.log overiding model: 
[2023-05-19 12:57:41,536 INFO] Option: report_every , value: 10 overiding model: 100
[2023-05-19 12:57:41,536 INFO] Option: _all_transform , value: {'sentencepiece', 'filtertoolong', 'suffix', 'prefix'} overiding model: {'filtertoolong'}
[2023-05-19 12:57:41,536 INFO] Building model...
[2023-05-19 12:57:51,128 INFO] Adding LoRa layers for linear_values
[2023-05-19 12:57:51,924 INFO] Adding LoRa layers for linear_query
[2023-05-19 12:57:52,723 INFO] Adding LoRa layers for linear_keys
[2023-05-19 12:57:53,521 INFO] Adding LoRa layers for final_linear
[2023-05-19 12:58:03,997 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-05-19 12:58:04,384 INFO] src: 260921 new tokens
[2023-05-19 12:58:04,830 INFO] tgt: 260921 new tokens
[2023-05-19 12:58:07,084 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(260926, 1024, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-23): 24 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=8192, bias=True)
          (w_2): Linear(in_features=8192, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(260926, 1024, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
    (transformer_layers): ModuleList(
      (0-23): 24 x TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=8192, bias=True)
          (w_2): Linear(in_features=8192, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (layer_norm_2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (generator): Linear(in_features=1024, out_features=260926, bias=True)
)
[2023-05-19 12:58:07,092 INFO] encoder: 771219456
[2023-05-19 12:58:07,092 INFO] decoder: 605397822
[2023-05-19 12:58:07,092 INFO] * number of parameters: 1376617278
[2023-05-19 12:58:07,092 INFO]  * src vocab size = 260926
[2023-05-19 12:58:07,092 INFO]  * tgt vocab size = 260926
[2023-05-19 12:58:07,195 INFO] Get suffix for cc-matrix-enzh: {'src': '</s>', 'tgt': ''}
[2023-05-19 12:58:07,195 INFO] Get suffix for src infer: 
[2023-05-19 12:58:07,195 INFO] Get suffix for tgt infer: 
[2023-05-19 12:58:07,196 INFO] Get prefix for cc-matrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-19 12:58:07,196 INFO] Get prefix for src infer: 
[2023-05-19 12:58:07,196 INFO] Get prefix for tgt infer: 
[2023-05-19 12:58:07,274 INFO] Get suffix for cc-matrix-enzh: {'src': '</s>', 'tgt': ''}
[2023-05-19 12:58:07,274 INFO] Get suffix for src infer: 
[2023-05-19 12:58:07,274 INFO] Get suffix for tgt infer: 
[2023-05-19 12:58:07,274 INFO] Get prefix for cc-matrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-19 12:58:07,274 INFO] Get prefix for src infer: 
[2023-05-19 12:58:07,274 INFO] Get prefix for tgt infer: 
[2023-05-19 12:58:07,316 INFO] Starting training on GPU: [0]
[2023-05-19 12:58:07,316 INFO] Start training loop without validation...
[2023-05-19 12:58:07,316 INFO] Scoring with: TransformPipe()
[2023-05-19 13:00:29,289 INFO] Step 10/20000; acc: 83.8; ppl:  38.9; xent: 3.7; lr: 0.00010; sents:    2130; bsz:  229/ 168/ 7; 517/378 tok/s;    142 sec;
[2023-05-19 13:01:39,009 INFO] Step 20/20000; acc: 86.3; ppl:  29.0; xent: 3.4; lr: 0.00019; sents:    1961; bsz:  230/ 167/ 6; 1055/767 tok/s;    212 sec;
[2023-05-19 13:02:48,279 INFO] Step 30/20000; acc: 89.5; ppl:  18.7; xent: 2.9; lr: 0.00027; sents:    1936; bsz:  228/ 166/ 6; 1056/767 tok/s;    281 sec;
[2023-05-19 13:03:57,596 INFO] Step 40/20000; acc: 91.5; ppl:  12.0; xent: 2.5; lr: 0.00036; sents:    2027; bsz:  230/ 169/ 6; 1063/782 tok/s;    350 sec;
[2023-05-19 13:05:06,485 INFO] Step 50/20000; acc: 92.2; ppl:   9.7; xent: 2.3; lr: 0.00044; sents:    2007; bsz:  229/ 167/ 6; 1064/777 tok/s;    419 sec;
[2023-05-19 13:06:15,215 INFO] Step 60/20000; acc: 92.4; ppl:   8.8; xent: 2.2; lr: 0.00040; sents:    1999; bsz:  231/ 167/ 6; 1075/778 tok/s;    488 sec;
[2023-05-19 13:07:24,051 INFO] Step 70/20000; acc: 92.3; ppl:   8.3; xent: 2.1; lr: 0.00037; sents:    2043; bsz:  230/ 168/ 6; 1067/782 tok/s;    557 sec;
[2023-05-19 13:08:33,787 INFO] Step 80/20000; acc: 92.4; ppl:   7.9; xent: 2.1; lr: 0.00035; sents:    2011; bsz:  229/ 168/ 6; 1052/772 tok/s;    626 sec;
[2023-05-19 13:09:43,467 INFO] Step 90/20000; acc: 92.3; ppl:   7.7; xent: 2.0; lr: 0.00033; sents:    2002; bsz:  230/ 167/ 6; 1056/765 tok/s;    696 sec;
[2023-05-19 13:10:53,795 INFO] Step 100/20000; acc: 92.5; ppl:   7.3; xent: 2.0; lr: 0.00031; sents:    2007; bsz:  231/ 167/ 6; 1049/762 tok/s;    766 sec;
[2023-05-19 13:10:53,795 INFO] Train perplexity: 12.3101
[2023-05-19 13:10:53,795 INFO] Train accuracy: 90.5041
[2023-05-19 13:10:53,796 INFO] Sentences processed: 20123
[2023-05-19 13:10:53,796 INFO] Average bsz:  230/ 167/ 6
[2023-05-19 13:10:53,877 INFO] Saving checkpoint finetuned/gez_nllb_step_100.pt
[2023-05-19 13:12:04,014 INFO] Step 110/20000; acc: 93.0; ppl:   6.8; xent: 1.9; lr: 0.00030; sents:    1993; bsz:  229/ 168/ 6; 1046/765 tok/s;    837 sec;
[2023-05-19 13:13:13,447 INFO] Step 120/20000; acc: 94.5; ppl:   6.4; xent: 1.9; lr: 0.00028; sents:    2046; bsz:  229/ 167/ 6; 1054/768 tok/s;    906 sec;
[2023-05-19 13:14:22,857 INFO] Step 130/20000; acc: 96.1; ppl:   5.8; xent: 1.8; lr: 0.00027; sents:    2041; bsz:  233/ 169/ 6; 1074/780 tok/s;    976 sec;
[2023-05-19 13:15:32,233 INFO] Step 140/20000; acc: 96.2; ppl:   5.6; xent: 1.7; lr: 0.00026; sents:    2051; bsz:  232/ 170/ 6; 1069/784 tok/s;   1045 sec;
[2023-05-19 13:16:41,809 INFO] Step 150/20000; acc: 96.3; ppl:   5.6; xent: 1.7; lr: 0.00025; sents:    1973; bsz:  230/ 168/ 6; 1060/772 tok/s;   1114 sec;
[2023-05-19 13:17:52,356 INFO] Step 160/20000; acc: 96.2; ppl:   5.6; xent: 1.7; lr: 0.00025; sents:    2043; bsz:  230/ 169/ 6; 1042/767 tok/s;   1185 sec;
[2023-05-19 13:19:01,503 INFO] Step 170/20000; acc: 96.3; ppl:   5.5; xent: 1.7; lr: 0.00024; sents:    1949; bsz:  230/ 165/ 6; 1063/765 tok/s;   1254 sec;
[2023-05-19 13:20:10,587 INFO] Step 180/20000; acc: 96.2; ppl:   5.5; xent: 1.7; lr: 0.00023; sents:    2026; bsz:  231/ 169/ 6; 1070/783 tok/s;   1323 sec;
[2023-05-19 13:21:20,314 INFO] Step 190/20000; acc: 95.9; ppl:   5.6; xent: 1.7; lr: 0.00023; sents:    2159; bsz:  233/ 170/ 7; 1068/780 tok/s;   1393 sec;
[2023-05-19 13:22:29,105 INFO] Step 200/20000; acc: 96.1; ppl:   5.5; xent: 1.7; lr: 0.00022; sents:    2057; bsz:  229/ 167/ 6; 1067/778 tok/s;   1462 sec;
[2023-05-19 13:22:29,105 INFO] Train perplexity: 8.43188
[2023-05-19 13:22:29,105 INFO] Train accuracy: 93.1023
[2023-05-19 13:22:29,105 INFO] Sentences processed: 40461
[2023-05-19 13:22:29,105 INFO] Average bsz:  230/ 168/ 6
[2023-05-19 13:22:29,184 INFO] Saving checkpoint finetuned/gez_nllb_step_200.pt
[2023-05-19 13:23:37,871 INFO] Step 210/20000; acc: 96.0; ppl:   5.5; xent: 1.7; lr: 0.00022; sents:    2164; bsz:  230/ 170/ 7; 1070/790 tok/s;   1531 sec;
[2023-05-19 13:24:46,878 INFO] Step 220/20000; acc: 96.1; ppl:   5.5; xent: 1.7; lr: 0.00021; sents:    2110; bsz:  231/ 170/ 7; 1072/787 tok/s;   1600 sec;
[2023-05-19 13:25:55,373 INFO] Step 230/20000; acc: 96.1; ppl:   5.5; xent: 1.7; lr: 0.00021; sents:    2084; bsz:  230/ 168/ 7; 1073/787 tok/s;   1668 sec;
[2023-05-19 13:27:04,151 INFO] Step 240/20000; acc: 96.0; ppl:   5.5; xent: 1.7; lr: 0.00020; sents:    2122; bsz:  231/ 167/ 7; 1076/778 tok/s;   1737 sec;
[2023-05-19 13:28:14,167 INFO] Step 250/20000; acc: 96.2; ppl:   5.5; xent: 1.7; lr: 0.00020; sents:    2035; bsz:  230/ 170/ 6; 1050/775 tok/s;   1807 sec;
[2023-05-19 13:29:22,822 INFO] Step 260/20000; acc: 96.2; ppl:   5.5; xent: 1.7; lr: 0.00019; sents:    2011; bsz:  230/ 167/ 6; 1072/780 tok/s;   1876 sec;
[2023-05-19 13:30:32,497 INFO] Step 270/20000; acc: 96.3; ppl:   5.5; xent: 1.7; lr: 0.00019; sents:    1980; bsz:  230/ 166/ 6; 1055/764 tok/s;   1945 sec;
[2023-05-19 13:31:41,765 INFO] Step 280/20000; acc: 96.3; ppl:   5.5; xent: 1.7; lr: 0.00019; sents:    1963; bsz:  231/ 167/ 6; 1066/770 tok/s;   2014 sec;
[2023-05-19 13:32:51,875 INFO] Step 290/20000; acc: 96.1; ppl:   5.5; xent: 1.7; lr: 0.00018; sents:    2112; bsz:  230/ 168/ 7; 1048/766 tok/s;   2085 sec;
[2023-05-19 13:34:01,315 INFO] Step 300/20000; acc: 96.2; ppl:   5.5; xent: 1.7; lr: 0.00018; sents:    2048; bsz:  231/ 168/ 6; 1063/773 tok/s;   2154 sec;
[2023-05-19 13:34:01,315 INFO] Train perplexity: 7.31255
[2023-05-19 13:34:01,315 INFO] Train accuracy: 94.1191
[2023-05-19 13:34:01,315 INFO] Sentences processed: 61090
[2023-05-19 13:34:01,315 INFO] Average bsz:  230/ 168/ 6
[2023-05-19 13:34:01,397 INFO] Saving checkpoint finetuned/gez_nllb_step_300.pt
[2023-05-19 13:35:10,268 INFO] Step 310/20000; acc: 96.1; ppl:   5.5; xent: 1.7; lr: 0.00018; sents:    2025; bsz:  229/ 165/ 6; 

Here is the train config…

share_vocab: true
src_vocab: "newdictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 260926
tgt_vocab: "newdictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 260926
vocab_size_multiple: 1
decoder_start_token: '</s>'

#LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 2
lora_dropout: 0.0
lora_alpha: 1
lora_embedding: false


#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    cc-matrix-enzh:
        path_src: "gmmt/en_train.txt"
        path_tgt: "gmmt/gez_train.txt"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "</s> eng_Latn"
        tgt_prefix: "gez_Ethi"
        src_suffix: "</s>"
        tgt_suffix: ""
update_vocab: true
train_from: "nllb-200/nllb-200-1.3Bdst-onmt.pt.1"
reset_optim: all
save_data: "finetuned"
save_model: "finetuned/gez_nllb"
log_file: "finetuned/finetuned.log"
keep_checkpoint: 50
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 20000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 2
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 256
valid_batch_size: 256
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "fusedadam"
learning_rate: 0.1
warmup_steps: 50
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

These lines in your data config:

        src_prefix: "</s> eng_Latn"
        tgt_prefix: "gez_Ethi"
        src_suffix: "</s>"
        tgt_suffix: ""

should be

        src_prefix: "eng_Latn"
        tgt_prefix: "gez_Ethi"
        src_suffix: "</s>"
        tgt_suffix: ""

but this is probably not the reason. I don’t know what language you are trying to add, but this must come from your training data. Accuracy cannot be that high unless source and target are very very close.

Please, this kind of accuracy is all I see whenever I train a model using OpenNMT. I have trained a couple of bilingual and multilingual models. The ‘acc’ at the end is around 100, but the BLEU scores at the end look reasonable (around 10). That could mean all my experiments were wrong. I don’t actually understand what ‘acc’ is; its value is different from the train accuracy. I’ve checked the documentation about this but didn’t find an explanation. Please help…
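
For context: the per-step ‘acc’ in the log is a token-level training accuracy under teacher forcing (the fraction of non-padding target tokens whose next-token prediction matches the reference in the current reporting window), not BLEU, and the ‘Train accuracy’ line appears to be accumulated over all steps so far, which is why the two numbers differ. A generic sketch of such a metric (not the exact OpenNMT-py code):

import torch

def token_accuracy(logits: torch.Tensor, target: torch.Tensor, pad_idx: int = 1) -> float:
    # logits: (batch, seq_len, vocab); target: (batch, seq_len) gold token ids.
    pred = logits.argmax(dim=-1)
    non_pad = target.ne(pad_idx)
    correct = pred.eq(target) & non_pad
    return 100.0 * correct.sum().item() / max(non_pad.sum().item(), 1)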

Open a new thread with the title
“Adding language Geez Ethiopian to NLLB”

and put:
the new vocabulary entries
the steps you did to modify the spm model
a repost of your training config
some training samples, and the size of your training set
the command line you used to train and infer.

Thanks, just trying to avoid unrelated info in this thread.


I tried to finetune the NLLB 1.3B model with LoRa, but I get an error when translating. Did I miss something?

File "/root/.conda/envs/opneNMT/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 160, in forward
    return F.embedding(
  File "/root/.conda/envs/opneNMT/lib/python3.10/site-packages/torch/nn/functional.py", line 2210, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Here is what I have done:
added the new characters to the dictionary, then modified the spm model (see the sketch below).
trained the LoRa.
merged the LoRa weights.
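
A sketch of the spm step mentioned above, assuming the common approach of appending user-defined pieces to the SentencePiece model protobuf; the file names and symbols are placeholders and this may not be exactly what was done here:

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# Append new user-defined symbols to an existing SentencePiece model.
m = sp_pb2.ModelProto()
with open("flores200_sacrebleu_tokenizer_spm.model", "rb") as f:
    m.ParseFromString(f.read())

for tok in ["new_symbol_1", "new_symbol_2"]:  # placeholder symbols
    piece = sp_pb2.ModelProto.SentencePiece()
    piece.piece = tok
    piece.score = 0.0
    piece.type = sp_pb2.ModelProto.SentencePiece.USER_DEFINED
    m.pieces.append(piece)

with open("newspm.model", "wb") as f:
    f.write(m.SerializeToString())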

Here is my config.yaml:

share_vocab: true
src_vocab: "/root/ai/NLLB-Finetune/dic3.txt"
src_words_min_frequency: 1
src_vocab_size: 278728
tgt_vocab: "/root/ai/NLLB-Finetune/dic3.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 278728
vocab_size_multiple: 1
decoder_start_token: '</s>'


#LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 4
lora_dropout: 0.0
lora_alpha: 1
lora_embedding: false


#### Subword
src_subword_model: "/root/ai/NLLB-Finetune/newspm.model"
tgt_subword_model: "/root/ai/NLLB-Finetune/newspm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    ccmatrix-enzh:
        path_src: "/root/ai/NLLB-Finetune/shoot_data/zh-ko/game-data.zh_cn-filtered.zh.subword.train"
        path_tgt: "/root/ai/NLLB-Finetune/shoot_data/zh-ko/game-data.ko-filtered.ko.subword.train"
        transforms: [sentencepiece, prefix, suffix,filtertoolong]
        weight: 10
        src_prefix: "</s> zho_Hans"
        tgt_prefix: "kor_Hang"
        src_suffix: ""
        tgt_suffix: ""
update_vocab: true
train_from: "/root/ai/models/nllb-200-1.3b-onmt.pt"
reset_optim: "all"
save_data: "nllb-200"
save_model: "/root/ai/NLLB-Finetune/nllb-200/nllb-200-1.3b-dic3.pt"
log_file: "/root/ai/NLLB-Finetune/nllb-200/nllb-200.log"
keep_checkpoint: 5
save_checkpoint_steps: 10
average_decay: 0.0005
seed: 1234
report_every: 100
train_steps: 10
valid_steps: 1000
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 2
gpu_ranks: [0,1]
batch_type: "tokens"
batch_size: 512
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "sgd"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'
add_ffnbias: true
add_qkvbias: true

It seems to be a problem with the embedding size. How can I fix it?
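
One hedged way to narrow this down: compare the number of embedding rows stored in the checkpoint with the vocabulary actually used at inference (an IndexError in F.embedding usually means a token id is greater than or equal to the number of rows). The key name below is inferred from the NMTModel printout earlier in this thread and the path is a placeholder:

import torch

ckpt = torch.load("merged_model.pt", map_location="cpu")  # placeholder path
emb_key = "encoder.embeddings.make_embedding.emb_luts.0.weight"  # inferred key name
print("embedding rows:", ckpt["model"][emb_key].shape[0])
print("configured src_vocab_size:", getattr(ckpt["opt"], "src_vocab_size", "<not set>"))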

Has anyone successfully fine-tuned the 3.3B model on a 4090? When I finetune, it crashes with:

[2023-08-02 21:46:11,717 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-08-02 21:46:13,824 INFO] src: 1078 new tokens
[2023-08-02 21:46:18,171 INFO] tgt: 1078 new tokens
Traceback (most recent call last):
  File "/workspace/OpenNMT-py/train.py", line 6, in <module>
    main()
  File "/workspace/OpenNMT-py/onmt/bin/train.py", line 67, in main
    train(opt)
  File "/workspace/OpenNMT-py/onmt/bin/train.py", line 52, in train
    train_process(opt, device_id=0)
  File "/workspace/OpenNMT-py/onmt/train_single.py", line 165, in main
    model = build_model(model_opt, opt, vocabs, checkpoint)
  File "/workspace/OpenNMT-py/onmt/model_builder.py", line 412, in build_model
    model.load_state_dict(
  File "/workspace/OpenNMT-py/onmt/models/model.py", line 142, in load_state_dict
    raise ValueError(
ValueError: Extra keys in model state_dict do not match the model config dict_keys(['encoder.embeddings.make_embedding.pe.pe', 'encoder.transformer.0.self_attn.linear_keys.bias', 'encoder.transformer.0.self_attn.linear_values.bias', 'encoder.transformer.0.self_attn.linear_query.bias', 'encoder.transformer.0.self_attn.final_linear.bias', 'encoder.transformer.0.feed_forward.w_1.bias', 'encoder.transformer.0.feed_forward.w_2.bias', 'encoder.transformer.1.self_attn.linear_keys.bias', 'encoder.transformer.1.self_attn.linear_values.bias', 'encoder.transformer.1.self_attn.linear_query.bias', 'encoder.transformer.1.self_attn.final_linear.bias', 'encoder.transformer.1.feed_forward.w_1.bias', 'encoder.transformer.1.feed_forward.w_2.bias', 'encoder.transformer.2.self_attn.linear_keys.bias', 'encoder.transformer.2.self_attn.linear_values.bias', 'encoder.transformer.2.self_attn.linear_query.bias', 'encoder.transformer.2.self_attn.final_linear.bias', 'encoder.transformer.2.feed_forward.w_1.bias', 'encoder.transformer.2.feed_forward.w_2.bias', 'encoder.transformer.3.self_attn.linear_keys.bias', 'encoder.transformer.3.self_attn.linear_values.bias', 'encoder.transformer.3.self_attn.linear_query.bias', 'encoder.transformer.3.self_attn.final_linear.bias', 'encoder.transformer.3.feed_forward.w_1.bias', 'encoder.transformer.3.feed_forward.w_2.bias', 'encoder.transformer.4.self_attn.linear_keys.bias', 'encoder.transformer.4.self_attn.linear_values.bias', 'encoder.transformer.4.self_attn.linear_query.bias', 'encoder.transformer.4.self_attn.final_linear.bias', 'encoder.transformer.4.feed_forward.w_1.bias', 'encoder.transformer.4.feed_forward.w_2.bias', 'encoder.transformer.5.self_attn.linear_keys.bias', 'encoder.transformer.5.self_attn.linear_values.bias', 'encoder.transformer.5.self_attn.linear_query.bias', 'encoder.transformer.5.self_attn.final_linear.bias', 'encoder.transformer.5.feed_forward.w_1.bias', 'encoder.transformer.5.feed_forward.w_2.bias', 'encoder.transformer.6.self_attn.linear_keys.bias', 'encoder.transformer.6.self_attn.linear_values.bias', 'encoder.transformer.6.self_attn.linear_query.bias', 'encoder.transformer.6.self_attn.final_linear.bias', 'encoder.transformer.6.feed_forward.w_1.bias', 'encoder.transformer.6.feed_forward.w_2.bias', 'encoder.transformer.7.self_attn.linear_keys.bias', 'encoder.transformer.7.self_attn.linear_values.bias', 'encoder.transformer.7.self_attn.linear_query.bias', 'encoder.transformer.7.self_attn.final_linear.bias', 'encoder.transformer.7.feed_forward.w_1.bias', 'encoder.transformer.7.feed_forward.w_2.bias', 'encoder.transformer.8.self_attn.linear_keys.bias', 'encoder.transformer.8.self_attn.linear_values.bias', 'encoder.transformer.8.self_attn.linear_query.bias', 'encoder.transformer.8.self_attn.final_linear.bias', 'encoder.transformer.8.feed_forward.w_1.bias', 'encoder.transformer.8.feed_forward.w_2.bias', 'encoder.transformer.9.self_attn.linear_keys.bias', 'encoder.transformer.9.self_attn.linear_values.bias', 'encoder.transformer.9.self_attn.linear_query.bias', 'encoder.transformer.9.self_attn.final_linear.bias', 'encoder.transformer.9.feed_forward.w_1.bias', 'encoder.transformer.9.feed_forward.w_2.bias', 'encoder.transformer.10.self_attn.linear_keys.bias', 'encoder.transformer.10.self_attn.linear_values.bias', 'encoder.transformer.10.self_attn.linear_query.bias', 'encoder.transformer.10.self_attn.final_linear.bias', 'encoder.transformer.10.feed_forward.w_1.bias', 'encoder.transformer.10.feed_forward.w_2.bias', 'encoder.transformer.11.self_attn.linear_keys.bias', 
'encoder.transformer.11.self_attn.linear_values.bias', 'encoder.transformer.11.self_attn.linear_query.bias', 'encoder.transformer.11.self_attn.final_linear.bias', 'encoder.transformer.11.feed_forward.w_1.bias', 'encoder.transformer.11.feed_forward.w_2.bias', 'encoder.transformer.12.self_attn.linear_keys.bias', 'encoder.transformer.12.self_attn.linear_values.bias', 'encoder.transformer.12.self_attn.linear_query.bias', 'encoder.transformer.12.self_attn.final_linear.bias', 'encoder.transformer.12.feed_forward.w_1.bias', 'encoder.transformer.12.feed_forward.w_2.bias', 'encoder.transformer.13.self_attn.linear_keys.bias', 'encoder.transformer.13.self_attn.linear_values.bias', 'encoder.transformer.13.self_attn.linear_query.bias', 'encoder.transformer.13.self_attn.final_linear.bias', 'encoder.transformer.13.feed_forward.w_1.bias', 'encoder.transformer.13.feed_forward.w_2.bias', 'encoder.transformer.14.self_attn.linear_keys.bias', 'encoder.transformer.14.self_attn.linear_values.bias', 'encoder.transformer.14.self_attn.linear_query.bias', 'encoder.transformer.14.self_attn.final_linear.bias', 'encoder.transformer.14.feed_forward.w_1.bias', 'encoder.transformer.14.feed_forward.w_2.bias', 'encoder.transformer.15.self_attn.linear_keys.bias', 'encoder.transformer.15.self_attn.linear_values.bias', 'encoder.transformer.15.self_attn.linear_query.bias', 'encoder.transformer.15.self_attn.final_linear.bias', 'encoder.transformer.15.feed_forward.w_1.bias', 'encoder.transformer.15.feed_forward.w_2.bias', 'encoder.transformer.16.self_attn.linear_keys.bias', 'encoder.transformer.16.self_attn.linear_values.bias', 'encoder.transformer.16.self_attn.linear_query.bias', 'encoder.transformer.16.self_attn.final_linear.bias', 'encoder.transformer.16.feed_forward.w_1.bias', 'encoder.transformer.16.feed_forward.w_2.bias', 'encoder.transformer.17.self_attn.linear_keys.bias', 'encoder.transformer.17.self_attn.linear_values.bias', 'encoder.transformer.17.self_attn.linear_query.bias', 'encoder.transformer.17.self_attn.final_linear.bias', 'encoder.transformer.17.feed_forward.w_1.bias', 'encoder.transformer.17.feed_forward.w_2.bias', 'encoder.transformer.18.self_attn.linear_keys.bias', 'encoder.transformer.18.self_attn.linear_values.bias', 'encoder.transformer.18.self_attn.linear_query.bias', 'encoder.transformer.18.self_attn.final_linear.bias', 'encoder.transformer.18.feed_forward.w_1.bias', 'encoder.transformer.18.feed_forward.w_2.bias', 'encoder.transformer.19.self_attn.linear_keys.bias', 'encoder.transformer.19.self_attn.linear_values.bias', 'encoder.transformer.19.self_attn.linear_query.bias', 'encoder.transformer.19.self_attn.final_linear.bias', 'encoder.transformer.19.feed_forward.w_1.bias', 'encoder.transformer.19.feed_forward.w_2.bias', 'encoder.transformer.20.self_attn.linear_keys.bias', 'encoder.transformer.20.self_attn.linear_values.bias', 'encoder.transformer.20.self_attn.linear_query.bias', 'encoder.transformer.20.self_attn.final_linear.bias', 'encoder.transformer.20.feed_forward.w_1.bias', 'encoder.transformer.20.feed_forward.w_2.bias', 'encoder.transformer.21.self_attn.linear_keys.bias', 'encoder.transformer.21.self_attn.linear_values.bias', 'encoder.transformer.21.self_attn.linear_query.bias', 'encoder.transformer.21.self_attn.final_linear.bias', 'encoder.transformer.21.feed_forward.w_1.bias', 'encoder.transformer.21.feed_forward.w_2.bias', 'encoder.transformer.22.self_attn.linear_keys.bias', 'encoder.transformer.22.self_attn.linear_values.bias', 'encoder.transformer.22.self_attn.linear_query.bias', 
'encoder.transformer.22.self_attn.final_linear.bias', 'encoder.transformer.22.feed_forward.w_1.bias', 'encoder.transformer.22.feed_forward.w_2.bias', 'encoder.transformer.23.self_attn.linear_keys.bias', 'encoder.transformer.23.self_attn.linear_values.bias', 'encoder.transformer.23.self_attn.linear_query.bias', 'encoder.transformer.23.self_attn.final_linear.bias', 'encoder.transformer.23.feed_forward.w_1.bias', 'encoder.transformer.23.feed_forward.w_2.bias', 'decoder.embeddings.make_embedding.pe.pe', 'decoder.transformer_layers.0.self_attn.linear_keys.bias', 'decoder.transformer_layers.0.self_attn.linear_values.bias', 'decoder.transformer_layers.0.self_attn.linear_query.bias', 'decoder.transformer_layers.0.self_attn.final_linear.bias', 'decoder.transformer_layers.0.context_attn.linear_keys.bias', 'decoder.transformer_layers.0.context_attn.linear_values.bias', 'decoder.transformer_layers.0.context_attn.linear_query.bias', 'decoder.transformer_layers.0.context_attn.final_linear.bias', 'decoder.transformer_layers.0.feed_forward.w_1.bias', 'decoder.transformer_layers.0.feed_forward.w_2.bias', 'decoder.transformer_layers.1.self_attn.linear_keys.bias', 'decoder.transformer_layers.1.self_attn.linear_values.bias', 'decoder.transformer_layers.1.self_attn.linear_query.bias', 'decoder.transformer_layers.1.self_attn.final_linear.bias', 'decoder.transformer_layers.1.context_attn.linear_keys.bias', 'decoder.transformer_layers.1.context_attn.linear_values.bias', 'decoder.transformer_layers.1.context_attn.linear_query.bias', 'decoder.transformer_layers.1.context_attn.final_linear.bias', 'decoder.transformer_layers.1.feed_forward.w_1.bias', 'decoder.transformer_layers.1.feed_forward.w_2.bias', 'decoder.transformer_layers.2.self_attn.linear_keys.bias', 'decoder.transformer_layers.2.self_attn.linear_values.bias', 'decoder.transformer_layers.2.self_attn.linear_query.bias', 'decoder.transformer_layers.2.self_attn.final_linear.bias', 'decoder.transformer_layers.2.context_attn.linear_keys.bias', 'decoder.transformer_layers.2.context_attn.linear_values.bias', 'decoder.transformer_layers.2.context_attn.linear_query.bias', 'decoder.transformer_layers.2.context_attn.final_linear.bias', 'decoder.transformer_layers.2.feed_forward.w_1.bias', 'decoder.transformer_layers.2.feed_forward.w_2.bias', 'decoder.transformer_layers.3.self_attn.linear_keys.bias', 'decoder.transformer_layers.3.self_attn.linear_values.bias', 'decoder.transformer_layers.3.self_attn.linear_query.bias', 'decoder.transformer_layers.3.self_attn.final_linear.bias', 'decoder.transformer_layers.3.context_attn.linear_keys.bias', 'decoder.transformer_layers.3.context_attn.linear_values.bias', 'decoder.transformer_layers.3.context_attn.linear_query.bias', 'decoder.transformer_layers.3.context_attn.final_linear.bias', 'decoder.transformer_layers.3.feed_forward.w_1.bias', 'decoder.transformer_layers.3.feed_forward.w_2.bias', 'decoder.transformer_layers.4.self_attn.linear_keys.bias', 'decoder.transformer_layers.4.self_attn.linear_values.bias', 'decoder.transformer_layers.4.self_attn.linear_query.bias', 'decoder.transformer_layers.4.self_attn.final_linear.bias', 'decoder.transformer_layers.4.context_attn.linear_keys.bias', 'decoder.transformer_layers.4.context_attn.linear_values.bias', 'decoder.transformer_layers.4.context_attn.linear_query.bias', 'decoder.transformer_layers.4.context_attn.final_linear.bias', 'decoder.transformer_layers.4.feed_forward.w_1.bias', 'decoder.transformer_layers.4.feed_forward.w_2.bias', 
'decoder.transformer_layers.5.self_attn.linear_keys.bias', 'decoder.transformer_layers.5.self_attn.linear_values.bias', 'decoder.transformer_layers.5.self_attn.linear_query.bias', 'decoder.transformer_layers.5.self_attn.final_linear.bias', 'decoder.transformer_layers.5.context_attn.linear_keys.bias', 'decoder.transformer_layers.5.context_attn.linear_values.bias', 'decoder.transformer_layers.5.context_attn.linear_query.bias', 'decoder.transformer_layers.5.context_attn.final_linear.bias', 'decoder.transformer_layers.5.feed_forward.w_1.bias', 'decoder.transformer_layers.5.feed_forward.w_2.bias', 'decoder.transformer_layers.6.self_attn.linear_keys.bias', 'decoder.transformer_layers.6.self_attn.linear_values.bias', 'decoder.transformer_layers.6.self_attn.linear_query.bias', 'decoder.transformer_layers.6.self_attn.final_linear.bias', 'decoder.transformer_layers.6.context_attn.linear_keys.bias', 'decoder.transformer_layers.6.context_attn.linear_values.bias', 'decoder.transformer_layers.6.context_attn.linear_query.bias', 'decoder.transformer_layers.6.context_attn.final_linear.bias', 'decoder.transformer_layers.6.feed_forward.w_1.bias', 'decoder.transformer_layers.6.feed_forward.w_2.bias', 'decoder.transformer_layers.7.self_attn.linear_keys.bias', 'decoder.transformer_layers.7.self_attn.linear_values.bias', 'decoder.transformer_layers.7.self_attn.linear_query.bias', 'decoder.transformer_layers.7.self_attn.final_linear.bias', 'decoder.transformer_layers.7.context_attn.linear_keys.bias', 'decoder.transformer_layers.7.context_attn.linear_values.bias', 'decoder.transformer_layers.7.context_attn.linear_query.bias', 'decoder.transformer_layers.7.context_attn.final_linear.bias', 'decoder.transformer_layers.7.feed_forward.w_1.bias', 'decoder.transformer_layers.7.feed_forward.w_2.bias', 'decoder.transformer_layers.8.self_attn.linear_keys.bias', 'decoder.transformer_layers.8.self_attn.linear_values.bias', 'decoder.transformer_layers.8.self_attn.linear_query.bias', 'decoder.transformer_layers.8.self_attn.final_linear.bias', 'decoder.transformer_layers.8.context_attn.linear_keys.bias', 'decoder.transformer_layers.8.context_attn.linear_values.bias', 'decoder.transformer_layers.8.context_attn.linear_query.bias', 'decoder.transformer_layers.8.context_attn.final_linear.bias', 'decoder.transformer_layers.8.feed_forward.w_1.bias', 'decoder.transformer_layers.8.feed_forward.w_2.bias', 'decoder.transformer_layers.9.self_attn.linear_keys.bias', 'decoder.transformer_layers.9.self_attn.linear_values.bias', 'decoder.transformer_layers.9.self_attn.linear_query.bias', 'decoder.transformer_layers.9.self_attn.final_linear.bias', 'decoder.transformer_layers.9.context_attn.linear_keys.bias', 'decoder.transformer_layers.9.context_attn.linear_values.bias', 'decoder.transformer_layers.9.context_attn.linear_query.bias', 'decoder.transformer_layers.9.context_attn.final_linear.bias', 'decoder.transformer_layers.9.feed_forward.w_1.bias', 'decoder.transformer_layers.9.feed_forward.w_2.bias', 'decoder.transformer_layers.10.self_attn.linear_keys.bias', 'decoder.transformer_layers.10.self_attn.linear_values.bias', 'decoder.transformer_layers.10.self_attn.linear_query.bias', 'decoder.transformer_layers.10.self_attn.final_linear.bias', 'decoder.transformer_layers.10.context_attn.linear_keys.bias', 'decoder.transformer_layers.10.context_attn.linear_values.bias', 'decoder.transformer_layers.10.context_attn.linear_query.bias', 'decoder.transformer_layers.10.context_attn.final_linear.bias', 'decoder.transformer_layers.10.feed_forward.w_1.bias', 
'decoder.transformer_layers.10.feed_forward.w_2.bias', 'decoder.transformer_layers.11.self_attn.linear_keys.bias', 'decoder.transformer_layers.11.self_attn.linear_values.bias', 'decoder.transformer_layers.11.self_attn.linear_query.bias', 'decoder.transformer_layers.11.self_attn.final_linear.bias', 'decoder.transformer_layers.11.context_attn.linear_keys.bias', 'decoder.transformer_layers.11.context_attn.linear_values.bias', 'decoder.transformer_layers.11.context_attn.linear_query.bias', 'decoder.transformer_layers.11.context_attn.final_linear.bias', 'decoder.transformer_layers.11.feed_forward.w_1.bias', 'decoder.transformer_layers.11.feed_forward.w_2.bias', 'decoder.transformer_layers.12.self_attn.linear_keys.bias', 'decoder.transformer_layers.12.self_attn.linear_values.bias', 'decoder.transformer_layers.12.self_attn.linear_query.bias', 'decoder.transformer_layers.12.self_attn.final_linear.bias', 'decoder.transformer_layers.12.context_attn.linear_keys.bias', 'decoder.transformer_layers.12.context_attn.linear_values.bias', 'decoder.transformer_layers.12.context_attn.linear_query.bias', 'decoder.transformer_layers.12.context_attn.final_linear.bias', 'decoder.transformer_layers.12.feed_forward.w_1.bias', 'decoder.transformer_layers.12.feed_forward.w_2.bias', 'decoder.transformer_layers.13.self_attn.linear_keys.bias', 'decoder.transformer_layers.13.self_attn.linear_values.bias', 'decoder.transformer_layers.13.self_attn.linear_query.bias', 'decoder.transformer_layers.13.self_attn.final_linear.bias', 'decoder.transformer_layers.13.context_attn.linear_keys.bias', 'decoder.transformer_layers.13.context_attn.linear_values.bias', 'decoder.transformer_layers.13.context_attn.linear_query.bias', 'decoder.transformer_layers.13.context_attn.final_linear.bias', 'decoder.transformer_layers.13.feed_forward.w_1.bias', 'decoder.transformer_layers.13.feed_forward.w_2.bias', 'decoder.transformer_layers.14.self_attn.linear_keys.bias', 'decoder.transformer_layers.14.self_attn.linear_values.bias', 'decoder.transformer_layers.14.self_attn.linear_query.bias', 'decoder.transformer_layers.14.self_attn.final_linear.bias', 'decoder.transformer_layers.14.context_attn.linear_keys.bias', 'decoder.transformer_layers.14.context_attn.linear_values.bias', 'decoder.transformer_layers.14.context_attn.linear_query.bias', 'decoder.transformer_layers.14.context_attn.final_linear.bias', 'decoder.transformer_layers.14.feed_forward.w_1.bias', 'decoder.transformer_layers.14.feed_forward.w_2.bias', 'decoder.transformer_layers.15.self_attn.linear_keys.bias', 'decoder.transformer_layers.15.self_attn.linear_values.bias', 'decoder.transformer_layers.15.self_attn.linear_query.bias', 'decoder.transformer_layers.15.self_attn.final_linear.bias', 'decoder.transformer_layers.15.context_attn.linear_keys.bias', 'decoder.transformer_layers.15.context_attn.linear_values.bias', 'decoder.transformer_layers.15.context_attn.linear_query.bias', 'decoder.transformer_layers.15.context_attn.final_linear.bias', 'decoder.transformer_layers.15.feed_forward.w_1.bias', 'decoder.transformer_layers.15.feed_forward.w_2.bias', 'decoder.transformer_layers.16.self_attn.linear_keys.bias', 'decoder.transformer_layers.16.self_attn.linear_values.bias', 'decoder.transformer_layers.16.self_attn.linear_query.bias', 'decoder.transformer_layers.16.self_attn.final_linear.bias', 'decoder.transformer_layers.16.context_attn.linear_keys.bias', 'decoder.transformer_layers.16.context_attn.linear_values.bias', 'decoder.transformer_layers.16.context_attn.linear_query.bias', 
'decoder.transformer_layers.16.context_attn.final_linear.bias', 'decoder.transformer_layers.16.feed_forward.w_1.bias', 'decoder.transformer_layers.16.feed_forward.w_2.bias', 'decoder.transformer_layers.17.self_attn.linear_keys.bias', 'decoder.transformer_layers.17.self_attn.linear_values.bias', 'decoder.transformer_layers.17.self_attn.linear_query.bias', 'decoder.transformer_layers.17.self_attn.final_linear.bias', 'decoder.transformer_layers.17.context_attn.linear_keys.bias', 'decoder.transformer_layers.17.context_attn.linear_values.bias', 'decoder.transformer_layers.17.context_attn.linear_query.bias', 'decoder.transformer_layers.17.context_attn.final_linear.bias', 'decoder.transformer_layers.17.feed_forward.w_1.bias', 'decoder.transformer_layers.17.feed_forward.w_2.bias', 'decoder.transformer_layers.18.self_attn.linear_keys.bias', 'decoder.transformer_layers.18.self_attn.linear_values.bias', 'decoder.transformer_layers.18.self_attn.linear_query.bias', 'decoder.transformer_layers.18.self_attn.final_linear.bias', 'decoder.transformer_layers.18.context_attn.linear_keys.bias', 'decoder.transformer_layers.18.context_attn.linear_values.bias', 'decoder.transformer_layers.18.context_attn.linear_query.bias', 'decoder.transformer_layers.18.context_attn.final_linear.bias', 'decoder.transformer_layers.18.feed_forward.w_1.bias', 'decoder.transformer_layers.18.feed_forward.w_2.bias', 'decoder.transformer_layers.19.self_attn.linear_keys.bias', 'decoder.transformer_layers.19.self_attn.linear_values.bias', 'decoder.transformer_layers.19.self_attn.linear_query.bias', 'decoder.transformer_layers.19.self_attn.final_linear.bias', 'decoder.transformer_layers.19.context_attn.linear_keys.bias', 'decoder.transformer_layers.19.context_attn.linear_values.bias', 'decoder.transformer_layers.19.context_attn.linear_query.bias', 'decoder.transformer_layers.19.context_attn.final_linear.bias', 'decoder.transformer_layers.19.feed_forward.w_1.bias', 'decoder.transformer_layers.19.feed_forward.w_2.bias', 'decoder.transformer_layers.20.self_attn.linear_keys.bias', 'decoder.transformer_layers.20.self_attn.linear_values.bias', 'decoder.transformer_layers.20.self_attn.linear_query.bias', 'decoder.transformer_layers.20.self_attn.final_linear.bias', 'decoder.transformer_layers.20.context_attn.linear_keys.bias', 'decoder.transformer_layers.20.context_attn.linear_values.bias', 'decoder.transformer_layers.20.context_attn.linear_query.bias', 'decoder.transformer_layers.20.context_attn.final_linear.bias', 'decoder.transformer_layers.20.feed_forward.w_1.bias', 'decoder.transformer_layers.20.feed_forward.w_2.bias', 'decoder.transformer_layers.21.self_attn.linear_keys.bias', 'decoder.transformer_layers.21.self_attn.linear_values.bias', 'decoder.transformer_layers.21.self_attn.linear_query.bias', 'decoder.transformer_layers.21.self_attn.final_linear.bias', 'decoder.transformer_layers.21.context_attn.linear_keys.bias', 'decoder.transformer_layers.21.context_attn.linear_values.bias', 'decoder.transformer_layers.21.context_attn.linear_query.bias', 'decoder.transformer_layers.21.context_attn.final_linear.bias', 'decoder.transformer_layers.21.feed_forward.w_1.bias', 'decoder.transformer_layers.21.feed_forward.w_2.bias', 'decoder.transformer_layers.22.self_attn.linear_keys.bias', 'decoder.transformer_layers.22.self_attn.linear_values.bias', 'decoder.transformer_layers.22.self_attn.linear_query.bias', 'decoder.transformer_layers.22.self_attn.final_linear.bias', 'decoder.transformer_layers.22.context_attn.linear_keys.bias', 
'decoder.transformer_layers.22.context_attn.linear_values.bias', 'decoder.transformer_layers.22.context_attn.linear_query.bias', 'decoder.transformer_layers.22.context_attn.final_linear.bias', 'decoder.transformer_layers.22.feed_forward.w_1.bias', 'decoder.transformer_layers.22.feed_forward.w_2.bias', 'decoder.transformer_layers.23.self_attn.linear_keys.bias', 'decoder.transformer_layers.23.self_attn.linear_values.bias', 'decoder.transformer_layers.23.self_attn.linear_query.bias', 'decoder.transformer_layers.23.self_attn.final_linear.bias', 'decoder.transformer_layers.23.context_attn.linear_keys.bias', 'decoder.transformer_layers.23.context_attn.linear_values.bias', 'decoder.transformer_layers.23.context_attn.linear_query.bias', 'decoder.transformer_layers.23.context_attn.final_linear.bias', 'decoder.transformer_layers.23.feed_forward.w_1.bias', 'decoder.transformer_layers.23.feed_forward.w_2.bias'])

my train config is:

share_vocab: true
src_vocab: "./nllb-200/dictionary2.txt"
src_words_min_frequency: 1
src_vocab_size: 257284
tgt_vocab: "./nllb-200/dictionary2.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 257284
vocab_size_multiple: 1
decoder_start_token: '</s>'
#### Subword
src_subword_model: "./nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "./nllb-200/flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    corpus_1:
        path_src: "./nllb-200/dataset.tl"
        path_tgt: "./nllb-200/dataset.zh"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "tgl_Latn"
        tgt_prefix: "zho_Hans"
        src_suffix: "</s>"
        tgt_suffix: ""
update_vocab: true
train_from: "./nllb-200/nllb-200-3.3B-onmt.pt"
reset_optim: all
save_data: "nllb-200"
save_model: "./nllb-200/nllb-200-3.3B-onmt.pt"
log_file: "./nllb-200/nllb-200-3.3B-onmt.log"
keep_checkpoint: 100
save_checkpoint_steps: 4000
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 4000
valid_steps: 100
# Batching
bucket_size: 262144
num_workers: 4
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 512
valid_batch_size: 384
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]
# Optimization
model_dtype: "fp16"
optim: "fusedadam"
learning_rate: 30
warmup_steps: 100
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 2048
word_vec_size: 2048
transformer_ff: 8192
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'

#LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 4
lora_dropout: 0.0
lora_alpha: 1
lora_embedding: false

Maybe you have to add
add_ffnbias: true
to your model config.


I will give it a try.

You also need: add_qkvbias: true

After adding add_ffnbias: true and add_qkvbias: true, the crash is fixed, but it still OOMs. I have changed the optimizer to sgd and adam, and even changed batch_size to 1, but it still OOMs.

As I remember, LoRa won’t work with SGD. Correct me if I’m wrong.
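
For readers following along, here is a generic sketch of what a LoRa-wrapped linear layer does: the pretrained weight stays frozen and only the low-rank A/B matrices receive gradients. This is an illustration of the general technique, not OpenNMT-py's implementation:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper: y = W x + (alpha / r) * B(A(x)), with W frozen."""
    def __init__(self, base: nn.Linear, r: int = 2, alpha: float = 1.0, dropout: float = 0.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weight is frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)   # adapter starts as a no-op
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(self.dropout(x)))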