Adding the Ge’ez (Ethiopic script) language to NLLB

Finetuning and Curating NLLB-200 with OpenNMT-py
I followed this tutorial to add the Ge’ez language, which is not originally in the model, to NLLB.

I first trained a SentencePiece model.

import sentencepiece as spm

spm.SentencePieceTrainer.train(input='gmmt/shared_train.txt',
                               model_prefix='shared_gmmt_spm',
                               vocab_size=8000, model_type='bpe')

I then built the vocab using the trained spm model and the OpenNMT build-vocab tool, so that the vocab is in OpenNMT format (maybe not important to do this 🙂). Here is the config I used to build the vocab.

share_vocab: true
src_vocab: "gmmt/dictionary1.txt"
src_words_min_frequency: 1
src_vocab_size: 256232
tgt_vocab: "gmmt/dictionary1.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 8000
vocab_size_multiple: 1
decoder_start_token: '</s>'
#### Subword
src_subword_model: "shared_gmmt_spm.model"
tgt_subword_model: "shared_gmmt_spm.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Corpus opts:
data:
    en-gez-gmmt:
        path_src: "gmmt/shared_train.txt"
        path_tgt: "gmmt/shared_train.txt"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "</s> eng_Latn"
        tgt_prefix: "gez_Ethi"
        src_suffix: ""
        tgt_suffix: ""

update_vocab: true
save_data: "gmmt"
overwrite: true
onmt_build_vocab -config en_gez.yaml  -n_sample -1

I then added the tokens from the new dictionary to the NLLB dictionary like this…

# Read the token column from both dictionaries.
with open('gmmt/dictionary1.txt', 'r') as file:
    gmmt_tokens = [line.strip().split()[0] for line in file]

with open('nllb-200/dictionary.txt', 'r') as file:
    nllb_tokens = [line.strip().split()[0] for line in file]

# Tokens present in the new vocab but not in the NLLB dictionary.
added_tokens = set(gmmt_tokens).difference(set(nllb_tokens))

# Insert the new tokens just before the last three entries of the NLLB dictionary.
newtokens = nllb_tokens[:-3] + list(added_tokens) + nllb_tokens[-3:]

with open('newdictionary.txt', 'w') as file:
    for line in newtokens:
        file.write(f"{line} 1 \n")

Here I added the new tokens to the NLLB spm model; the script is the same as the one in Finetuning and Curating NLLB-200 with OpenNMT-py. I just added the new language token (gez_Ethi) to the tok_exclusion list.

from unicodedata2 import *
from collections import Counter
from tqdm import tqdm
import sentencepiece as spm
import sentencepiece_model_pb2 as model


tok_exclusion = ['<s>', '<blank>', '</s>', '<unk>', 'gez_Ethi', 'ace_Arab', 'ace_Latn', 'acm_Arab', 'acq_Arab', 'aeb_Arab', 'afr_Latn', 'ajp_Arab', 'aka_Latn', 'amh_Ethi', 'apc_Arab', 'arb_Arab', 'ars_Arab', 'ary_Arab', 'arz_Arab', 'asm_Beng', 'ast_Latn', 'awa_Deva', 'ayr_Latn', 'azb_Arab', 'azj_Latn', 'bak_Cyrl', 'bam_Latn', 'ban_Latn', 'bel_Cyrl', 'bem_Latn', 'ben_Beng', 'bho_Deva', 'bjn_Arab', 'bjn_Latn', 'bod_Tibt', 'bos_Latn', 'bug_Latn', 'bul_Cyrl', 'cat_Latn', 'ceb_Latn', 'ces_Latn', 'cjk_Latn', 'ckb_Arab', 'crh_Latn', 'cym_Latn', 'dan_Latn', 'deu_Latn', 'dik_Latn', 'dyu_Latn', 'dzo_Tibt', 'ell_Grek', 'eng_Latn', 'epo_Latn', 'est_Latn', 'eus_Latn', 'ewe_Latn', 'fao_Latn', 'pes_Arab', 'fij_Latn', 'fin_Latn', 'fon_Latn', 'fra_Latn', 'fur_Latn', 'fuv_Latn', 'gla_Latn', 'gle_Latn', 'glg_Latn', 'grn_Latn', 'guj_Gujr', 'hat_Latn', 'hau_Latn', 'heb_Hebr', 'hin_Deva', 'hne_Deva', 'hrv_Latn', 'hun_Latn', 'hye_Armn', 'ibo_Latn', 'ilo_Latn', 'ind_Latn', 'isl_Latn', 'ita_Latn', 'jav_Latn', 'jpn_Jpan', 'kab_Latn', 'kac_Latn', 'kam_Latn', 'kan_Knda', 'kas_Arab', 'kas_Deva', 'kat_Geor', 'knc_Arab', 'knc_Latn', 'kaz_Cyrl', 'kbp_Latn', 'kea_Latn', 'khm_Khmr', 'kik_Latn', 'kin_Latn', 'kir_Cyrl', 'kmb_Latn', 'kon_Latn', 'kor_Hang', 'kmr_Latn', 'lao_Laoo', 'lvs_Latn', 'lij_Latn', 'lim_Latn', 'lin_Latn', 'lit_Latn', 'lmo_Latn', 'ltg_Latn', 'ltz_Latn', 'lua_Latn', 'lug_Latn', 'luo_Latn', 'lus_Latn', 'mag_Deva', 'mai_Deva', 'mal_Mlym', 'mar_Deva', 'min_Latn', 'mkd_Cyrl', 'plt_Latn', 'mlt_Latn', 'mni_Beng', 'khk_Cyrl', 'mos_Latn', 'mri_Latn', 'zsm_Latn', 'mya_Mymr', 'nld_Latn', 'nno_Latn', 'nob_Latn', 'npi_Deva', 'nso_Latn', 'nus_Latn', 'nya_Latn', 'oci_Latn', 'gaz_Latn', 'ory_Orya', 'pag_Latn', 'pan_Guru', 'pap_Latn', 'pol_Latn', 'por_Latn', 'prs_Arab', 'pbt_Arab', 'quy_Latn', 'ron_Latn', 'run_Latn', 'rus_Cyrl', 'sag_Latn', 'san_Deva', 'sat_Beng', 'scn_Latn', 'shn_Mymr', 'sin_Sinh', 'slk_Latn', 'slv_Latn', 'smo_Latn', 'sna_Latn', 'snd_Arab', 'som_Latn', 'sot_Latn', 'spa_Latn', 'als_Latn', 'srd_Latn', 'srp_Cyrl', 'ssw_Latn', 'sun_Latn', 'swe_Latn', 'swh_Latn', 'szl_Latn', 'tam_Taml', 'tat_Cyrl', 'tel_Telu', 'tgk_Cyrl', 'tgl_Latn', 'tha_Thai', 'tir_Ethi', 'taq_Latn', 'taq_Tfng', 'tpi_Latn', 'tsn_Latn', 'tso_Latn', 'tuk_Latn', 'tum_Latn', 'tur_Latn', 'twi_Latn', 'tzm_Tfng', 'uig_Arab', 'ukr_Cyrl', 'umb_Latn', 'urd_Arab', 'uzn_Latn', 'vec_Latn', 'vie_Latn', 'war_Latn', 'wol_Latn', 'xho_Latn', 'ydd_Hebr', 'yor_Latn', 'yue_Hant', 'zho_Hans', 'zho_Hant', 'zul_Latn', '<pad1>', '<pad2>', '<pad3>', '<inv>']


newdict2 = []
with open('newdictionary.txt', 'r', encoding='utf-8') as f:
    for line in f:
        token = line.strip().split()[0]
        newdict2.append(token)


serializedStr=open('nllb-200/flores200_sacrebleu_tokenizer_spm.model', 'rb').read()
m=model.ModelProto()
m.ParseFromString(serializedStr)
curdict = []
for i in tqdm(range(len(m.pieces) - 1, 2, -1)):
    curdict.append(m.pieces[i].piece)
    if m.pieces[i].piece not in newdict2:
        hex_string = "".join("{:02x}".format(ord(c)) for c in m.pieces[i].piece)
        print("Removing: ", hex_string, " from spm model, not in dict. Index: ", i)
        m.pieces.pop(i)

for tok in tqdm(newdict2):
    if (tok not in curdict) and (tok not in tok_exclusion):
        print("Adding: ", tok, " to spm model")
        newtoken = m.SentencePiece()
        newtoken.piece = tok
        newtoken.score = 0
        m.pieces.append(newtoken)
        
print(len(m.pieces))
        
with open('flores200_sacrebleu_tokenizer_spm2.model', 'wb') as f:
    f.write(m.SerializeToString())
  0%|          | 228/255997 [00:00<09:45, 436.81it/s]
Removing:  85  from spm model, not in dict. Index:  255860
100%|██████████| 255997/255997 [04:58<00:00, 857.77it/s]  
 98%|█████████▊| 256024/260926 [04:43<00:00, 21191.42it/s]
Adding:  ▁weep  to spm model
Adding:  ▁ወለያ  to spm model
Adding:  ልአ  to spm model
Adding:  ▁ወፋ  to spm model
Adding:  ፃረ  to spm model
Adding:  ▁ይቅት  to spm model
Adding:  baal  to spm model
Adding:  ▁ጽድቀ  to spm model
Adding:  ▁calf  to spm model
Adding:  ▁ወአና  to spm model
Adding:  ይቴ  to spm model
Adding:  ▁ሞጻ  to spm model
Adding:  መፃ  to spm model
Adding:  ▁ዐራ  to spm model
Adding:  ኒአ  to spm model
Adding:  ላዕሌ  to spm model
Adding:  ▁ወዲበ  to spm model
Adding:  ዐኒ  to spm model
Adding:  ▁Jezreel  to spm model
Adding:  ሁኒ  to spm model
Adding:  ▁Gilgal  to spm model
Adding:  ሠሥ  to spm model
Adding:  ▁priests  to spm model
Adding:  ▁ዘውስተ  to spm model
Adding:  ጽሖ  to spm model
Adding:  ▁Cursed  to spm model
Adding:  ▁ካህን  to spm model
Adding:  ▁ይጸል  to spm model
Adding:  ይከ  to spm model
Adding:  ፸  to spm model
Adding:  ▁prophesied  to spm model
Adding:  ዐር  to spm model
Adding:  ሞር  to spm model
Adding:  ልፈ  to spm model
Adding:  ▁ገብርከ  to spm model
Adding:  servants  to spm model
Adding:  ▁bullock  to spm model
Adding:  ▁በውእቶን  to spm model
Adding:  ▁ርኢኩ  to spm model
Adding:  ባአ  to spm model
Adding:  ሕየ  to spm model
Adding:  ▁እብል  to spm model
Adding:  ▁ወባ  to spm model
Adding:  ▁hired  to spm model
Adding:  ▁በልቡ  to spm model
Adding:  ▁መልአኮሙ  to spm model
Adding:  ▁ለይእቲ  to spm model
Adding:  aroth  to spm model
Adding:  ፷  to spm model
Adding:  ▁smitten  to spm model
Adding:  ▁ዘመጽአ  to spm model
Adding:  ብስ  to spm model
Adding:  ▁ውእተ  to spm model
Adding:  ንዋ  to spm model
Adding:  ▁ዕረጉ  to spm model
Adding:  ▁haste  to spm model
Adding:  ▁ርስ  to spm model
Adding:  ould  to spm model
Adding:  ▁ወኢምንተ  to spm model
Adding:  ▁ለአምላክ  to spm model
Adding:  ክሉ  to spm model
Adding:  ዓዕ  to spm model
Adding:  ቦሙ  to spm model
Adding:  ▁አህጉር  to spm model
Adding:  ዕለተ  to spm model
Adding:  ▁መታክ  to spm model
Adding:  ▁ወኢነ  to spm model
Adding:  ▁በዕለተ  to spm model
Adding:  ▁ቀሠ  to spm model
Adding:  አነ  to spm model
Adding:  ▁ዕቅ  to spm model
Adding:  ▁ርስተ  to spm model
Adding:  ▁አዋልዲ  to spm model
Adding:  ▁ኀፍረተ  to spm model
Adding:  ▁በቤተ  to spm model
Adding:  ▁ዓመተ  to spm model
Adding:  ▁ገጹ  to spm model
.
.
.

I then trained the new model using LoRa weights and fusedadam as the optimizer. Here is the config.

share_vocab: true
src_vocab: "newdictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 260926
tgt_vocab: "newdictionary.txt"
tgt_words_min_frequency: 1
tgt_vocab_size: 260926
vocab_size_multiple: 1
decoder_start_token: '</s>'

#LoRa
lora_layers: ['linear_values', 'linear_query', 'linear_keys', 'final_linear']
lora_rank: 2
lora_dropout: 0.0
lora_alpha: 1
lora_embedding: false


#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0

# Corpus opts:
data:
    cc-matrix-enzh:
        path_src: "gmmt/en_train.txt"
        path_tgt: "gmmt/gez_train.txt"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: "</s> eng_Latn"
        tgt_prefix: "gez_Ethi"
        src_suffix: "</s>"
        tgt_suffix: ""
update_vocab: true
train_from: "nllb-200/nllb-200-1.3Bdst-onmt.pt.1"
reset_optim: all
save_data: "finetuned"
save_model: "finetuned/gez_nllb"
log_file: "finetuned/finetuned.log"
keep_checkpoint: 50
save_checkpoint_steps: 100
average_decay: 0.0005
seed: 1234
report_every: 10
train_steps: 20000
valid_steps: 100

# Batching
bucket_size: 262144
num_workers: 2
prefetch_factor: 400
world_size: 1
gpu_ranks: [0]
batch_type: "tokens"
batch_size: 256
valid_batch_size: 256
batch_size_multiple: 1
accum_count: [32, 32, 32]
accum_steps: [0, 15000, 30000]

# Optimization
model_dtype: "fp16"
optim: "fusedadam"
learning_rate: 0.1
warmup_steps: 50
decay_method: "noam"
adam_beta2: 0.98
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
override_opts: true
encoder_type: transformer
decoder_type: transformer
enc_layers: 24
dec_layers: 24
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 8192
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: 'SinusoidalConcat'
python3 ../OpenNMT-py/train.py --config finetuned/finetune.yaml
[2023-05-19 12:57:38,706 INFO] Loading checkpoint from nllb-200/nllb-200-1.3Bdst-onmt.pt.1
[2023-05-19 12:57:40,337 WARNING] configured transforms is different from checkpoint: +{'sentencepiece', 'suffix', 'prefix'}
[2023-05-19 12:57:40,337 INFO] Get suffix for cc-matrix-enzh: {'src': '</s>', 'tgt': ''}
[2023-05-19 12:57:40,337 INFO] Get suffix for src infer: 
[2023-05-19 12:57:40,337 INFO] Get suffix for tgt infer: 
[2023-05-19 12:57:40,337 INFO] Get prefix for cc-matrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-19 12:57:40,337 INFO] Get prefix for src infer: 
[2023-05-19 12:57:40,337 INFO] Get prefix for tgt infer: 
[2023-05-19 12:57:40,337 INFO] Get special vocabs from Transforms: {'src': ['</s>', '</s>', 'eng_Latn'], 'tgt': ['gez_Ethi']}.
[2023-05-19 12:57:40,902 INFO] Updating checkpoint vocabulary with new vocabulary
[2023-05-19 12:57:40,903 INFO] Get suffix for cc-matrix-enzh: {'src': '</s>', 'tgt': ''}
[2023-05-19 12:57:40,904 INFO] Get suffix for src infer: 
[2023-05-19 12:57:40,905 INFO] Get suffix for tgt infer: 
[2023-05-19 12:57:40,906 INFO] Get prefix for cc-matrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-19 12:57:40,908 INFO] Get prefix for src infer: 
[2023-05-19 12:57:40,909 INFO] Get prefix for tgt infer: 
[2023-05-19 12:57:40,911 INFO] Get special vocabs from Transforms: {'src': ['</s>', '</s>', 'eng_Latn'], 'tgt': ['gez_Ethi']}.
[2023-05-19 12:57:41,534 INFO] Over-ride model option set to true - use with care
[2023-05-19 12:57:41,534 INFO] Option: config , value: finetuned/finetune.yaml overiding model: 
[2023-05-19 12:57:41,534 INFO] Option: data , value: {'cc-matrix-enzh': {'path_src': 'gmmt/en_train.txt', 'path_tgt': 'gmmt/gez_train.txt', 'transforms': ['sentencepiece', 'prefix', 'suffix', 'filtertoolong'], 'weight': 10, 'src_prefix': '</s> eng_Latn', 'tgt_prefix': 'gez_Ethi', 'src_suffix': '</s>', 'tgt_suffix': '', 'path_align': None}} overiding model: {}
[2023-05-19 12:57:41,534 INFO] Option: skip_empty_level , value: warning overiding model: silent
[2023-05-19 12:57:41,534 INFO] Option: save_data , value: finetuned overiding model: 
[2023-05-19 12:57:41,534 INFO] Option: src_vocab , value: newdictionary.txt overiding model: 
[2023-05-19 12:57:41,534 INFO] Option: tgt_vocab , value: newdictionary.txt overiding model: 
[2023-05-19 12:57:41,534 INFO] Option: src_vocab_size , value: 260926 overiding model: 256206
[2023-05-19 12:57:41,534 INFO] Option: tgt_vocab_size , value: 260926 overiding model: 256206
[2023-05-19 12:57:41,534 INFO] Option: src_subword_model , value: flores200_sacrebleu_tokenizer_spm2.model overiding model: 
[2023-05-19 12:57:41,534 INFO] Option: tgt_subword_model , value: flores200_sacrebleu_tokenizer_spm2.model overiding model: 
[2023-05-19 12:57:41,535 INFO] Option: src_seq_length , value: 192 overiding model: 150
[2023-05-19 12:57:41,535 INFO] Option: tgt_seq_length , value: 192 overiding model: 150
[2023-05-19 12:57:41,535 INFO] Option: update_vocab , value: True overiding model: False
[2023-05-19 12:57:41,535 INFO] Option: add_qkvbias , value: False overiding model: True
[2023-05-19 12:57:41,535 INFO] Option: save_model , value: finetuned/gez_nllb overiding model: nllb
[2023-05-19 12:57:41,535 INFO] Option: save_checkpoint_steps , value: 100 overiding model: 5000
[2023-05-19 12:57:41,535 INFO] Option: train_from , value: nllb-200/nllb-200-1.3Bdst-onmt.pt.1 overiding model: 
[2023-05-19 12:57:41,535 INFO] Option: reset_optim , value: all overiding model: none
[2023-05-19 12:57:41,535 INFO] Option: num_workers , value: 2 overiding model: 4
[2023-05-19 12:57:41,535 INFO] Option: batch_size , value: 256 overiding model: 8192
[2023-05-19 12:57:41,535 INFO] Option: accum_count , value: [32, 32, 32] overiding model: [4]
[2023-05-19 12:57:41,535 INFO] Option: accum_steps , value: [0, 15000, 30000] overiding model: [0]
[2023-05-19 12:57:41,535 INFO] Option: valid_steps , value: 100 overiding model: 5000
[2023-05-19 12:57:41,535 INFO] Option: valid_batch_size , value: 256 overiding model: 4096
[2023-05-19 12:57:41,535 INFO] Option: train_steps , value: 20000 overiding model: 100000
[2023-05-19 12:57:41,535 INFO] Option: optim , value: fusedadam overiding model: 
[2023-05-19 12:57:41,535 INFO] Option: dropout , value: [0.1, 0.1, 0.1] overiding model: [0.1]
[2023-05-19 12:57:41,536 INFO] Option: attention_dropout , value: [0.1, 0.1, 0.1] overiding model: [0.1]
[2023-05-19 12:57:41,536 INFO] Option: dropout_steps , value: [0, 15000, 30000] overiding model: [0]
[2023-05-19 12:57:41,536 INFO] Option: average_decay , value: 0.0005 overiding model: 0.0
[2023-05-19 12:57:41,536 INFO] Option: learning_rate , value: 0.1 overiding model: 5e-05
[2023-05-19 12:57:41,536 INFO] Option: decay_method , value: noam overiding model: none
[2023-05-19 12:57:41,536 INFO] Option: warmup_steps , value: 50 overiding model: 4000
[2023-05-19 12:57:41,536 INFO] Option: log_file , value: finetuned/finetuned.log overiding model: 
[2023-05-19 12:57:41,536 INFO] Option: report_every , value: 10 overiding model: 100
[2023-05-19 12:57:41,536 INFO] Option: _all_transform , value: {'sentencepiece', 'filtertoolong', 'suffix', 'prefix'} overiding model: {'filtertoolong'}
[2023-05-19 12:57:41,536 INFO] Building model...
[2023-05-19 12:57:51,128 INFO] Adding LoRa layers for linear_values
[2023-05-19 12:57:51,924 INFO] Adding LoRa layers for linear_query
[2023-05-19 12:57:52,723 INFO] Adding LoRa layers for linear_keys
[2023-05-19 12:57:53,521 INFO] Adding LoRa layers for final_linear
[2023-05-19 12:58:03,997 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-05-19 12:58:04,384 INFO] src: 260921 new tokens
[2023-05-19 12:58:04,830 INFO] tgt: 260921 new tokens
[2023-05-19 12:58:07,084 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(260926, 1024, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-23): 24 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=8192, bias=True)
          (w_2): Linear(in_features=8192, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(260926, 1024, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
    (transformer_layers): ModuleList(
      (0-23): 24 x TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=8192, bias=True)
          (w_2): Linear(in_features=8192, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=False)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=False)
        )
        (layer_norm_2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (generator): Linear(in_features=1024, out_features=260926, bias=True)
)
[2023-05-19 12:58:07,092 INFO] encoder: 771219456
[2023-05-19 12:58:07,092 INFO] decoder: 605397822
[2023-05-19 12:58:07,092 INFO] * number of parameters: 1376617278
[2023-05-19 12:58:07,092 INFO]  * src vocab size = 260926
[2023-05-19 12:58:07,092 INFO]  * tgt vocab size = 260926
[2023-05-19 12:58:07,195 INFO] Get suffix for cc-matrix-enzh: {'src': '</s>', 'tgt': ''}
[2023-05-19 12:58:07,195 INFO] Get suffix for src infer: 
[2023-05-19 12:58:07,195 INFO] Get suffix for tgt infer: 
[2023-05-19 12:58:07,196 INFO] Get prefix for cc-matrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-19 12:58:07,196 INFO] Get prefix for src infer: 
[2023-05-19 12:58:07,196 INFO] Get prefix for tgt infer: 
[2023-05-19 12:58:07,274 INFO] Get suffix for cc-matrix-enzh: {'src': '</s>', 'tgt': ''}
[2023-05-19 12:58:07,274 INFO] Get suffix for src infer: 
[2023-05-19 12:58:07,274 INFO] Get suffix for tgt infer: 
[2023-05-19 12:58:07,274 INFO] Get prefix for cc-matrix-enzh: {'src': '</s> eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-19 12:58:07,274 INFO] Get prefix for src infer: 
[2023-05-19 12:58:07,274 INFO] Get prefix for tgt infer: 
[2023-05-19 12:58:07,316 INFO] Starting training on GPU: [0]
[2023-05-19 12:58:07,316 INFO] Start training loop without validation...
[2023-05-19 12:58:07,316 INFO] Scoring with: TransformPipe()
[2023-05-19 13:00:29,289 INFO] Step 10/20000; acc: 83.8; ppl:  38.9; xent: 3.7; lr: 0.00010; sents:    2130; bsz:  229/ 168/ 7; 517/378 tok/s;    142 sec;
[2023-05-19 13:01:39,009 INFO] Step 20/20000; acc: 86.3; ppl:  29.0; xent: 3.4; lr: 0.00019; sents:    1961; bsz:  230/ 167/ 6; 1055/767 tok/s;    212 sec;
[2023-05-19 13:02:48,279 INFO] Step 30/20000; acc: 89.5; ppl:  18.7; xent: 2.9; lr: 0.00027; sents:    1936; bsz:  228/ 166/ 6; 1056/767 tok/s;    281 sec;
[2023-05-19 13:03:57,596 INFO] Step 40/20000; acc: 91.5; ppl:  12.0; xent: 2.5; lr: 0.00036; sents:    2027; bsz:  230/ 169/ 6; 1063/782 tok/s;    350 sec;
[2023-05-19 13:05:06,485 INFO] Step 50/20000; acc: 92.2; ppl:   9.7; xent: 2.3; lr: 0.00044; sents:    2007; bsz:  229/ 167/ 6; 1064/777 tok/s;    419 sec;
[2023-05-19 13:06:15,215 INFO] Step 60/20000; acc: 92.4; ppl:   8.8; xent: 2.2; lr: 0.00040; sents:    1999; bsz:  231/ 167/ 6; 1075/778 tok/s;    488 sec;

I merged the LoRa weights with the base model in this way and tried to infer using the config below.

python3 ../OpenNMT-py/tools/lora_weights.py --action merge --base_model nllb-200/nllb-200-1.3Bdst-onmt.pt --lora_weights finetuned/gez_nllb_step_20000.pt  --output geez_nllb_finetuned.pt
transforms: [sentencepiece, prefix, suffix]
# nllb-200 specific prefixing and suffixing
src_prefix: "eng_Latn"
tgt_prefix: "fra_Latn"
tgt_file_prefix: true
src_suffix: "</s>"
tgt_suffix: ""


#### Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm2.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm2.model"
src_subword_nbest: 1
src_subword_alpha: 0.0
tgt_subword_nbest: 1
tgt_subword_alpha: 0.0
# Model info
model: "geez_nllb_finetuned_1.pt"
# Inference
max_length: 512
gpu: 0
batch_type: tokens
batch_size: 32
fp16:
beam_size: 5
report_time: true
python3 ../OpenNMT-py/translate.py --config finetuned/geez_nllb_inference.yaml -src en_text.src -output gez_hyp.txt

But it raised the following error.

Traceback (most recent call last):
  File "../OpenNMT-py/translate.py", line 6, in <module>
    main()
  File "/home/aman/Documents/geeztranslation/OpenNMT-py/onmt/bin/translate.py", line 60, in main
    translate(opt)
  File "/home/aman/Documents/geeztranslation/OpenNMT-py/onmt/bin/translate.py", line 23, in translate
    translator = build_translator(opt, logger=logger,
  File "/home/aman/Documents/geeztranslation/OpenNMT-py/onmt/translate/translator.py", line 33, in build_translator
    vocabs, model, model_opt = load_test_model(opt)
  File "/home/aman/Documents/geeztranslation/OpenNMT-py/onmt/model_builder.py", line 171, in load_test_model
    model = build_base_model(model_opt, vocabs, checkpoint)
  File "/home/aman/Documents/geeztranslation/OpenNMT-py/onmt/model_builder.py", line 402, in build_base_model
    model.load_state_dict(checkpoint['model'],
  File "/home/aman/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for NMTModel:
	Missing key(s) in state_dict: "encoder.transformer.0.self_attn.linear_keys.bias", "encoder.transformer.0.self_attn.linear_values.bias", "encoder.transformer.0.self_attn.linear_query.bias", "encoder.transformer.0.self_attn.final_linear.bias", "encoder.transformer.1.self_attn.linear_keys.bias", "encoder.transformer.1.self_attn.linear_values.bias", "encoder.transformer.1.self_attn.linear_query.bias", "encoder.transformer.1.self_attn.final_linear.bias", "encoder.transformer.2.self_attn.linear_keys.bias", "encoder.transformer.2.self_attn.linear_values.bias", "encoder.transformer.2.self_attn.linear_query.bias", "encoder.transformer.2.self_attn.final_linear.bias", "encoder.transformer.3.self_attn.linear_keys.bias", "encoder.transformer.3.self_attn.linear_values.bias", "encoder.transformer.3.self_attn.linear_query.bias", "encoder.transformer.3.self_attn.final_linear.bias", "encoder.transformer.4.self_attn.linear_keys.bias", "encoder.transformer.4.self_attn.linear_values.bias", "encoder.transformer.4.self_attn.linear_query.bias", "encoder.transformer.4.self_attn.final_linear.bias", "encoder.transformer.5.self_attn.linear_keys.bias", "encoder.transformer.5.self_attn.linear_values.bias", "encoder.transformer.5.self_attn.linear_query.bias", "encoder.transformer.5.self_attn.final_linear.bias", "encoder.transformer.6.self_attn.linear_keys.bias", "encoder.transformer.6.self_attn.linear_values.bias", "encoder.transformer.6.self_attn.linear_query.bias", "encoder.transformer.6.self_attn.final_linear.bias", "encoder.transformer.7.self_attn.linear_keys.bias", "encoder.transformer.7.self_attn.linear_values.bias", "encoder.transformer.7.self_attn.linear_query.bias", "encoder.transformer.7.self_attn.final_linear.bias", "encoder.transformer.8.self_attn.linear_keys.bias", "encoder.transformer.8.self_attn.linear_values.bias", "encoder.transformer.8.self_attn.linear_query.bias", "encoder.transformer.8.self_attn.final_linear.bias", "encoder.transformer.9.self_attn.linear_keys.bias", "encoder.transformer.9.self_attn.linear_values.bias", "encoder.transformer.9.self_attn.linear_query.bias", "encoder.transformer.9.self_attn.final_linear.bias", "encoder.transformer.10.self_attn.linear_keys.bias", "encoder.transformer.10.self_attn.linear_values.bias", "encoder.transformer.10.self_attn.linear_query.bias", "encoder.transformer.10.self_attn.final_linear.bias", "encoder.transformer.11.self_attn.linear_keys.bias", "encoder.transformer.11.self_attn.linear_values.bias", "encoder.transformer.11.self_attn.linear_query.bias", "encoder.transformer.11.self_attn.final_linear.bias", "encoder.transformer.12.self_attn.linear_keys.bias", "encoder.transformer.12.self_attn.linear_values.bias", "encoder.transformer.12.self_attn.linear_query.bias", "encoder.transformer.12.self_attn.final_linear.bias", "encoder.transformer.13.self_attn.linear_keys.bias", "encoder.transformer.13.self_attn.linear_values.bias", "encoder.transformer.13.self_attn.linear_query.bias", "encoder.transformer.13.self_attn.final_linear.bias", "encoder.transformer.14.self_attn.linear_keys.bias", "encoder.transformer.14.self_attn.linear_values.bias", "encoder.transformer.14.self_attn.linear_query.bias", "encoder.transformer.14.self_attn.final_linear.bias", "encoder.transformer.15.self_attn.linear_keys.bias", "encoder.transformer.15.self_attn.linear_values.bias", "encoder.transformer.15.self_attn.linear_query.bias", "encoder.transformer.15.self_attn.final_linear.bias", "encoder.transformer.16.self_attn.linear_keys.bias", "encoder.transformer.16.self_attn.linear_values.bias", 
"encoder.transformer.16.self_attn.linear_query.bias", "encoder.transformer.16.self_attn.final_linear.bias", "encoder.transformer.17.self_attn.linear_keys.bias", "encoder.transformer.17.self_attn.linear_values.bias", "encoder.transformer.17.self_attn.linear_query.bias", "encoder.transformer.17.self_attn.final_linear.bias", "encoder.transformer.18.self_attn.linear_keys.bias", "encoder.transformer.18.self_attn.linear_values.bias", "encoder.transformer.18.self_attn.linear_query.bias", "encoder.transformer.18.self_attn.final_linear.bias", "encoder.transformer.19.self_attn.linear_keys.bias", "encoder.transformer.19.self_attn.linear_values.bias", "encoder.transformer.19.self_attn.linear_query.bias", "encoder.transformer.19.self_attn.final_linear.bias", "encoder.transformer.20.self_attn.linear_keys.bias", "encoder.transformer.20.self_attn.linear_values.bias", "encoder.transformer.20.self_attn.linear_query.bias", "encoder.transformer.20.self_attn.final_linear.bias", "encoder.transformer.21.self_attn.linear_keys.bias", "encoder.transformer.21.self_attn.linear_values.bias", "encoder.transformer.21.self_attn.linear_query.bias", "encoder.transformer.21.self_attn.final_linear.bias", "encoder.transformer.22.self_attn.linear_keys.bias", "encoder.transformer.22.self_attn.linear_values.bias", "encoder.transformer.22.self_attn.linear_query.bias", "encoder.transformer.22.self_attn.final_linear.bias"

Based on a discussion here, Finetuning bigger models with LoRa, I fixed it in the following way and the inference ran successfully. I also had to reduce the batch_size to 32 because of an OOM issue.

import torch

# Clear the qkv-bias flag in the checkpoint options so that the model built at
# inference time matches the bias-free weights in the merged checkpoint.
m = torch.load("geez_nllb_finetuned.pt")
m['opt'].add_qkvbias = False
torch.save(m, "geez_nllb_finetuned_1.pt")

But the translation is weird.

 ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ 
 ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇  ⁇ 

Please help…

Before training anything with onmt-py, you need to check whether your sentencepiece model works fine with your data.

Try to tokenize your training data on the command line or in a Python notebook (search the web for this).

Then you need to double-check that your tokenized data (tokens) are valid with respect to the newdictionary.txt file.

This step needs to be validated before anything else.
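
For example, a minimal sketch of such a check (file names assumed from earlier in this thread) would tokenize the Ge’ez training data and report any pieces missing from newdictionary.txt:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='flores200_sacrebleu_tokenizer_spm2.model')

# Token column of the dictionary.
with open('newdictionary.txt', 'r', encoding='utf-8') as f:
    vocab = {line.split()[0] for line in f if line.strip()}

# Tokenize the training data and collect pieces that are not in the dictionary.
missing = set()
with open('gmmt/gez_train.txt', 'r', encoding='utf-8') as f:
    for line in f:
        for piece in sp.encode_as_pieces(line.strip()):
            if piece not in vocab:
                missing.add(piece)

print(len(missing), "spm pieces are not in newdictionary.txt")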

I actually checked that the spm model is working.

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='flores200_sacrebleu_tokenizer_spm2.model')
sp.encode_as_pieces('ወይቤለኒ መልአከ እግዚአብሔር በሕልም ያዕቆብ ያዕቆብ ወእቤ ነየ አነ ምንትኑ ውእቱ')

Here is the output. It looks like it is working.

['▁ወይቤለኒ',
 '▁መልአከ',
 '▁እግዚአ',
 'ብ',
 'ሔር',
 '▁በሕ',
 'ልም',
 '▁ያዕቆብ',
 '▁ያዕቆብ',
 '▁ወእቤ',
 '▁ነየ',
 '▁አነ',
 '▁ምንትኑ',
 '▁ውእቱ']

Are all of those tokens in the newdictionary.txt file?

If so, maybe you can try to run the training command line with an extra option: -n_sample 1000.

Then look at the results in the save_data folder that you set in the training config.

(Training won’t run; it’s just a sanity check.)
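
Concretely, that would be the same training command used earlier in this thread with the extra flag (config path assumed from above):

python3 ../OpenNMT-py/train.py --config finetuned/finetune.yaml -n_sample 1000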

The tokens exist in the newdictionary.txt file. I ran the training with the -n_sample option and found src and tgt tokenized sentences under a folder named sample.

</s> eng_Latn ▁And ▁for ▁the ▁second ▁side ▁of ▁the ▁taber na cle ▁on ▁the ▁north ▁side ▁there ▁shall ▁be ▁twenty ▁boards : </s>
</s> eng_Latn ▁Therefore ▁his ▁people ▁return ▁hither : ▁and ▁waters ▁of ▁a ▁full ▁cup ▁are ▁wr ung ▁out ▁to ▁them . </s>
</s> eng_Latn ▁The ▁kee per ▁of ▁the ▁prison ▁looked ▁not ▁to ▁any ▁thing ▁that ▁was ▁under ▁his ▁hand ; ▁because ▁the ▁LORD ▁was ▁with ▁him , ▁and ▁that ▁which ▁he ▁did , ▁the ▁LORD ▁made ▁it ▁to ▁prosper . </s>
</s> eng_Latn ▁Z elek ▁the ▁Ammon ite , ▁Nah arai ▁the ▁Be eroth ite , ▁armour be ar er ▁to ▁Joab ▁the ▁son ▁of ▁Z eru iah , </s>
</s> eng_Latn ▁His ▁sons , ▁and ▁his ▁sons ' ▁sons ▁with ▁him , ▁his ▁da ugh ters , ▁and ▁his ▁sons ' ▁da ugh ters , ▁and ▁all ▁his ▁seed ▁brought ▁he ▁with ▁him ▁into ▁Egypt . </s>
</s> eng_Latn ▁And ▁the ▁LORD ▁spake ▁unto ▁Moses , ▁Say ▁unto ▁Aaron , ▁Str et ch ▁forth ▁thine ▁hand ▁with ▁thy ▁rod ▁over ▁the ▁stre ams , ▁over ▁the ▁ri vers , ▁and ▁over ▁the ▁pond s , ▁and ▁cause ▁fr ogs ▁to ▁come ▁up ▁upon ▁the ▁land ▁of ▁Egypt . </s>
</s> eng_Latn ▁And ▁they ▁did ▁so ; ▁for ▁Aaron ▁stretched ▁out ▁his ▁hand ▁with ▁his ▁rod , ▁and ▁smote ▁the ▁dust ▁of ▁the ▁earth , ▁and ▁it ▁became ▁lice ▁in ▁man , ▁and ▁in ▁beast ; ▁all ▁the ▁dust ▁of ▁the ▁land ▁became ▁lice ▁throughout ▁all ▁the ▁land ▁of ▁Egypt . </s>
.
.
.
gez_Ethi ▁ወካ ልእ ▁እምገጸ ▁ዐረቢ ▁C ኦ ሪት - ዘ ጸ አት -26 S 20 E ዐም ዱ ▁፤
gez_Ethi ▁ወበእንተ ዝ ▁ይት መየጡ ▁ሕዝብየ ▁እምዝየ ፤ ▁ወይ ትረ ከብ ▁ፍጹም ▁መዋዕል ▁በላዕሌ ሆሙ።
gez_Ethi ▁ወአልቦ ▁ዘያ አምር ▁ኵሎ ▁ዘይት ገበር ▁በቤተ ▁ሞቅሕ ▁ሊቀ ▁ዐ ቀብ ተ ▁ሞቅሕ ▁ወኢ ምን ተኒ ▁እስመ ▁ኀደገ ▁ሎቱ ▁ኵሎ ▁ለዮሴፍ ▁እስመ ▁እግዚአ ብ ሔር ▁ሀሎ ▁ምስሌሁ ▁ወኵሎ ▁ዘገብረ ▁ይ ሴር ሖ ▁እግዚአ ብ ሔር ▁በእዴሁ ▁።
gez_Ethi ▁ኤል ዩ ▁ዐ መናዊ ▁፤ ▁ጌ ሎ ሬ ▁ቤ ሮ ታዊ ▁ዘይጸውር ▁ንዋየ ▁ሐቅሉ ▁ለኢዮአብ ▁ወልደ ▁ ሶር ህያ ▁።
gez_Ethi ▁ደቂቁ ▁ወደ ቂ ቀ ▁ደቂቁ ▁ወአ ዋልዲሁ ▁ወአዋልደ ▁አዋል ዲሁ ▁ምስሌሁ ▁።
gez_Ethi ▁ወይቤሎ ▁ሙሴ ▁ለፈርዖን ▁ዐ ድ መኒ ▁ማዕ ዜ ▁እ ጸ ሊ ▁ዲቤ ከ ▁ወዲበ ▁ዐበይ ትከ ▁ወዲበ ▁ሕዝብከ ▁ከመ ▁ይ ማስን ▁ቈ ርነ ና ዓት ▁እምኔከ ▁ወእ ምሕ ዝብ ከ ▁ወእምአ ብ ይ ቲክሙ ▁እንበለ ▁ውስተ ▁ተከዚ ▁ይተ ር ፍ ▁።
gez_Ethi ▁ወእመ ▁አበ ይከ ▁ፈንዎ ተ ▁ሕዝብየ ▁ናሁ ▁አነ ▁እፌኑ ▁ዲቤ ከ ▁ወዲበ ▁ዐበይ ትከ ▁ወዲበ ▁ሕዝብከ ▁ወዲበ ▁አብያ ቲከ ▁ጽ ንጽ ያ ▁ከልብ ▁ወይ መል እ ▁አብያተ ▁ግብጽ ▁ጽ ንጽ ያ ▁ከልብ ▁ወውስተ ሂ ▁ምድር ▁እንተ ▁ሀለዉ ▁ውስቴታ ▁።
gez_Ethi ▁ወዲ ሶን ▁መስፍን ▁ወኢ ሶር ▁መስፍን ▁ወ * ዲ ሳ * ን ▁መስፍን ▁፤ ▁እሉ ▁እሙንቱ ▁መሳፍ ንተ ▁ሆ ሪ ▁በበ መ ሳፍ ንቲሆሙ ▁ውስተ ▁ምድረ ▁ሴይር ▁።
gez_Ethi ▁ወአግብ ኣ ▁እግዚአ ብ ሔር ▁ለላ ኪስ ▁ውስተ ▁እዴሆሙ ▁ለ እስ ራኤል ▁ወነ ሥእዋ ▁በሳ ኒ ተ ▁ዕለት ▁ወቀተልዎሙ ▁በአፈ ▁ኀፂን ▁ወአ ጥፍእ ዋ ▁በከመ ▁ገብር ዋ ▁ለ ሌብ ና ▁።
.
.
.

Does this look right?

Can you give me a link to your newdictionary file, as well as your finetuned.pt (just the LoRa weights)?

I’ll try to check when I get a moment.

Sure, I will do that. Thank you!

Here is the link. I appreciate your help.

Well…

  1. I doubt that with 4K training examples you will make the model learn your new language. You would need much more data, especially with a new alphabet (unless I am wrong and this alphabet is already in the vocab).

  2. When you prepared the new vocab entries, you should not have added English. You would need to learn a spm model on the gez data only, make a small vocab on that data only, add the entries to the vocab, and modify the spm model (see the sketch at the end of this reply).

  3. You forgot to add the new language token, here:

yor_Latn 1 
yue_Hant 1 
zho_Hans 1 
zho_Hant 1 
zul_Latn 1 
▁weep 1 
▁ወለያ 1 
ልአ 1 
▁ወፋ 1 

Just after zul_Latn, before the new vocab entries, you need to add gez_Ethi or something like this.

You can still try, but I doubt you will come up with good results.
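
A rough sketch of point 2 (file names are assumptions based on this thread): train a spm model on the Ge’ez side only, then keep only the pieces that are new with respect to the NLLB dictionary before patching the dictionary and the spm model as you did above.

import sentencepiece as spm

# Train a small BPE model on the Ge'ez side only (assumed file name).
spm.SentencePieceTrainer.train(input='gmmt/gez_train.txt',
                               model_prefix='gez_only_spm',
                               vocab_size=8000, model_type='bpe')

sp = spm.SentencePieceProcessor(model_file='gez_only_spm.model')
gez_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

with open('nllb-200/dictionary.txt', 'r', encoding='utf-8') as f:
    nllb_tokens = {line.split()[0] for line in f if line.strip()}

# Only these pieces need to be appended to the dictionary and to the spm model.
new_entries = [p for p in gez_pieces if p not in nllb_tokens]
print(len(new_entries), "Ge'ez-only pieces to add")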

  1. Once I see the performance on this data, I will try to come up with more data. The alphabet is already in the vocab of the NLLB model for two other languages (amh_Ethi and tir_Ethi).
  2. I will do that!
  3. I actually added the new language token here.
▁ወመድኀኒ 1 
▁ግብር 1 
ልወ 1 
ናታን 1 
▁ወንጉሠ 1 
▁ወረከ 1 
gez_Ethi 1 
አር 1 
ቄድ 1 
▁ፈድፋደ 1 
ረክበ 1 
▁ሖሩ 1 
▁ጽላተ 1 
▁ገብኡ 1 
▁እገብር 1 

But it’s not in the order you said. Does the order matter?

No, it doesn’t matter.


I prepared the new vocab with just the new language and tested, but the result is the same.
I’m confused about this:

[2023-05-24 20:25:48,314 INFO] Adding LoRa layers for linear_values
[2023-05-24 20:25:49,118 INFO] Adding LoRa layers for linear_query
[2023-05-24 20:25:49,990 INFO] Adding LoRa layers for linear_keys
[2023-05-24 20:25:50,824 INFO] Adding LoRa layers for final_linear
[2023-05-24 20:26:02,001 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-05-24 20:26:02,391 INFO] src: 259368 new tokens
[2023-05-24 20:26:02,811 INFO] tgt: 259368 new tokens

“259368 new tokens”: is it considering all the tokens as newly added? I thought it should count only the new tokens I added for the new language.

I ran the training with the original dictionary from NLLB, keeping the other settings the same. The accuracy looks good.

[2023-05-25 08:15:24,991 INFO] Parsed 1 corpora from -data.
[2023-05-25 08:15:24,991 INFO] Loading checkpoint from nllb-200/nllb-200-1.3Bdst-onmt.pt.1
[2023-05-25 08:15:26,602 WARNING] configured transforms is different from checkpoint: +{'prefix', 'sentencepiece', 'suffix'}
[2023-05-25 08:15:26,602 INFO] Get suffix for gmmt: {'src': '</s>', 'tgt': ''}
[2023-05-25 08:15:26,602 INFO] Get suffix for src infer: 
[2023-05-25 08:15:26,602 INFO] Get suffix for tgt infer: 
[2023-05-25 08:15:26,603 INFO] Get prefix for gmmt: {'src': 'eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-25 08:15:26,603 INFO] Get prefix for src infer: 
[2023-05-25 08:15:26,603 INFO] Get prefix for tgt infer: 
[2023-05-25 08:15:26,603 INFO] Get special vocabs from Transforms: {'src': ['</s>', 'eng_Latn'], 'tgt': ['gez_Ethi']}.
[2023-05-25 08:15:27,221 INFO] Updating checkpoint vocabulary with new vocabulary
[2023-05-25 08:15:27,223 INFO] Get suffix for gmmt: {'src': '</s>', 'tgt': ''}
[2023-05-25 08:15:27,224 INFO] Get suffix for src infer: 
[2023-05-25 08:15:27,225 INFO] Get suffix for tgt infer: 
[2023-05-25 08:15:27,226 INFO] Get prefix for gmmt: {'src': 'eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-25 08:15:27,228 INFO] Get prefix for src infer: 
[2023-05-25 08:15:27,229 INFO] Get prefix for tgt infer: 
[2023-05-25 08:15:27,230 INFO] Get special vocabs from Transforms: {'src': ['</s>', 'eng_Latn'], 'tgt': ['gez_Ethi']}.
[2023-05-25 08:15:27,921 INFO] Over-ride model option set to true - use with care
[2023-05-25 08:15:27,921 INFO] Option: config , value: finetuned/finetune.yaml overiding model: 
[2023-05-25 08:15:27,921 INFO] Option: data , value: {'gmmt': {'path_src': 'gmmt/en_train.txt', 'path_tgt': 'gmmt/gez_train.txt', 'transforms': ['sentencepiece', 'prefix', 'suffix', 'filtertoolong'], 'weight': 10, 'src_prefix': 'eng_Latn', 'tgt_prefix': 'gez_Ethi', 'src_suffix': '</s>', 'tgt_suffix': '', 'path_align': None}} overiding model: {}
[2023-05-25 08:15:27,921 INFO] Option: skip_empty_level , value: warning overiding model: silent
[2023-05-25 08:15:27,921 INFO] Option: save_data , value: finetuned overiding model: 
[2023-05-25 08:15:27,921 INFO] Option: src_vocab , value: nllb-200/dictionary.txt overiding model: 
[2023-05-25 08:15:27,921 INFO] Option: tgt_vocab , value: nllb-200/dictionary.txt overiding model: 
[2023-05-25 08:15:27,921 INFO] Option: src_vocab_size , value: 259373 overiding model: 256206
[2023-05-25 08:15:27,921 INFO] Option: tgt_vocab_size , value: 259373 overiding model: 256206
[2023-05-25 08:15:27,921 INFO] Option: src_subword_model , value: flores200_sacrebleu_tokenizer_spm2.model overiding model: 
[2023-05-25 08:15:27,921 INFO] Option: tgt_subword_model , value: flores200_sacrebleu_tokenizer_spm2.model overiding model: 
[2023-05-25 08:15:27,921 INFO] Option: src_seq_length , value: 192 overiding model: 150
[2023-05-25 08:15:27,921 INFO] Option: tgt_seq_length , value: 192 overiding model: 150
[2023-05-25 08:15:27,922 INFO] Option: update_vocab , value: True overiding model: False
[2023-05-25 08:15:27,922 INFO] Option: save_model , value: finetuned/gez_nllb overiding model: nllb
[2023-05-25 08:15:27,922 INFO] Option: save_checkpoint_steps , value: 20 overiding model: 5000
[2023-05-25 08:15:27,922 INFO] Option: train_from , value: nllb-200/nllb-200-1.3Bdst-onmt.pt.1 overiding model: 
[2023-05-25 08:15:27,922 INFO] Option: reset_optim , value: all overiding model: none
[2023-05-25 08:15:27,922 INFO] Option: num_workers , value: 2 overiding model: 4
[2023-05-25 08:15:27,922 INFO] Option: batch_size , value: 256 overiding model: 8192
[2023-05-25 08:15:27,922 INFO] Option: accum_count , value: [32, 32, 32] overiding model: [4]
[2023-05-25 08:15:27,922 INFO] Option: accum_steps , value: [0, 15000, 30000] overiding model: [0]
[2023-05-25 08:15:27,922 INFO] Option: valid_steps , value: 100 overiding model: 5000
[2023-05-25 08:15:27,922 INFO] Option: valid_batch_size , value: 256 overiding model: 4096
[2023-05-25 08:15:27,922 INFO] Option: train_steps , value: 20000 overiding model: 100000
[2023-05-25 08:15:27,922 INFO] Option: optim , value: fusedadam overiding model: 
[2023-05-25 08:15:27,922 INFO] Option: dropout , value: [0.1, 0.1, 0.1] overiding model: [0.1]
[2023-05-25 08:15:27,922 INFO] Option: attention_dropout , value: [0.1, 0.1, 0.1] overiding model: [0.1]
[2023-05-25 08:15:27,922 INFO] Option: dropout_steps , value: [0, 15000, 30000] overiding model: [0]
[2023-05-25 08:15:27,922 INFO] Option: average_decay , value: 0.0005 overiding model: 0.0
[2023-05-25 08:15:27,922 INFO] Option: learning_rate , value: 0.1 overiding model: 5e-05
[2023-05-25 08:15:27,923 INFO] Option: decay_method , value: noam overiding model: none
[2023-05-25 08:15:27,923 INFO] Option: warmup_steps , value: 50 overiding model: 4000
[2023-05-25 08:15:27,923 INFO] Option: log_file , value: finetuned/finetuned.log overiding model: 
[2023-05-25 08:15:27,923 INFO] Option: report_every , value: 10 overiding model: 100
[2023-05-25 08:15:27,923 INFO] Option: _all_transform , value: {'filtertoolong', 'sentencepiece', 'suffix', 'prefix'} overiding model: {'filtertoolong'}
[2023-05-25 08:15:27,923 INFO] Building model...
[2023-05-25 08:15:38,356 INFO] Adding LoRa layers for linear_values
[2023-05-25 08:15:39,213 INFO] Adding LoRa layers for linear_query
[2023-05-25 08:15:40,030 INFO] Adding LoRa layers for linear_keys
[2023-05-25 08:15:40,834 INFO] Adding LoRa layers for final_linear
[2023-05-25 08:15:51,918 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-05-25 08:15:53,593 INFO] src: 0 new tokens
[2023-05-25 08:15:57,185 INFO] tgt: 0 new tokens
[2023-05-25 08:15:59,533 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(256206, 1024, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-23): 24 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=8192, bias=True)
          (w_2): Linear(in_features=8192, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(256206, 1024, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
    (transformer_layers): ModuleList(
      (0-23): 24 x TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=8192, bias=True)
          (w_2): Linear(in_features=8192, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (layer_norm_2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (generator): Linear(in_features=1024, out_features=256206, bias=True)
)
[2023-05-25 08:15:59,541 INFO] encoder: 766484480
[2023-05-25 08:15:59,541 INFO] decoder: 605589710
[2023-05-25 08:15:59,542 INFO] * number of parameters: 1372074190
[2023-05-25 08:15:59,542 INFO]  * src vocab size = 256206
[2023-05-25 08:15:59,542 INFO]  * tgt vocab size = 256206

This fp16_optimizer is designed to only work with apex.contrib.optimizers.*
To update, use updated optimizers with AMP.
[2023-05-25 08:15:59,655 INFO] Get suffix for gmmt: {'src': '</s>', 'tgt': ''}
[2023-05-25 08:15:59,655 INFO] Get suffix for src infer: 
[2023-05-25 08:15:59,655 INFO] Get suffix for tgt infer: 
[2023-05-25 08:15:59,655 INFO] Get prefix for gmmt: {'src': 'eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-25 08:15:59,655 INFO] Get prefix for src infer: 
[2023-05-25 08:15:59,655 INFO] Get prefix for tgt infer: 
[2023-05-25 08:15:59,734 INFO] Get suffix for gmmt: {'src': '</s>', 'tgt': ''}
[2023-05-25 08:15:59,734 INFO] Get suffix for src infer: 
[2023-05-25 08:15:59,734 INFO] Get suffix for tgt infer: 
[2023-05-25 08:15:59,734 INFO] Get prefix for gmmt: {'src': 'eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-25 08:15:59,734 INFO] Get prefix for src infer: 
[2023-05-25 08:15:59,734 INFO] Get prefix for tgt infer: 
[2023-05-25 08:15:59,778 INFO] Starting training on GPU: [0]
[2023-05-25 08:15:59,778 INFO] Start training loop without validation...
[2023-05-25 08:15:59,778 INFO] Scoring with: TransformPipe()

Grad overflow on iteration 1
Using dynamic loss scale of 65536
[2023-05-25 08:18:40,883 INFO] Step 10/20000; acc: 30.6; ppl: 888.3; xent: 6.8; lr: 0.00010; sents:    2064; bsz:  230/ 164/ 6; 458/325 tok/s;    161 sec;

Grad overflow on iteration 13
Using dynamic loss scale of 32768.0
[2023-05-25 08:20:04,527 INFO] Step 20/20000; acc: 33.3; ppl: 736.5; xent: 6.6; lr: 0.00019; sents:    1900; bsz:  230/ 163/ 6; 880/624 tok/s;    245 sec;
[2023-05-25 08:20:04,610 INFO] Saving checkpoint finetuned/gez_nllb_step_20.pt
[2023-05-25 08:21:29,094 INFO] Step 30/20000; acc: 38.4; ppl: 536.0; xent: 6.3; lr: 0.00027; sents:    1840; bsz:  227/ 161/ 6; 861/608 tok/s;    329 sec;

The inference also worked for a language originally in NLLB (amh_Ethi), though it returned the usual characters ( ⁇ ⁇ ⁇) for Ge’ez (gez_Ethi), as the language token is not in the vocab. Doesn’t this mean there’s something wrong in the way I prepared the dictionary?

with open('nllb-200/dictionary.txt', 'r') as file:
    nllb_tokens = [line.strip().split()[0] for line in file]

added_tokens = set(gmmt_tokens).difference(set(nllb_tokens))

# Insert the new tokens plus the gez_Ethi language token before the last 205 entries.
newtokens = nllb_tokens[:-205] + list(added_tokens) + ['gez_Ethi'] + nllb_tokens[-205:]

with open('newdictionary.txt', 'w') as file:
    for line in newtokens:
        file.write(f"{line} 1 \n")

It must be something about the format of the newdictionary file I prepared. Even when just copying the original NLLB tokens into a newdictionary file, it didn’t work. The files look exactly the same to me. How did you add the new tokens to the dictionary?

file.write(f"{line} 1 \n")

I found it! The problem was the space I added after the 1 when writing each token to the dictionary file. It should have been file.write(f"{line} 1\n"). Fixing that, it looks good now.
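
For anyone hitting the same thing, comparing the raw lines of both dictionaries with repr() is a quick way to surface this kind of whitespace mismatch (a small sketch, using the file names from above):

with open('nllb-200/dictionary.txt', 'r', encoding='utf-8') as f:
    orig_line = f.readline()
with open('newdictionary.txt', 'r', encoding='utf-8') as f:
    new_line = f.readline()

# repr() exposes a stray trailing space, e.g. '<token> 1 \n' vs '<token> 1\n'.
print(repr(orig_line))
print(repr(new_line))

With the corrected file, update_vocab now reports only the genuinely new tokens (3167 instead of the whole vocabulary):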

[2023-05-25 09:14:07,519 INFO] Parsed 1 corpora from -data.
[2023-05-25 09:14:07,520 INFO] Loading checkpoint from nllb-200/nllb-200-1.3Bdst-onmt.pt.1
[2023-05-25 09:14:09,482 WARNING] configured transforms is different from checkpoint: +{'suffix', 'sentencepiece', 'prefix'}
[2023-05-25 09:14:09,482 INFO] Get suffix for gmmt: {'src': '</s>', 'tgt': ''}
[2023-05-25 09:14:09,482 INFO] Get suffix for src infer: 
[2023-05-25 09:14:09,482 INFO] Get suffix for tgt infer: 
[2023-05-25 09:14:09,482 INFO] Get prefix for gmmt: {'src': 'eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-25 09:14:09,482 INFO] Get prefix for src infer: 
[2023-05-25 09:14:09,482 INFO] Get prefix for tgt infer: 
[2023-05-25 09:14:09,482 INFO] Get special vocabs from Transforms: {'src': ['</s>', 'eng_Latn'], 'tgt': ['gez_Ethi']}.
[2023-05-25 09:14:10,275 INFO] Updating checkpoint vocabulary with new vocabulary
[2023-05-25 09:14:10,276 INFO] Get suffix for gmmt: {'src': '</s>', 'tgt': ''}
[2023-05-25 09:14:10,278 INFO] Get suffix for src infer: 
[2023-05-25 09:14:10,279 INFO] Get suffix for tgt infer: 
[2023-05-25 09:14:10,280 INFO] Get prefix for gmmt: {'src': 'eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-25 09:14:10,282 INFO] Get prefix for src infer: 
[2023-05-25 09:14:10,283 INFO] Get prefix for tgt infer: 
[2023-05-25 09:14:10,285 INFO] Get special vocabs from Transforms: {'src': ['</s>', 'eng_Latn'], 'tgt': ['gez_Ethi']}.
[2023-05-25 09:14:11,078 INFO] Over-ride model option set to true - use with care
[2023-05-25 09:14:11,078 INFO] Option: config , value: finetuned/finetune.yaml overiding model: 
[2023-05-25 09:14:11,078 INFO] Option: data , value: {'gmmt': {'path_src': 'gmmt/en_train.txt', 'path_tgt': 'gmmt/gez_train.txt', 'transforms': ['sentencepiece', 'prefix', 'suffix', 'filtertoolong'], 'weight': 10, 'src_prefix': 'eng_Latn', 'tgt_prefix': 'gez_Ethi', 'src_suffix': '</s>', 'tgt_suffix': '', 'path_align': None}} overiding model: {}
[2023-05-25 09:14:11,078 INFO] Option: skip_empty_level , value: warning overiding model: silent
[2023-05-25 09:14:11,078 INFO] Option: save_data , value: finetuned overiding model: 
[2023-05-25 09:14:11,078 INFO] Option: src_vocab , value: newdictionary.txt overiding model: 
[2023-05-25 09:14:11,078 INFO] Option: tgt_vocab , value: newdictionary.txt overiding model: 
[2023-05-25 09:14:11,078 INFO] Option: src_vocab_size , value: 259373 overiding model: 256206
[2023-05-25 09:14:11,078 INFO] Option: tgt_vocab_size , value: 259373 overiding model: 256206
[2023-05-25 09:14:11,078 INFO] Option: src_subword_model , value: flores200_sacrebleu_tokenizer_spm2.model overiding model: 
[2023-05-25 09:14:11,078 INFO] Option: tgt_subword_model , value: flores200_sacrebleu_tokenizer_spm2.model overiding model: 
[2023-05-25 09:14:11,078 INFO] Option: src_seq_length , value: 192 overiding model: 150
[2023-05-25 09:14:11,078 INFO] Option: tgt_seq_length , value: 192 overiding model: 150
[2023-05-25 09:14:11,078 INFO] Option: update_vocab , value: True overiding model: False
[2023-05-25 09:14:11,079 INFO] Option: save_model , value: finetuned/gez_nllb overiding model: nllb
[2023-05-25 09:14:11,079 INFO] Option: save_checkpoint_steps , value: 20 overiding model: 5000
[2023-05-25 09:14:11,079 INFO] Option: train_from , value: nllb-200/nllb-200-1.3Bdst-onmt.pt.1 overiding model: 
[2023-05-25 09:14:11,079 INFO] Option: reset_optim , value: all overiding model: none
[2023-05-25 09:14:11,079 INFO] Option: num_workers , value: 2 overiding model: 4
[2023-05-25 09:14:11,079 INFO] Option: batch_size , value: 256 overiding model: 8192
[2023-05-25 09:14:11,079 INFO] Option: accum_count , value: [32, 32, 32] overiding model: [4]
[2023-05-25 09:14:11,079 INFO] Option: accum_steps , value: [0, 15000, 30000] overiding model: [0]
[2023-05-25 09:14:11,079 INFO] Option: valid_steps , value: 100 overiding model: 5000
[2023-05-25 09:14:11,079 INFO] Option: valid_batch_size , value: 256 overiding model: 4096
[2023-05-25 09:14:11,079 INFO] Option: train_steps , value: 20000 overiding model: 100000
[2023-05-25 09:14:11,079 INFO] Option: optim , value: fusedadam overiding model: 
[2023-05-25 09:14:11,079 INFO] Option: dropout , value: [0.1, 0.1, 0.1] overiding model: [0.1]
[2023-05-25 09:14:11,079 INFO] Option: attention_dropout , value: [0.1, 0.1, 0.1] overiding model: [0.1]
[2023-05-25 09:14:11,079 INFO] Option: dropout_steps , value: [0, 15000, 30000] overiding model: [0]
[2023-05-25 09:14:11,079 INFO] Option: average_decay , value: 0.0005 overiding model: 0.0
[2023-05-25 09:14:11,079 INFO] Option: learning_rate , value: 0.1 overiding model: 5e-05
[2023-05-25 09:14:11,079 INFO] Option: decay_method , value: noam overiding model: none
[2023-05-25 09:14:11,079 INFO] Option: warmup_steps , value: 50 overiding model: 4000
[2023-05-25 09:14:11,079 INFO] Option: log_file , value: finetuned/finetuned.log overiding model: 
[2023-05-25 09:14:11,079 INFO] Option: report_every , value: 10 overiding model: 100
[2023-05-25 09:14:11,079 INFO] Option: _all_transform , value: {'suffix', 'sentencepiece', 'filtertoolong', 'prefix'} overiding model: {'filtertoolong'}
[2023-05-25 09:14:11,079 INFO] Building model...
[2023-05-25 09:14:20,830 INFO] Adding LoRa layers for linear_values
[2023-05-25 09:14:21,629 INFO] Adding LoRa layers for linear_query
[2023-05-25 09:14:22,429 INFO] Adding LoRa layers for linear_keys
[2023-05-25 09:14:23,350 INFO] Adding LoRa layers for final_linear
[2023-05-25 09:14:34,458 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-05-25 09:14:36,144 INFO] src: 3167 new tokens
[2023-05-25 09:14:40,127 INFO] tgt: 3167 new tokens
[2023-05-25 09:14:42,640 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(259373, 1024, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-23): 24 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=8192, bias=True)
          (w_2): Linear(in_features=8192, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(259373, 1024, padding_idx=1)
        )
        (pe): PositionalEncoding()
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
    (transformer_layers): ModuleList(
      (0-23): 24 x TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=1024, out_features=8192, bias=True)
          (w_2): Linear(in_features=8192, out_features=1024, bias=True)
          (layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (drop): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_values): Linear(in_features=1024, out_features=1024, bias=True)
          (linear_query): Linear(in_features=1024, out_features=1024, bias=True)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.1, inplace=False)
          (final_linear): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (layer_norm_2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (generator): Linear(in_features=1024, out_features=259373, bias=True)
)
[2023-05-25 09:14:42,648 INFO] encoder: 769727488
[2023-05-25 09:14:42,649 INFO] decoder: 605592877
[2023-05-25 09:14:42,649 INFO] * number of parameters: 1375320365
[2023-05-25 09:14:42,649 INFO]  * src vocab size = 259373
[2023-05-25 09:14:42,649 INFO]  * tgt vocab size = 259373

This fp16_optimizer is designed to only work with apex.contrib.optimizers.*
To update, use updated optimizers with AMP.
[2023-05-25 09:14:42,679 INFO] Get suffix for gmmt: {'src': '</s>', 'tgt': ''}
[2023-05-25 09:14:42,680 INFO] Get suffix for src infer: 
[2023-05-25 09:14:42,680 INFO] Get suffix for tgt infer: 
[2023-05-25 09:14:42,798 INFO] Get prefix for gmmt: {'src': 'eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-25 09:14:42,798 INFO] Get prefix for src infer: 
[2023-05-25 09:14:42,798 INFO] Get prefix for tgt infer: 
[2023-05-25 09:14:42,799 INFO] Get suffix for gmmt: {'src': '</s>', 'tgt': ''}
[2023-05-25 09:14:42,799 INFO] Get suffix for src infer: 
[2023-05-25 09:14:42,799 INFO] Get suffix for tgt infer: 
[2023-05-25 09:14:42,943 INFO] Get prefix for gmmt: {'src': 'eng_Latn', 'tgt': 'gez_Ethi'}
[2023-05-25 09:14:42,943 INFO] Get prefix for src infer: 
[2023-05-25 09:14:42,943 INFO] Get prefix for tgt infer: 
[2023-05-25 09:14:43,005 INFO] Starting training on GPU: [0]
[2023-05-25 09:14:43,005 INFO] Start training loop without validation...
[2023-05-25 09:14:43,005 INFO] Scoring with: TransformPipe()
[2023-05-25 09:17:23,356 INFO] Step 10/20000; acc: 4.2; ppl: 206400.6; xent: 12.2; lr: 0.00010; sents:    2064; bsz:  230/ 164/ 6; 460/327 tok/s;    160 sec;

Grad overflow on iteration 16
Using dynamic loss scale of 65536
[2023-05-25 09:18:46,250 INFO] Step 20/20000; acc: 4.1; ppl: 179784.1; xent: 12.1; lr: 0.00019; sents:    1900; bsz:  230/ 163/ 6; 888/629 tok/s;    243 sec;
[2023-05-25 09:18:46,332 INFO] Saving checkpoint finetuned/gez_nllb_step_20.pt
[2023-05-25 09:20:10,259 INFO] Step 30/20000; acc: 4.1; ppl: 134603.0; xent: 11.8; lr: 0.00027; sents:    1840; bsz:  227/ 161/ 6; 866/612 tok/s;    327 sec;
[2023-05-25 09:21:34,169 INFO] Step 40/20000; acc: 4.4; ppl: 92887.9; xent: 11.4; lr: 0.00036; sents:    1899; bsz:  230/ 163/ 6; 877/623 tok/s;    411 sec;
[2023-05-28 15:45:40,969 INFO] Step 8610/10000; acc: 50.3; ppl: 162.4; xent: 5.1; lr: 0.00168; sents:    1999; bsz:  230/ 163/ 6; 861/611 tok/s;  73552 sec;
[2023-05-28 15:47:06,671 INFO] Step 8620/10000; acc: 49.7; ppl: 167.2; xent: 5.1; lr: 0.00168; sents:    1939; bsz:  229/ 164/ 6; 854/611 tok/s;  73638 sec;
[2023-05-28 15:48:32,677 INFO] Step 8630/10000; acc: 50.2; ppl: 161.4; xent: 5.1; lr: 0.00168; sents:    1924; bsz:  229/ 162/ 6; 853/604 tok/s;  73724 sec;
[2023-05-28 15:49:57,772 INFO] Step 8640/10000; acc: 50.0; ppl: 165.5; xent: 5.1; lr: 0.00168; sents:    1977; bsz:  230/ 162/ 6; 865/609 tok/s;  73809 sec;
[2023-05-28 15:51:21,594 INFO] Step 8650/10000; acc: 50.3; ppl: 162.9; xent: 5.1; lr: 0.00168; sents:    1974; bsz:  231/ 162/ 6; 881/620 tok/s;  73893 sec;
[2023-05-28 15:52:47,300 INFO] Step 8660/10000; acc: 49.8; ppl: 166.8; xent: 5.1; lr: 0.00168; sents:    1989; bsz:  230/ 163/ 6; 859/607 tok/s;  73979 sec;
[2023-05-28 15:54:12,817 INFO] Step 8670/10000; acc: 49.7; ppl: 168.4; xent: 5.1; lr: 0.00168; sents:    1917; bsz:  230/ 163/ 6; 860/609 tok/s;  74064 sec;
[2023-05-28 15:55:37,979 INFO] Step 8680/10000; acc: 50.0; ppl: 162.8; xent: 5.1; lr: 0.00168; sents:    1908; bsz:  229/ 163/ 6; 861/611 tok/s;  74149 sec;
[2023-05-28 15:57:03,894 INFO] Step 8690/10000; acc: 49.9; ppl: 165.3; xent: 5.1; lr: 0.00168; sents:    1990; bsz:  231/ 162/ 6; 861/603 tok/s;  74235 sec;
[2023-05-28 15:58:29,168 INFO] Step 8700/10000; acc: 50.5; ppl: 158.1; xent: 5.1; lr: 0.00168; sents:    1991; bsz:  229/ 165/ 6; 858/621 tok/s;  74320 sec;
[2023-05-28 15:58:29,168 INFO] Train perplexity: 228.694
[2023-05-28 15:58:29,168 INFO] Train accuracy: 44.818
[2023-05-28 15:58:29,168 INFO] Sentences processed: 1.70468e+06
[2023-05-28 15:58:29,169 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 15:58:29,238 INFO] Saving checkpoint finetuned/gez_nllb_step_8700.pt
[2023-05-28 15:59:54,979 INFO] Step 8710/10000; acc: 50.7; ppl: 156.8; xent: 5.1; lr: 0.00167; sents:    1986; bsz:  229/ 162/ 6; 856/606 tok/s;  74406 sec;
[2023-05-28 16:01:20,599 INFO] Step 8720/10000; acc: 50.3; ppl: 160.0; xent: 5.1; lr: 0.00167; sents:    1927; bsz:  229/ 163/ 6; 856/610 tok/s;  74492 sec;
[2023-05-28 16:02:45,703 INFO] Step 8730/10000; acc: 49.8; ppl: 166.3; xent: 5.1; lr: 0.00167; sents:    1907; bsz:  230/ 161/ 6; 865/607 tok/s;  74577 sec;
[2023-05-28 16:04:10,567 INFO] Step 8740/10000; acc: 50.1; ppl: 162.6; xent: 5.1; lr: 0.00167; sents:    1915; bsz:  231/ 163/ 6; 870/615 tok/s;  74662 sec;
[2023-05-28 16:05:35,934 INFO] Step 8750/10000; acc: 49.6; ppl: 169.5; xent: 5.1; lr: 0.00167; sents:    1911; bsz:  230/ 162/ 6; 862/608 tok/s;  74747 sec;
[2023-05-28 16:07:01,305 INFO] Step 8760/10000; acc: 50.1; ppl: 163.9; xent: 5.1; lr: 0.00167; sents:    1971; bsz:  230/ 164/ 6; 862/614 tok/s;  74833 sec;
[2023-05-28 16:08:26,265 INFO] Step 8770/10000; acc: 50.5; ppl: 158.2; xent: 5.1; lr: 0.00167; sents:    1961; bsz:  231/ 163/ 6; 871/615 tok/s;  74917 sec;
[2023-05-28 16:09:51,397 INFO] Step 8780/10000; acc: 50.0; ppl: 164.5; xent: 5.1; lr: 0.00167; sents:    1904; bsz:  230/ 162/ 6; 864/608 tok/s;  75003 sec;
[2023-05-28 16:11:16,674 INFO] Step 8790/10000; acc: 49.8; ppl: 166.3; xent: 5.1; lr: 0.00167; sents:    1891; bsz:  229/ 162/ 6; 861/608 tok/s;  75088 sec;
[2023-05-28 16:12:42,297 INFO] Step 8800/10000; acc: 50.1; ppl: 161.5; xent: 5.1; lr: 0.00167; sents:    1903; bsz:  229/ 164/ 6; 856/612 tok/s;  75174 sec;
[2023-05-28 16:12:42,297 INFO] Train perplexity: 227.816
[2023-05-28 16:12:42,297 INFO] Train accuracy: 44.8778
[2023-05-28 16:12:42,297 INFO] Sentences processed: 1.72396e+06
[2023-05-28 16:12:42,298 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 16:12:42,367 INFO] Saving checkpoint finetuned/gez_nllb_step_8800.pt
[2023-05-28 16:14:08,989 INFO] Step 8810/10000; acc: 50.0; ppl: 164.5; xent: 5.1; lr: 0.00166; sents:    1974; bsz:  230/ 164/ 6; 849/607 tok/s;  75260 sec;
[2023-05-28 16:15:34,668 INFO] Step 8820/10000; acc: 50.3; ppl: 162.1; xent: 5.1; lr: 0.00166; sents:    1970; bsz:  231/ 163/ 6; 863/611 tok/s;  75346 sec;
[2023-05-28 16:17:00,487 INFO] Step 8830/10000; acc: 50.3; ppl: 160.2; xent: 5.1; lr: 0.00166; sents:    1993; bsz:  230/ 165/ 6; 858/615 tok/s;  75432 sec;
[2023-05-28 16:18:26,461 INFO] Step 8840/10000; acc: 50.2; ppl: 162.2; xent: 5.1; lr: 0.00166; sents:    2040; bsz:  231/ 166/ 6; 859/616 tok/s;  75518 sec;
[2023-05-28 16:19:52,630 INFO] Step 8850/10000; acc: 50.3; ppl: 161.0; xent: 5.1; lr: 0.00166; sents:    2001; bsz:  230/ 162/ 6; 853/603 tok/s;  75604 sec;
[2023-05-28 16:21:17,766 INFO] Step 8860/10000; acc: 49.9; ppl: 165.7; xent: 5.1; lr: 0.00166; sents:    1955; bsz:  230/ 162/ 6; 865/609 tok/s;  75689 sec;
[2023-05-28 16:22:43,409 INFO] Step 8870/10000; acc: 50.2; ppl: 162.5; xent: 5.1; lr: 0.00166; sents:    2037; bsz:  231/ 163/ 6; 861/608 tok/s;  75775 sec;
[2023-05-28 16:24:08,272 INFO] Step 8880/10000; acc: 50.1; ppl: 162.7; xent: 5.1; lr: 0.00166; sents:    1939; bsz:  231/ 164/ 6; 870/617 tok/s;  75859 sec;
[2023-05-28 16:25:34,449 INFO] Step 8890/10000; acc: 50.4; ppl: 159.2; xent: 5.1; lr: 0.00166; sents:    2007; bsz:  230/ 165/ 6; 855/612 tok/s;  75946 sec;
[2023-05-28 16:26:59,688 INFO] Step 8900/10000; acc: 50.0; ppl: 165.5; xent: 5.1; lr: 0.00166; sents:    1981; bsz:  229/ 164/ 6; 859/614 tok/s;  76031 sec;
[2023-05-28 16:26:59,688 INFO] Train perplexity: 226.95
[2023-05-28 16:26:59,688 INFO] Train accuracy: 44.9376
[2023-05-28 16:26:59,688 INFO] Sentences processed: 1.74386e+06
[2023-05-28 16:26:59,689 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 16:26:59,757 INFO] Saving checkpoint finetuned/gez_nllb_step_8900.pt
[2023-05-28 16:28:25,356 INFO] Step 8910/10000; acc: 50.2; ppl: 161.2; xent: 5.1; lr: 0.00166; sents:    2020; bsz:  231/ 161/ 6; 861/602 tok/s;  76117 sec;
[2023-05-28 16:29:50,084 INFO] Step 8920/10000; acc: 50.3; ppl: 160.0; xent: 5.1; lr: 0.00165; sents:    1922; bsz:  230/ 163/ 6; 869/614 tok/s;  76201 sec;
[2023-05-28 16:31:15,516 INFO] Step 8930/10000; acc: 50.0; ppl: 165.8; xent: 5.1; lr: 0.00165; sents:    1907; bsz:  229/ 162/ 6; 858/607 tok/s;  76287 sec;
[2023-05-28 16:32:40,967 INFO] Step 8940/10000; acc: 49.9; ppl: 167.5; xent: 5.1; lr: 0.00165; sents:    1969; bsz:  230/ 163/ 6; 860/610 tok/s;  76372 sec;
[2023-05-28 16:34:06,543 INFO] Step 8950/10000; acc: 50.3; ppl: 161.8; xent: 5.1; lr: 0.00165; sents:    1902; bsz:  229/ 163/ 6; 856/608 tok/s;  76458 sec;
[2023-05-28 16:35:31,394 INFO] Step 8960/10000; acc: 50.3; ppl: 160.1; xent: 5.1; lr: 0.00165; sents:    1921; bsz:  230/ 161/ 6; 866/609 tok/s;  76543 sec;
[2023-05-28 16:36:56,951 INFO] Step 8970/10000; acc: 50.3; ppl: 159.7; xent: 5.1; lr: 0.00165; sents:    1938; bsz:  229/ 163/ 6; 857/609 tok/s;  76628 sec;
[2023-05-28 16:38:21,371 INFO] Step 8980/10000; acc: 50.2; ppl: 162.0; xent: 5.1; lr: 0.00165; sents:    1986; bsz:  230/ 163/ 6; 872/618 tok/s;  76713 sec;
[2023-05-28 16:39:47,036 INFO] Step 8990/10000; acc: 50.1; ppl: 163.5; xent: 5.1; lr: 0.00165; sents:    1961; bsz:  230/ 162/ 6; 859/605 tok/s;  76798 sec;
[2023-05-28 16:41:11,379 INFO] Step 9000/10000; acc: 50.2; ppl: 161.7; xent: 5.1; lr: 0.00165; sents:    1954; bsz:  227/ 162/ 6; 862/615 tok/s;  76883 sec;
[2023-05-28 16:41:11,379 INFO] Train perplexity: 226.11
[2023-05-28 16:41:11,379 INFO] Train accuracy: 44.9954
[2023-05-28 16:41:11,379 INFO] Sentences processed: 1.76334e+06
[2023-05-28 16:41:11,379 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 16:41:11,469 INFO] Saving checkpoint finetuned/gez_nllb_step_9000.pt
[2023-05-28 16:42:37,295 INFO] Step 9010/10000; acc: 49.6; ppl: 167.9; xent: 5.1; lr: 0.00165; sents:    1876; bsz:  228/ 162/ 6; 849/602 tok/s;  76969 sec;
[2023-05-28 16:44:02,407 INFO] Step 9020/10000; acc: 50.0; ppl: 164.3; xent: 5.1; lr: 0.00165; sents:    1951; bsz:  230/ 164/ 6; 865/616 tok/s;  77054 sec;
[2023-05-28 16:45:28,035 INFO] Step 9030/10000; acc: 50.5; ppl: 159.4; xent: 5.1; lr: 0.00164; sents:    1939; bsz:  230/ 163/ 6; 858/611 tok/s;  77139 sec;
[2023-05-28 16:46:53,263 INFO] Step 9040/10000; acc: 49.5; ppl: 169.6; xent: 5.1; lr: 0.00164; sents:    1936; bsz:  230/ 163/ 6; 863/611 tok/s;  77224 sec;
[2023-05-28 16:48:18,829 INFO] Step 9050/10000; acc: 50.6; ppl: 158.2; xent: 5.1; lr: 0.00164; sents:    2060; bsz:  232/ 164/ 6; 866/615 tok/s;  77310 sec;
[2023-05-28 16:49:44,270 INFO] Step 9060/10000; acc: 50.3; ppl: 162.0; xent: 5.1; lr: 0.00164; sents:    1991; bsz:  230/ 163/ 6; 863/609 tok/s;  77395 sec;
[2023-05-28 16:51:09,727 INFO] Step 9070/10000; acc: 50.2; ppl: 161.5; xent: 5.1; lr: 0.00164; sents:    1946; bsz:  230/ 164/ 6; 862/613 tok/s;  77481 sec;
[2023-05-28 16:52:35,248 INFO] Step 9080/10000; acc: 49.9; ppl: 163.8; xent: 5.1; lr: 0.00164; sents:    1899; bsz:  230/ 164/ 6; 862/615 tok/s;  77566 sec;
[2023-05-28 16:54:00,738 INFO] Step 9090/10000; acc: 50.2; ppl: 162.4; xent: 5.1; lr: 0.00164; sents:    1917; bsz:  230/ 162/ 6; 860/607 tok/s;  77652 sec;
[2023-05-28 16:55:25,583 INFO] Step 9100/10000; acc: 49.9; ppl: 166.1; xent: 5.1; lr: 0.00164; sents:    1946; bsz:  230/ 161/ 6; 868/609 tok/s;  77737 sec;
[2023-05-28 16:55:25,583 INFO] Train perplexity: 225.306
[2023-05-28 16:55:25,583 INFO] Train accuracy: 45.051
[2023-05-28 16:55:25,583 INFO] Sentences processed: 1.7828e+06
[2023-05-28 16:55:25,583 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 16:55:25,652 INFO] Saving checkpoint finetuned/gez_nllb_step_9100.pt
[2023-05-28 16:56:52,245 INFO] Step 9110/10000; acc: 50.1; ppl: 162.1; xent: 5.1; lr: 0.00164; sents:    1954; bsz:  229/ 163/ 6; 846/600 tok/s;  77823 sec;
[2023-05-28 16:58:17,570 INFO] Step 9120/10000; acc: 50.7; ppl: 156.7; xent: 5.1; lr: 0.00164; sents:    1999; bsz:  229/ 162/ 6; 859/609 tok/s;  77909 sec;
[2023-05-28 16:59:43,448 INFO] Step 9130/10000; acc: 50.3; ppl: 161.3; xent: 5.1; lr: 0.00164; sents:    1957; bsz:  230/ 163/ 6; 857/607 tok/s;  77995 sec;
[2023-05-28 17:01:08,176 INFO] Step 9140/10000; acc: 50.5; ppl: 159.1; xent: 5.1; lr: 0.00163; sents:    1942; bsz:  229/ 162/ 6; 863/611 tok/s;  78079 sec;
[2023-05-28 17:02:33,747 INFO] Step 9150/10000; acc: 50.1; ppl: 162.3; xent: 5.1; lr: 0.00163; sents:    1867; bsz:  227/ 161/ 6; 850/602 tok/s;  78165 sec;
[2023-05-28 17:03:58,872 INFO] Step 9160/10000; acc: 50.1; ppl: 162.4; xent: 5.1; lr: 0.00163; sents:    1966; bsz:  231/ 163/ 6; 869/614 tok/s;  78250 sec;
[2023-05-28 17:05:24,318 INFO] Step 9170/10000; acc: 50.2; ppl: 161.0; xent: 5.1; lr: 0.00163; sents:    1940; bsz:  229/ 163/ 6; 859/609 tok/s;  78336 sec;
[2023-05-28 17:06:50,019 INFO] Step 9180/10000; acc: 50.1; ppl: 161.8; xent: 5.1; lr: 0.00163; sents:    1903; bsz:  230/ 162/ 6; 859/604 tok/s;  78421 sec;
[2023-05-28 17:08:16,154 INFO] Step 9190/10000; acc: 50.1; ppl: 163.2; xent: 5.1; lr: 0.00163; sents:    1900; bsz:  229/ 163/ 6; 851/605 tok/s;  78507 sec;
[2023-05-28 17:09:40,871 INFO] Step 9200/10000; acc: 49.8; ppl: 165.6; xent: 5.1; lr: 0.00163; sents:    1861; bsz:  229/ 162/ 6; 865/612 tok/s;  78592 sec;
[2023-05-28 17:09:40,871 INFO] Train perplexity: 224.497
[2023-05-28 17:09:40,871 INFO] Train accuracy: 45.1068
[2023-05-28 17:09:40,871 INFO] Sentences processed: 1.80209e+06
[2023-05-28 17:09:40,871 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 17:09:40,946 INFO] Saving checkpoint finetuned/gez_nllb_step_9200.pt
[2023-05-28 17:11:07,457 INFO] Step 9210/10000; acc: 50.5; ppl: 157.6; xent: 5.1; lr: 0.00163; sents:    2003; bsz:  231/ 165/ 6; 854/611 tok/s;  78679 sec;
[2023-05-28 17:12:32,761 INFO] Step 9220/10000; acc: 50.0; ppl: 163.4; xent: 5.1; lr: 0.00163; sents:    1987; bsz:  231/ 163/ 6; 866/613 tok/s;  78764 sec;
[2023-05-28 17:13:58,709 INFO] Step 9230/10000; acc: 50.6; ppl: 159.1; xent: 5.1; lr: 0.00163; sents:    2110; bsz:  233/ 167/ 7; 869/621 tok/s;  78850 sec;
[2023-05-28 17:15:24,034 INFO] Step 9240/10000; acc: 50.6; ppl: 158.7; xent: 5.1; lr: 0.00163; sents:    2031; bsz:  229/ 162/ 6; 859/607 tok/s;  78935 sec;
[2023-05-28 17:16:50,159 INFO] Step 9250/10000; acc: 49.9; ppl: 163.0; xent: 5.1; lr: 0.00162; sents:    1911; bsz:  228/ 162/ 6; 848/602 tok/s;  79021 sec;
[2023-05-28 17:18:15,086 INFO] Step 9260/10000; acc: 50.3; ppl: 161.1; xent: 5.1; lr: 0.00162; sents:    1995; bsz:  232/ 164/ 6; 875/617 tok/s;  79106 sec;
[2023-05-28 17:19:40,974 INFO] Step 9270/10000; acc: 50.3; ppl: 159.8; xent: 5.1; lr: 0.00162; sents:    1952; bsz:  231/ 163/ 6; 861/606 tok/s;  79192 sec;
[2023-05-28 17:21:05,756 INFO] Step 9280/10000; acc: 50.8; ppl: 156.1; xent: 5.1; lr: 0.00162; sents:    2024; bsz:  230/ 165/ 6; 867/624 tok/s;  79277 sec;
[2023-05-28 17:22:30,851 INFO] Step 9290/10000; acc: 50.1; ppl: 163.3; xent: 5.1; lr: 0.00162; sents:    1978; bsz:  230/ 164/ 6; 865/618 tok/s;  79362 sec;
[2023-05-28 17:23:55,098 INFO] Step 9300/10000; acc: 50.5; ppl: 160.7; xent: 5.1; lr: 0.00162; sents:    2033; bsz:  230/ 163/ 6; 873/619 tok/s;  79446 sec;
[2023-05-28 17:23:55,098 INFO] Train perplexity: 223.681
[2023-05-28 17:23:55,098 INFO] Train accuracy: 45.1635
[2023-05-28 17:23:55,098 INFO] Sentences processed: 1.82211e+06
[2023-05-28 17:23:55,098 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 17:23:55,167 INFO] Saving checkpoint finetuned/gez_nllb_step_9300.pt
[2023-05-28 17:25:21,775 INFO] Step 9310/10000; acc: 50.3; ppl: 162.0; xent: 5.1; lr: 0.00162; sents:    2004; bsz:  230/ 164/ 6; 849/607 tok/s;  79533 sec;
[2023-05-28 17:26:46,010 INFO] Step 9320/10000; acc: 50.5; ppl: 159.9; xent: 5.1; lr: 0.00162; sents:    2029; bsz:  230/ 165/ 6; 875/628 tok/s;  79617 sec;
[2023-05-28 17:28:11,381 INFO] Step 9330/10000; acc: 50.4; ppl: 159.1; xent: 5.1; lr: 0.00162; sents:    1975; bsz:  229/ 163/ 6; 859/612 tok/s;  79703 sec;
[2023-05-28 17:29:36,428 INFO] Step 9340/10000; acc: 50.7; ppl: 157.3; xent: 5.1; lr: 0.00162; sents:    2070; bsz:  230/ 165/ 6; 865/622 tok/s;  79788 sec;
[2023-05-28 17:31:02,424 INFO] Step 9350/10000; acc: 50.2; ppl: 162.0; xent: 5.1; lr: 0.00162; sents:    1946; bsz:  228/ 163/ 6; 849/608 tok/s;  79874 sec;
[2023-05-28 17:32:27,756 INFO] Step 9360/10000; acc: 50.4; ppl: 161.8; xent: 5.1; lr: 0.00161; sents:    2065; bsz:  232/ 166/ 6; 868/621 tok/s;  79959 sec;
[2023-05-28 17:33:52,313 INFO] Step 9370/10000; acc: 50.2; ppl: 162.5; xent: 5.1; lr: 0.00161; sents:    1986; bsz:  230/ 164/ 6; 869/620 tok/s;  80044 sec;
[2023-05-28 17:35:17,813 INFO] Step 9380/10000; acc: 50.6; ppl: 159.2; xent: 5.1; lr: 0.00161; sents:    2047; bsz:  231/ 165/ 6; 864/618 tok/s;  80129 sec;
[2023-05-28 17:36:43,670 INFO] Step 9390/10000; acc: 50.1; ppl: 162.3; xent: 5.1; lr: 0.00161; sents:    1912; bsz:  230/ 162/ 6; 858/605 tok/s;  80215 sec;
[2023-05-28 17:38:08,693 INFO] Step 9400/10000; acc: 50.2; ppl: 161.2; xent: 5.1; lr: 0.00161; sents:    1979; bsz:  229/ 162/ 6; 863/611 tok/s;  80300 sec;
[2023-05-28 17:38:08,693 INFO] Train perplexity: 222.891
[2023-05-28 17:38:08,693 INFO] Train accuracy: 45.2191
[2023-05-28 17:38:08,693 INFO] Sentences processed: 1.84212e+06
[2023-05-28 17:38:08,693 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 17:38:08,762 INFO] Saving checkpoint finetuned/gez_nllb_step_9400.pt
[2023-05-28 17:39:35,610 INFO] Step 9410/10000; acc: 50.7; ppl: 156.3; xent: 5.1; lr: 0.00161; sents:    2009; bsz:  230/ 164/ 6; 847/604 tok/s;  80387 sec;
[2023-05-28 17:41:00,534 INFO] Step 9420/10000; acc: 50.0; ppl: 164.1; xent: 5.1; lr: 0.00161; sents:    1909; bsz:  229/ 162/ 6; 864/611 tok/s;  80472 sec;
[2023-05-28 17:42:25,895 INFO] Step 9430/10000; acc: 50.2; ppl: 161.5; xent: 5.1; lr: 0.00161; sents:    1935; bsz:  229/ 163/ 6; 860/610 tok/s;  80557 sec;
[2023-05-28 17:43:50,192 INFO] Step 9440/10000; acc: 50.2; ppl: 162.1; xent: 5.1; lr: 0.00161; sents:    1961; bsz:  230/ 163/ 6; 873/618 tok/s;  80641 sec;
[2023-05-28 17:45:14,933 INFO] Step 9450/10000; acc: 50.5; ppl: 159.6; xent: 5.1; lr: 0.00161; sents:    1986; bsz:  229/ 164/ 6; 865/620 tok/s;  80726 sec;
[2023-05-28 17:46:40,153 INFO] Step 9460/10000; acc: 50.4; ppl: 160.2; xent: 5.1; lr: 0.00161; sents:    1907; bsz:  229/ 163/ 6; 861/611 tok/s;  80811 sec;
[2023-05-28 17:48:05,543 INFO] Step 9470/10000; acc: 50.4; ppl: 160.0; xent: 5.1; lr: 0.00161; sents:    1978; bsz:  228/ 162/ 6; 854/609 tok/s;  80897 sec;
[2023-05-28 17:49:30,664 INFO] Step 9480/10000; acc: 50.2; ppl: 161.0; xent: 5.1; lr: 0.00160; sents:    1996; bsz:  230/ 165/ 6; 866/621 tok/s;  80982 sec;
[2023-05-28 17:50:56,149 INFO] Step 9490/10000; acc: 50.2; ppl: 161.4; xent: 5.1; lr: 0.00160; sents:    1971; bsz:  231/ 163/ 6; 864/610 tok/s;  81067 sec;
[2023-05-28 17:52:20,552 INFO] Step 9500/10000; acc: 50.9; ppl: 156.0; xent: 5.0; lr: 0.00160; sents:    2042; bsz:  231/ 165/ 6; 877/625 tok/s;  81152 sec;
[2023-05-28 17:52:20,552 INFO] Train perplexity: 222.116
[2023-05-28 17:52:20,552 INFO] Train accuracy: 45.2734
[2023-05-28 17:52:20,552 INFO] Sentences processed: 1.86182e+06
[2023-05-28 17:52:20,552 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 17:52:20,619 INFO] Saving checkpoint finetuned/gez_nllb_step_9500.pt
[2023-05-28 17:53:46,684 INFO] Step 9510/10000; acc: 50.1; ppl: 162.5; xent: 5.1; lr: 0.00160; sents:    1984; bsz:  231/ 164/ 6; 857/609 tok/s;  81238 sec;
[2023-05-28 17:55:12,135 INFO] Step 9520/10000; acc: 50.3; ppl: 160.5; xent: 5.1; lr: 0.00160; sents:    1938; bsz:  230/ 162/ 6; 861/608 tok/s;  81323 sec;
[2023-05-28 17:56:36,638 INFO] Step 9530/10000; acc: 50.1; ppl: 163.6; xent: 5.1; lr: 0.00160; sents:    1959; bsz:  231/ 162/ 6; 875/615 tok/s;  81408 sec;
[2023-05-28 17:58:01,589 INFO] Step 9540/10000; acc: 50.6; ppl: 156.1; xent: 5.1; lr: 0.00160; sents:    1939; bsz:  230/ 163/ 6; 866/615 tok/s;  81493 sec;
[2023-05-28 17:59:27,034 INFO] Step 9550/10000; acc: 50.1; ppl: 162.1; xent: 5.1; lr: 0.00160; sents:    1905; bsz:  231/ 164/ 6; 866/615 tok/s;  81578 sec;
[2023-05-28 18:00:52,543 INFO] Step 9560/10000; acc: 50.0; ppl: 163.1; xent: 5.1; lr: 0.00160; sents:    1979; bsz:  230/ 164/ 6; 861/612 tok/s;  81664 sec;
[2023-05-28 18:02:17,929 INFO] Step 9570/10000; acc: 49.9; ppl: 164.4; xent: 5.1; lr: 0.00160; sents:    1953; bsz:  230/ 163/ 6; 863/612 tok/s;  81749 sec;
[2023-05-28 18:03:42,657 INFO] Step 9580/10000; acc: 50.5; ppl: 157.7; xent: 5.1; lr: 0.00160; sents:    2006; bsz:  230/ 165/ 6; 867/622 tok/s;  81834 sec;
[2023-05-28 18:05:07,579 INFO] Step 9590/10000; acc: 50.3; ppl: 162.1; xent: 5.1; lr: 0.00160; sents:    2024; bsz:  231/ 165/ 6; 871/621 tok/s;  81919 sec;
[2023-05-28 18:06:32,705 INFO] Step 9600/10000; acc: 50.0; ppl: 164.6; xent: 5.1; lr: 0.00159; sents:    1922; bsz:  229/ 162/ 6; 862/610 tok/s;  82004 sec;
[2023-05-28 18:06:32,705 INFO] Train perplexity: 221.38
[2023-05-28 18:06:32,705 INFO] Train accuracy: 45.3247
[2023-05-28 18:06:32,705 INFO] Sentences processed: 1.88143e+06
[2023-05-28 18:06:32,705 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 18:06:32,775 INFO] Saving checkpoint finetuned/gez_nllb_step_9600.pt
[2023-05-28 18:07:59,194 INFO] Step 9610/10000; acc: 49.9; ppl: 164.8; xent: 5.1; lr: 0.00159; sents:    1895; bsz:  230/ 162/ 6; 850/598 tok/s;  82090 sec;
[2023-05-28 18:09:24,863 INFO] Step 9620/10000; acc: 50.1; ppl: 162.6; xent: 5.1; lr: 0.00159; sents:    1912; bsz:  230/ 163/ 6; 859/608 tok/s;  82176 sec;
[2023-05-28 18:10:50,634 INFO] Step 9630/10000; acc: 50.2; ppl: 161.5; xent: 5.1; lr: 0.00159; sents:    1958; bsz:  230/ 163/ 6; 858/607 tok/s;  82262 sec;
[2023-05-28 18:12:15,182 INFO] Step 9640/10000; acc: 50.1; ppl: 163.5; xent: 5.1; lr: 0.00159; sents:    1942; bsz:  229/ 162/ 6; 868/613 tok/s;  82346 sec;
[2023-05-28 18:13:40,566 INFO] Step 9650/10000; acc: 50.4; ppl: 161.4; xent: 5.1; lr: 0.00159; sents:    2000; bsz:  230/ 162/ 6; 861/607 tok/s;  82432 sec;
[2023-05-28 18:15:06,238 INFO] Step 9660/10000; acc: 50.4; ppl: 160.4; xent: 5.1; lr: 0.00159; sents:    1955; bsz:  230/ 163/ 6; 860/608 tok/s;  82517 sec;
[2023-05-28 18:16:32,315 INFO] Step 9670/10000; acc: 50.2; ppl: 162.8; xent: 5.1; lr: 0.00159; sents:    1950; bsz:  230/ 163/ 6; 854/606 tok/s;  82604 sec;
[2023-05-28 18:17:57,754 INFO] Step 9680/10000; acc: 50.4; ppl: 161.3; xent: 5.1; lr: 0.00159; sents:    2035; bsz:  230/ 163/ 6; 862/611 tok/s;  82689 sec;
[2023-05-28 18:19:23,437 INFO] Step 9690/10000; acc: 50.7; ppl: 157.5; xent: 5.1; lr: 0.00159; sents:    1931; bsz:  229/ 164/ 6; 856/612 tok/s;  82775 sec;
[2023-05-28 18:20:49,180 INFO] Step 9700/10000; acc: 50.2; ppl: 162.9; xent: 5.1; lr: 0.00159; sents:    2033; bsz:  230/ 165/ 6; 859/614 tok/s;  82860 sec;
[2023-05-28 18:20:49,180 INFO] Train perplexity: 220.668
[2023-05-28 18:20:49,180 INFO] Train accuracy: 45.3754
[2023-05-28 18:20:49,180 INFO] Sentences processed: 1.90104e+06
[2023-05-28 18:20:49,180 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 18:20:49,308 INFO] Saving checkpoint finetuned/gez_nllb_step_9700.pt
[2023-05-28 18:22:15,926 INFO] Step 9710/10000; acc: 49.7; ppl: 167.3; xent: 5.1; lr: 0.00159; sents:    1868; bsz:  229/ 161/ 6; 847/593 tok/s;  82947 sec;
[2023-05-28 18:23:41,081 INFO] Step 9720/10000; acc: 49.6; ppl: 167.5; xent: 5.1; lr: 0.00158; sents:    1841; bsz:  228/ 160/ 6; 856/601 tok/s;  83032 sec;
[2023-05-28 18:25:07,240 INFO] Step 9730/10000; acc: 50.4; ppl: 159.5; xent: 5.1; lr: 0.00158; sents:    2025; bsz:  230/ 164/ 6; 854/610 tok/s;  83118 sec;
[2023-05-28 18:26:32,401 INFO] Step 9740/10000; acc: 51.0; ppl: 152.6; xent: 5.0; lr: 0.00158; sents:    2068; bsz:  230/ 165/ 6; 866/619 tok/s;  83204 sec;
[2023-05-28 18:27:58,751 INFO] Step 9750/10000; acc: 50.8; ppl: 156.0; xent: 5.0; lr: 0.00158; sents:    2049; bsz:  231/ 164/ 6; 855/609 tok/s;  83290 sec;
[2023-05-28 18:29:23,735 INFO] Step 9760/10000; acc: 50.1; ppl: 161.4; xent: 5.1; lr: 0.00158; sents:    1927; bsz:  230/ 161/ 6; 865/607 tok/s;  83375 sec;
[2023-05-28 18:30:49,932 INFO] Step 9770/10000; acc: 51.0; ppl: 155.3; xent: 5.0; lr: 0.00158; sents:    2102; bsz:  232/ 166/ 7; 861/616 tok/s;  83461 sec;
[2023-05-28 18:32:14,138 INFO] Step 9780/10000; acc: 49.9; ppl: 162.9; xent: 5.1; lr: 0.00158; sents:    1799; bsz:  229/ 162/ 6; 870/615 tok/s;  83545 sec;
[2023-05-28 18:33:39,917 INFO] Step 9790/10000; acc: 50.0; ppl: 163.2; xent: 5.1; lr: 0.00158; sents:    1880; bsz:  229/ 161/ 6; 854/600 tok/s;  83631 sec;
[2023-05-28 18:35:05,244 INFO] Step 9800/10000; acc: 50.1; ppl: 162.4; xent: 5.1; lr: 0.00158; sents:    1943; bsz:  229/ 162/ 6; 860/607 tok/s;  83716 sec;
[2023-05-28 18:35:05,244 INFO] Train perplexity: 219.957
[2023-05-28 18:35:05,244 INFO] Train accuracy: 45.4251
[2023-05-28 18:35:05,244 INFO] Sentences processed: 1.92054e+06
[2023-05-28 18:35:05,244 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 18:35:05,313 INFO] Saving checkpoint finetuned/gez_nllb_step_9800.pt
[2023-05-28 18:36:31,778 INFO] Step 9810/10000; acc: 50.3; ppl: 159.2; xent: 5.1; lr: 0.00158; sents:    1913; bsz:  229/ 162/ 6; 846/599 tok/s;  83803 sec;
[2023-05-28 18:37:56,514 INFO] Step 9820/10000; acc: 50.3; ppl: 162.3; xent: 5.1; lr: 0.00158; sents:    1916; bsz:  230/ 162/ 6; 870/611 tok/s;  83888 sec;
[2023-05-28 18:39:22,406 INFO] Step 9830/10000; acc: 50.5; ppl: 159.1; xent: 5.1; lr: 0.00158; sents:    1883; bsz:  228/ 161/ 6; 850/602 tok/s;  83974 sec;
[2023-05-28 18:40:47,065 INFO] Step 9840/10000; acc: 50.5; ppl: 158.3; xent: 5.1; lr: 0.00158; sents:    1999; bsz:  231/ 164/ 6; 872/620 tok/s;  84058 sec;
[2023-05-28 18:42:13,503 INFO] Step 9850/10000; acc: 49.6; ppl: 165.8; xent: 5.1; lr: 0.00157; sents:    1915; bsz:  230/ 162/ 6; 851/601 tok/s;  84145 sec;
[2023-05-28 18:43:38,606 INFO] Step 9860/10000; acc: 50.1; ppl: 160.8; xent: 5.1; lr: 0.00157; sents:    1896; bsz:  230/ 162/ 6; 863/609 tok/s;  84230 sec;
[2023-05-28 18:45:05,106 INFO] Step 9870/10000; acc: 50.6; ppl: 157.2; xent: 5.1; lr: 0.00157; sents:    2007; bsz:  230/ 164/ 6; 852/608 tok/s;  84316 sec;
[2023-05-28 18:46:29,570 INFO] Step 9880/10000; acc: 50.6; ppl: 157.7; xent: 5.1; lr: 0.00157; sents:    2023; bsz:  231/ 163/ 6; 874/619 tok/s;  84401 sec;
[2023-05-28 18:47:55,503 INFO] Step 9890/10000; acc: 49.5; ppl: 167.2; xent: 5.1; lr: 0.00157; sents:    1862; bsz:  229/ 161/ 6; 853/600 tok/s;  84487 sec;
[2023-05-28 18:49:20,457 INFO] Step 9900/10000; acc: 49.9; ppl: 165.8; xent: 5.1; lr: 0.00157; sents:    1942; bsz:  230/ 164/ 6; 868/617 tok/s;  84572 sec;
[2023-05-28 18:49:20,457 INFO] Train perplexity: 219.271
[2023-05-28 18:49:20,457 INFO] Train accuracy: 45.4731
[2023-05-28 18:49:20,457 INFO] Sentences processed: 1.9399e+06
[2023-05-28 18:49:20,457 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 18:49:20,528 INFO] Saving checkpoint finetuned/gez_nllb_step_9900.pt
[2023-05-28 18:50:47,472 INFO] Step 9910/10000; acc: 50.7; ppl: 157.8; xent: 5.1; lr: 0.00157; sents:    2030; bsz:  230/ 163/ 6; 847/598 tok/s;  84659 sec;
[2023-05-28 18:52:12,491 INFO] Step 9920/10000; acc: 49.7; ppl: 166.5; xent: 5.1; lr: 0.00157; sents:    1903; bsz:  230/ 161/ 6; 866/607 tok/s;  84744 sec;
[2023-05-28 18:53:38,553 INFO] Step 9930/10000; acc: 50.4; ppl: 158.2; xent: 5.1; lr: 0.00157; sents:    1969; bsz:  230/ 164/ 6; 857/610 tok/s;  84830 sec;
[2023-05-28 18:55:03,763 INFO] Step 9940/10000; acc: 49.8; ppl: 164.8; xent: 5.1; lr: 0.00157; sents:    1945; bsz:  231/ 163/ 6; 866/613 tok/s;  84915 sec;
[2023-05-28 18:56:29,473 INFO] Step 9950/10000; acc: 50.8; ppl: 153.9; xent: 5.0; lr: 0.00157; sents:    1928; bsz:  229/ 161/ 6; 855/602 tok/s;  85001 sec;
[2023-05-28 18:57:54,699 INFO] Step 9960/10000; acc: 50.5; ppl: 159.4; xent: 5.1; lr: 0.00157; sents:    1943; bsz:  231/ 164/ 6; 868/615 tok/s;  85086 sec;
[2023-05-28 18:59:20,309 INFO] Step 9970/10000; acc: 50.2; ppl: 161.0; xent: 5.1; lr: 0.00156; sents:    1907; bsz:  229/ 161/ 6; 858/603 tok/s;  85172 sec;
[2023-05-28 19:00:46,041 INFO] Step 9980/10000; acc: 49.9; ppl: 166.5; xent: 5.1; lr: 0.00156; sents:    1949; bsz:  230/ 162/ 6; 857/606 tok/s;  85257 sec;
[2023-05-28 19:02:11,766 INFO] Step 9990/10000; acc: 49.9; ppl: 165.5; xent: 5.1; lr: 0.00156; sents:    1958; bsz:  231/ 164/ 6; 861/612 tok/s;  85343 sec;
[2023-05-28 19:03:36,711 INFO] Step 10000/10000; acc: 50.6; ppl: 157.8; xent: 5.1; lr: 0.00156; sents:    1997; bsz:  231/ 163/ 6; 869/614 tok/s;  85428 sec;
[2023-05-28 19:03:36,711 INFO] Train perplexity: 218.598
[2023-05-28 19:03:36,711 INFO] Train accuracy: 45.5208
[2023-05-28 19:03:36,711 INFO] Sentences processed: 1.95942e+06
[2023-05-28 19:03:36,711 INFO] Average bsz:  230/ 163/ 6
[2023-05-28 19:03:36,790 INFO] Saving checkpoint finetuned/gez_nllb_step_10000.pt

Training went well, but at inference the translations are not what I expected. I checked checkpoints at different steps: the output was reasonable up to around step 500, but from there the quality deteriorates and the model starts inserting Latin characters. By the end of training (10000 steps) the output is mostly meaningless Latin characters with a few Ge'ez words. I also translated sentences from the training data itself, and the output is just as bad. It could be a matter of training-data size, but I'm confused why the translations are this poor even on the training data, given that training accuracy kept improving. Here is a sample translation:

ሰደበት ሃለበት ለዓመታት ለዓመታት ለዓለም።
ቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈቈ ።
ã እግåãብሔርರዕዕ በትሮተ አሮን እምቅድመ ምስክር ወያዕቅቦ ከመ ምልክት against rebellen ወታዕቅቦ ከንጐርጐረዎ ከንጐርጐረየ ከመ አይሞቱ ።
ወbɛተዋሂተ እም Moab ወbɛተዋሂተ ମድያንም ሰጠቱ ሰጠተ ምግምተ ወbɛተዋሂተ ወbɛተዋሂተ ምግምተ ወbɛተዋሂተ ወbɛተዋዕቱ ለቈላመ ወbɛተዋዕቱ ለቈላመ ወbɛተዋዕቱ ለቈላመ ።
ጸውዕከ ጸድኦ ለዓለምን፤ ወቈጽዓከ ጸድኦ ለዓለምን።
ተነግರ pɛ pɛ char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char char
ከመ ይፈልሕ ቀስ፤ ከመ ይፈልሕ ፀይፍ በመዋኢ። ከመ ይፈልሕ ጸይፍ በመዋኢ።
ወይደሰት ምድር ሰናብታ until ሃለወት ወمتክኑ አንትሙ በመዋዕል ፀሩ ወمتካዕ ምድር ሰናብታ ወمتካዕ ሰናብታ ።
ጸሓፉነ እምነቢተ ሰሌተ እምነቢተ እምነቢተ እምነቢተ እምነቢተ እምነቢተ እምነቢተ እምነቢተ እምነቢተ እምነቢተ ።
sā sā sā sāដዕከ ሰሌዕከ ወጻዕዕዕከ ሰሌዕ ወቈዕከ ሰሌዕ ወቈዕከ ሰሌዕ ወቈዕከ ሰሌዕ ።
እግåãብሔር ዕቱብ በመቅደሳ፤ ወمتዕዝነ ተሣህለ ተሣህል።
voici ቃለ ሕጉ ዘ Orderኦ እግåãብሔርേ Moses ከመ ይገብره pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ pɛ mokoбобоብ ዘገብرهوب ።
tawaዕቅቦ ያመልስዎ ወይ whyብሮ പരድኦ until year yamቱተ ወይወፅኦ በዓላት yamቱተ ወይዕቅቦ እምርስቱ ።
uziዕሮ አሮና Kuli ዳዊት አእንግሊተ ንጉሥ ወይዕብዮ ዝወደቦ እåãዕ እåãዕዝ ለእሥራት ወቈርቈተ እåãዕዝ ወቈርቈተ እåãዕዝ እåãዕዝ ለእ wood ።
From-30 years until $560 years 

Probably overfitting. Again, your dataset size is key.

I will retry with a higher dropout, applied from step 0 instead of on a step schedule, changing this:

dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]

to this:

dropout_steps: [0]
dropout: [0.3]
attention_dropout: [0.3]
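
For reference, here is a rough Python sketch (illustrative only, not the actual OpenNMT-py implementation) of how such a schedule can be read: the dropout in effect at a given step is the value of the last dropout_steps threshold already reached, so the revised settings apply 0.3 from step 0 onwards.

def dropout_at(step, dropout_steps, dropout_values):
    # Keep the value of the last threshold that has already been reached.
    value = dropout_values[0]
    for threshold, v in zip(dropout_steps, dropout_values):
        if step >= threshold:
            value = v
    return value

# Original schedule: 0.1 at every listed step, i.e. effectively constant.
print(dropout_at(9000, [0, 15000, 30000], [0.1, 0.1, 0.1]))  # 0.1
# Revised schedule: a single 0.3 applied from step 0 onwards.
print(dropout_at(9000, [0], [0.3]))                          # 0.3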

One more question, please: is this the right placement for the language tokens? I thought the target-language token should go at the start of the source sentence, but in my setup the source-language token is prepended to the source and the target-language token to the target.

For instance, consider the following English→Spanish pair of sentences: Hello, how are you? → Hola, ¿cómo estás? It will be modified to: <2es> Hello, how are you? → Hola, ¿cómo estás? to indicate that Spanish is the target language. The source language is not specified but the model will learn this automatically.
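
To make the difference concrete, here is a minimal Python sketch (illustrative only, not OpenNMT-py code; the example sentences are hypothetical) contrasting the prefix/suffix placement reported in the training log above with the <2es>-style tagging from the quoted example:

# Prefix/suffix values as reported in the training log above.
src_prefix, src_suffix = "eng_Latn", "</s>"
tgt_prefix, tgt_suffix = "gez_Ethi", ""

src = "Hello, how are you?"   # hypothetical English source
tgt = "ሰላም ..."               # hypothetical Ge'ez reference (placeholder)

# Current setup: source-language token on the source side,
# target-language token on the target side.
print(f"{src_prefix} {src} {src_suffix}".strip())  # eng_Latn Hello, how are you? </s>
print(f"{tgt_prefix} {tgt} {tgt_suffix}".strip())  # gez_Ethi ሰላም ...

# <2es>-style tagging from the quoted example: the *target*-language token
# is prepended to the source, and the target sentence is left untouched.
print(f"{tgt_prefix} {src}")                       # gez_Ethi Hello, how are you?
print(tgt)                                         # ሰላም ...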