I have followed those steps, but I keep getting very high perplexity values at the beginning of training. Does that make sense? I am using the flores200 test set just to check whether the fine-tuning works.
I use the following config file:
# Vocab options
share_vocab: true
# Where the vocab(s) is
src_vocab: "dictionary.txt"
tgt_vocab: "dictionary.txt"
src_words_min_frequency: 1
src_vocab_size: 257000
tgt_words_min_frequency: 1
tgt_vocab_size: 257000
src_vocab_multiple: 8
save_data: "nllb-200"

# Corpus opts:
data:
    corpus_1:
        path_src: "flores200_dataset/devtest/eng_Latn.devtest"
        path_tgt: "flores200_dataset/devtest/spa_Latn.devtest"
        transforms: [sentencepiece, prefix, suffix, filtertoolong]
        weight: 10
        src_prefix: ""
        tgt_prefix: "spa_Latn"
        src_suffix: " eng_Latn"
        tgt_suffix: ""

# Subword
src_subword_model: "flores200_sacrebleu_tokenizer_spm.model"
tgt_subword_model: "flores200_sacrebleu_tokenizer_spm.model"

# General opts
save_model: "trained_models/nllb-200-600M-onmt-1"
train_from: "nllb-200-600M-onmt.pt"
update_vocab: true
reset_optim: all

# Model
encoder_type: transformer
decoder_type: transformer
enc_layers: 12
dec_layers: 12
heads: 16
hidden_size: 1024
word_vec_size: 1024
transformer_ff: 4096
dropout_steps: [0, 15000, 30000]
dropout: [0.1, 0.1, 0.1]
attention_dropout: [0.1, 0.1, 0.1]
share_decoder_embeddings: true
share_embeddings: true
position_encoding: true
position_encoding_type: "SinusoidalConcat"
decoder_start_token: ""

# NEW OPTIONS
# Filter
src_seq_length: 200
tgt_seq_length: 200
report_every: 1
train_steps: 2500
valid_steps: 500
save_checkpoint_steps: 250
log_file: "train.log"

# Batching
bucket_size: 262144
world_size: 1
gpu_ranks: [0]
num_workers: 1
batch_type: "tokens"
batch_size: 1024
valid_batch_size: 2048
batch_size_multiple: 4
accum_count: [12]
accum_steps: [0]

# Optimization
optim: "sgd"
learning_rate: 0.05
label_smoothing: 0.1
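For context on what corpus_1 feeds the model: as far as I understand it, the sentencepiece transform subword-tokenizes both sides with the flores200 SPM model, and the prefix/suffix transforms then add the language tags (target prefixed with spa_Latn, source suffixed with eng_Latn). A rough sketch of that, done with the sentencepiece library directly rather than through OpenNMT-py (the example sentence pair is made up):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="flores200_sacrebleu_tokenizer_spm.model")

# made-up sentence pair, just for illustration
src = "The book is on the table."
tgt = "El libro está sobre la mesa."

# sentencepiece transform: subword-tokenize both sides
src_tok = sp.encode(src, out_type=str)
tgt_tok = sp.encode(tgt, out_type=str)

# prefix/suffix transforms for corpus_1: src_suffix " eng_Latn", tgt_prefix "spa_Latn"
src_example = src_tok + ["eng_Latn"]
tgt_example = ["spa_Latn"] + tgt_tok

print(" ".join(src_example))
print(" ".join(tgt_example))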
I run training with the following command:
onmt_train -config config.yaml
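(As a side note, a small sanity check of my own, not from the tutorial: counting the entries in dictionary.txt and relating them to the vocab size the training log reports below. Because of src_vocab_multiple: 8, the reported size is padded up to a multiple of 8, so the raw count plus the special tokens added by the transforms should land at or just below that figure.)

# optional sanity check (mine, not part of the tutorial): count vocab entries
with open("dictionary.txt", encoding="utf-8") as f:
    n_entries = sum(1 for line in f if line.strip())

print(n_entries)            # should be close to the "* src vocab size" in the log
print(256208 % 8 == 0)      # True: the logged size is padded to a multiple of 8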
And I am getting the following training logs:
[2023-02-22 08:25:04,273 INFO] Parsed 1 corpora from -data.
[2023-02-22 08:25:04,274 INFO] Loading checkpoint from nllb-200-600M-onmt.pt
[2023-02-22 08:25:06,588 WARNING] configured transforms is different from checkpoint: +{'suffix', 'sentencepiece', 'prefix'}
[2023-02-22 08:25:06,588 INFO] Get suffix for corpus_1: {'src': ' eng_Latn', 'tgt': ''}
[2023-02-22 08:25:06,588 INFO] Get prefix for corpus_1: {'src': '', 'tgt': 'spa_Latn'}
[2023-02-22 08:25:06,588 INFO] Get prefix for src infer:
[2023-02-22 08:25:06,588 INFO] Get prefix for tgt infer:
[2023-02-22 08:25:06,588 INFO] Get special vocabs from Transforms: {'src': ['eng_Latn'], 'tgt': ['spa_Latn']}.
[2023-02-22 08:25:07,440 INFO] Updating checkpoint vocabulary with new vocabulary
[2023-02-22 08:25:07,443 INFO] Get suffix for corpus_1: {'src': ' eng_Latn', 'tgt': ''}
[2023-02-22 08:25:07,446 INFO] Get prefix for corpus_1: {'src': '', 'tgt': 'spa_Latn'}
[2023-02-22 08:25:07,449 INFO] Get prefix for src infer:
[2023-02-22 08:25:07,452 INFO] Get prefix for tgt infer:
[2023-02-22 08:25:07,455 INFO] Get special vocabs from Transforms: {'src': ['eng_Latn'], 'tgt': ['spa_Latn']}.
[2023-02-22 08:25:08,435 INFO] Building model...
[2023-02-22 08:25:28,822 INFO] Updating vocabulary embeddings with checkpoint embeddings
[2023-02-22 08:32:45,470 INFO] src: 2 new tokens
[2023-02-22 08:40:24,422 INFO] tgt: 2 new tokens
[2023-02-22 08:40:29,150 INFO] NMTModel(
(encoder): TransformerEncoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(256208, 1024, padding_idx=1)
)
(pe): PositionalEncoding()
)
(dropout): Dropout(p=0.1, inplace=False)
)
(transformer): ModuleList(
(0): TransformerEncoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
(linear_values): Linear(in_features=1024, out_features=1024, bias=True)
(linear_query): Linear(in_features=1024, out_features=1024, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=1024, out_features=1024, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=1024, out_features=4096, bias=True)
(w_2): Linear(in_features=4096, out_features=1024, bias=True)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
...
(11): TransformerEncoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
(linear_values): Linear(in_features=1024, out_features=1024, bias=True)
(linear_query): Linear(in_features=1024, out_features=1024, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=1024, out_features=1024, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=1024, out_features=4096, bias=True)
(w_2): Linear(in_features=4096, out_features=1024, bias=True)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(dropout): Dropout(p=0.1, inplace=False)
)
)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
)
(decoder): TransformerDecoder(
(embeddings): Embeddings(
(make_embedding): Sequential(
(emb_luts): Elementwise(
(0): Embedding(256208, 1024, padding_idx=1)
)
(pe): PositionalEncoding()
)
(dropout): Dropout(p=0.1, inplace=False)
)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(transformer_layers): ModuleList(
(0): TransformerDecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
(linear_values): Linear(in_features=1024, out_features=1024, bias=True)
(linear_query): Linear(in_features=1024, out_features=1024, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=1024, out_features=1024, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=1024, out_features=4096, bias=True)
(w_2): Linear(in_features=4096, out_features=1024, bias=True)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm_1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(drop): Dropout(p=0.1, inplace=False)
(context_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
(linear_values): Linear(in_features=1024, out_features=1024, bias=True)
(linear_query): Linear(in_features=1024, out_features=1024, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=1024, out_features=1024, bias=True)
)
(layer_norm_2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
)
...
(11): TransformerDecoderLayer(
(self_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
(linear_values): Linear(in_features=1024, out_features=1024, bias=True)
(linear_query): Linear(in_features=1024, out_features=1024, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=1024, out_features=1024, bias=True)
)
(feed_forward): PositionwiseFeedForward(
(w_1): Linear(in_features=1024, out_features=4096, bias=True)
(w_2): Linear(in_features=4096, out_features=1024, bias=True)
(layer_norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(dropout_1): Dropout(p=0.1, inplace=False)
(dropout_2): Dropout(p=0.1, inplace=False)
)
(layer_norm_1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(drop): Dropout(p=0.1, inplace=False)
(context_attn): MultiHeadedAttention(
(linear_keys): Linear(in_features=1024, out_features=1024, bias=True)
(linear_values): Linear(in_features=1024, out_features=1024, bias=True)
(linear_query): Linear(in_features=1024, out_features=1024, bias=True)
(softmax): Softmax(dim=-1)
(dropout): Dropout(p=0.1, inplace=False)
(final_linear): Linear(in_features=1024, out_features=1024, bias=True)
)
(layer_norm_2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
)
)
)
(generator): Linear(in_features=1024, out_features=256208, bias=True)
)
[2023-02-22 08:40:29,160 INFO] encoder: 413513728
[2023-02-22 08:40:29,160 INFO] decoder: 201818320
[2023-02-22 08:40:29,160 INFO] * number of parameters: 615332048
[2023-02-22 08:40:29,160 INFO] * src vocab size = 256208
[2023-02-22 08:40:29,160 INFO] * tgt vocab size = 256208
[2023-02-22 08:40:29,163 INFO] Get suffix for corpus_1: {'src': ' eng_Latn', 'tgt': ''}
[2023-02-22 08:40:29,369 INFO] Get prefix for corpus_1: {'src': '', 'tgt': 'spa_Latn'}
[2023-02-22 08:40:29,369 INFO] Get prefix for src infer:
[2023-02-22 08:40:29,369 INFO] Get prefix for tgt infer:
[2023-02-22 08:40:29,369 INFO] Get suffix for corpus_1: {'src': ' eng_Latn', 'tgt': ''}
[2023-02-22 08:40:29,546 INFO] Get prefix for corpus_1: {'src': '', 'tgt': 'spa_Latn'}
[2023-02-22 08:40:29,546 INFO] Get prefix for src infer:
[2023-02-22 08:40:29,546 INFO] Get prefix for tgt infer:
[2023-02-22 08:40:29,676 INFO] Starting training on GPU: [0]
[2023-02-22 08:40:29,676 INFO] Start training loop without validation...
[2023-02-22 08:40:29,676 INFO] Scoring with: TransformPipe()
[2023-02-22 08:41:46,168 INFO] Step 1/ 2500; acc: 1.3; ppl: 102504.3; xent: 11.5; lr: 0.05000; sents: 268; bsz: 727/ 920/22; 114/144 tok/s; 76 sec;
[2023-02-22 08:41:49,316 INFO] Step 2/ 2500; acc: 3.6; ppl: 9609.7; xent: 9.2; lr: 0.05000; sents: 284; bsz: 727/ 928/24; 2770/3538 tok/s; 80 sec;
[2023-02-22 08:41:52,439 INFO] Step 3/ 2500; acc: 4.2; ppl: 6028.7; xent: 8.7; lr: 0.05000; sents: 292; bsz: 726/ 918/24; 2789/3529 tok/s; 83 sec;
[2023-02-22 08:41:55,574 INFO] Step 4/ 2500; acc: 5.0; ppl: 4079.2; xent: 8.3; lr: 0.05000; sents: 296; bsz: 730/ 922/25; 2796/3529 tok/s; 86 sec;
[2023-02-22 08:41:58,978 INFO] Step 5/ 2500; acc: 7.8; ppl: 1338.7; xent: 7.2; lr: 0.05000; sents: 284; bsz: 744/ 937/24; 2624/3303 tok/s; 89 sec;
[2023-02-22 08:42:02,087 INFO] Step 6/ 2500; acc: 11.9; ppl: 687.4; xent: 6.5; lr: 0.05000; sents: 280; bsz: 728/ 920/23; 2811/3551 tok/s; 92 sec;
[2023-02-22 08:42:05,226 INFO] Step 7/ 2500; acc: 18.3; ppl: 363.5; xent: 5.9; lr: 0.05000; sents: 280; bsz: 730/ 934/23; 2793/3572 tok/s; 96 sec;
[2023-02-22 08:42:08,374 INFO] Step 8/ 2500; acc: 24.1; ppl: 243.5; xent: 5.5; lr: 0.05000; sents: 316; bsz: 738/ 922/26; 2813/3515 tok/s; 99 sec;
[2023-02-22 08:42:11,476 INFO] Step 9/ 2500; acc: 24.1; ppl: 233.5; xent: 5.5; lr: 0.05000; sents: 272; bsz: 725/ 912/23; 2805/3530 tok/s; 102 sec;
[2023-02-22 08:42:14,585 INFO] Step 10/ 2500; acc: 27.9; ppl: 186.3; xent: 5.2; lr: 0.05000; sents: 288; bsz: 733/ 926/24; 2829/3575 tok/s; 105 sec;
[2023-02-22 08:42:17,623 INFO] Step 11/ 2500; acc: 29.3; ppl: 162.3; xent: 5.1; lr: 0.05000; sents: 264; bsz: 719/ 908/22; 2842/3589 tok/s; 108 sec;
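For reference when reading these numbers: the reported ppl is just exp(xent), and with a shared vocabulary of 256208 tokens a near-uniform prediction would already sit around exp(12.45) ≈ 256k, so the step-1 value of ~100k is in that ballpark. A quick check of the arithmetic (my own, not part of the run):

import math

# step 1 logs ppl 102504.3 and xent 11.5; ppl is exp(xent), so these agree
# up to the rounding of xent in the log
print(math.exp(11.5))        # ~98716
print(math.log(102504.3))    # ~11.54

# a uniform guess over the shared 256208-token vocabulary would give
# xent = ln(256208) ~ 12.45, i.e. a ppl equal to the vocab size
print(math.log(256208))      # ~12.45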