Hi! I trained a base Transformer model (twice) and got BLEU scores of 8.7 and 7.6. I was hoping to get somewhere in the 30s.
DE → EN
- Learned BPE on the concatenated train sets (shared codes). Applied BPE to the train, valid, and test sets with my own scripts, i.e. not using OpenNMT transforms.
- Used onmt_build_vocab to generate a shared vocabulary of size 32k.
- Trained the base model configuration with the Noam schedule up to 100,000 steps.
- Used checkpoint ensembling for translation.
- Converted BPE subword translations → tokenized translations → detokenized translations using scripts.
- Scored with sacreBLEU between test.eng and the detokenized output from the previous step.
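The subword → tokenized step above boils down to reversing the BPE joins. A minimal sketch, assuming subword-nmt style "@@ " continuation markers (function name and example are mine, not my actual script):

```python
def undo_bpe(line: str) -> str:
    """Merge BPE subword units back into tokens by removing '@@ ' joins.

    The second replace handles a stray '@@' at the end of a line,
    which can appear when a sentence is truncated mid-word.
    """
    return line.replace("@@ ", "").replace("@@", "")

# Example: subword output -> tokenized sentence
print(undo_bpe("The qui@@ ck bro@@ wn fox ."))  # -> "The quick brown fox ."
```

If this reversal (or the detokenization after it) is buggy, sacreBLEU scores drop sharply even when the model itself is fine, so it is worth spot-checking a few lines by hand at each stage.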
Training size: 6.4 M sentences
Test: 2k
Validation: 3k
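For reference, the training setup roughly follows the published OpenNMT-py base-Transformer recipe. A sketch of the relevant config, with placeholder paths, is below; the exact values here are the recipe's defaults, not necessarily what I ran:

```yaml
# Sketch of an OpenNMT-py 2.x base-Transformer config.
# Paths are placeholders; values follow the standard recipe.
src_vocab: data-deen/run/vocab.shared
tgt_vocab: data-deen/run/vocab.shared
share_vocab: true

# Base Transformer
encoder_type: transformer
decoder_type: transformer
layers: 6
heads: 8
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout: 0.1

# Noam schedule
optim: adam
adam_beta2: 0.998
decay_method: noam
learning_rate: 2
warmup_steps: 8000
label_smoothing: 0.1

batch_type: tokens
batch_size: 4096
train_steps: 100000
```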
Any insights on what could be missing or going wrong? I compared the generated vocabulary with one generated using a different framework and they look very similar. When I manually inspect the translations, they are pretty bad.
Here is an excerpt from the training log:
[2021-05-16 21:47:40,552 INFO] Step 99600/100000; acc: 52.30; ppl: 7.78; xent: 2.05; lr: 0.00028; 9769/9721 tok/s; 32927 sec
[2021-05-16 21:48:12,909 INFO] Step 99700/100000; acc: 54.08; ppl: 7.22; xent: 1.98; lr: 0.00028; 9210/9569 tok/s; 32960 sec
[2021-05-16 21:48:46,783 INFO] Step 99800/100000; acc: 52.84; ppl: 7.63; xent: 2.03; lr: 0.00028; 9118/9218 tok/s; 32993 sec
[2021-05-16 21:49:19,891 INFO] Step 99900/100000; acc: 53.60; ppl: 7.24; xent: 1.98; lr: 0.00028; 9905/10313 tok/s; 33027 sec
[2021-05-16 21:49:52,893 INFO] Step 100000/100000; acc: 54.28; ppl: 7.06; xent: 1.95; lr: 0.00028; 10062/10160 tok/s; 33060 sec
[2021-05-16 21:49:52,896 INFO] Loading ParallelCorpus(data-deen/bpe/valid.bpe.deu, data-deen/bpe/valid.bpe.eng, align=None)...
[2021-05-16 21:49:59,077 INFO] Validation perplexity: 32.8629
[2021-05-16 21:49:59,077 INFO] Validation accuracy: 42.1505
[2021-05-16 21:49:59,162 INFO] Saving checkpoint checkpoints/model_step_100000.pt