7.6 BLEU on Base Transformer

Hi! I trained a base transformer model (twice) and got 8.7 and 7.6 BLEU scores. I was hoping to get somewhere in the 30s.


  1. Learned BPE on the shared training set. Applied BPE to the train, valid, and test sets using scripts (i.e., not using transforms). Used onmt_build_vocab to generate a shared vocabulary of size 32k.
  2. Trained the base model configuration with the Noam learning-rate schedule for 100,000 steps.
  3. Used ensemble decoding for translation.
  4. Converted BPE-subword translations → tokenized translations → detokenized translations using scripts.
  5. Used sacreBLEU to score the output of step 4 against test.eng.
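The subword-merging and rough detokenization in step 4 can be sketched like this. This is a minimal illustration assuming the standard subword-nmt `@@ ` continuation marker; `merge_bpe` and `naive_detokenize` are hypothetical helper names (real pipelines typically use the Moses detokenizer for the last step):

```python
import re

def merge_bpe(line: str) -> str:
    """Join BPE subwords back into full tokens by removing '@@ ' markers."""
    return re.sub(r"@@ ", "", line).replace("@@", "")

def naive_detokenize(line: str) -> str:
    """Very rough detokenizer: attach punctuation to the preceding token."""
    return re.sub(r"\s+([.,!?;:])", r"\1", line)

bpe_output = "the trans@@ former mod@@ el works ."
tokens = merge_bpe(bpe_output)       # "the transformer model works ."
sentence = naive_detokenize(tokens)  # "the transformer model works."
```

If this merging step is skipped or applied with the wrong separator, sacreBLEU scores against the plain-text reference will be badly deflated.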

Training size: 6.4 M sentences
Test: 2k
Validation: 3k

Any insights on what could be missing / going wrong? I compared the generated vocabulary with vocabularies generated by other frameworks, and they look very similar. When I manually inspect the translations, they are pretty bad.

Here is an excerpt from training -

[2021-05-16 21:47:40,552 INFO] Step 99600/100000; acc:  52.30; ppl:  7.78; xent: 2.05; lr: 0.00028; 9769/9721 tok/s;  32927 sec
[2021-05-16 21:48:12,909 INFO] Step 99700/100000; acc:  54.08; ppl:  7.22; xent: 1.98; lr: 0.00028; 9210/9569 tok/s;  32960 sec
[2021-05-16 21:48:46,783 INFO] Step 99800/100000; acc:  52.84; ppl:  7.63; xent: 2.03; lr: 0.00028; 9118/9218 tok/s;  32993 sec
[2021-05-16 21:49:19,891 INFO] Step 99900/100000; acc:  53.60; ppl:  7.24; xent: 1.98; lr: 0.00028; 9905/10313 tok/s;  33027 sec
[2021-05-16 21:49:52,893 INFO] Step 100000/100000; acc:  54.28; ppl:  7.06; xent: 1.95; lr: 0.00028; 10062/10160 tok/s;  33060 sec
[2021-05-16 21:49:52,896 INFO] Loading ParallelCorpus(data-deen/bpe/valid.bpe.deu, data-deen/bpe/valid.bpe.eng, align=None)...
[2021-05-16 21:49:59,077 INFO] Validation perplexity: 32.8629
[2021-05-16 21:49:59,077 INFO] Validation accuracy: 42.1505
[2021-05-16 21:49:59,162 INFO] Saving checkpoint checkpoints/model_step_100000.pt
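As a sanity check on the log itself: the reported perplexity should just be the exponential of the cross-entropy, so the two columns should track each other up to rounding of the printed xent. A quick check against the values in the lines above:

```python
import math

# (xent, ppl) pairs read off the training log above; ppl should be ~exp(xent),
# with small gaps because xent is printed to only two decimals.
for xent, ppl in [(2.05, 7.78), (1.98, 7.22), (1.95, 7.06)]:
    print(f"xent={xent}  exp(xent)={math.exp(xent):.2f}  reported ppl={ppl}")
```

So the log is internally consistent; the problem is the large gap between training perplexity (~7) and validation perplexity (~33), not the numbers themselves.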


The most important info about training the model is missing here. What is the size of your data? You need at least 500k sentences to start seeing good results on a very similar test set, and at least a few million to be able to say you trained a real model.

I hope this helps.

Kind regards,

Hi! I've updated the post with the stats; I'm using about 6.4M sentences for training :slight_smile:

Thanks for the info! After training for 100k steps on 6.4M sentences, I would expect higher accuracy and lower perplexity for both training and validation. Something might be wrong with the data.
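One quick way to probe for data problems is to flag sentence pairs with extreme length ratios, a common cleaning heuristic for noisy or misaligned parallel corpora. A minimal sketch (the `suspicious` helper, the threshold, and the toy pairs are all illustrative, not part of the original pipeline):

```python
def suspicious(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Flag a sentence pair whose token-length ratio exceeds max_ratio."""
    s, t = len(src.split()), len(tgt.split())
    if s == 0 or t == 0:
        return True  # empty side: always suspicious
    return max(s, t) / min(s, t) > max_ratio

pairs = [
    ("ein kurzer Satz .", "a short sentence ."),
    ("nur", "this target is far too long for a one word source sentence"),
]
flagged = [p for p in pairs if suspicious(*p)]  # only the misaligned pair
```

Running something like this over the 6.4M training pairs (and spot-checking the flagged ones by hand) often reveals misalignment or crawling noise quickly.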

> Used ensemble decoding for translation.

How is the result without ensemble decoding?

Kind regards,