I am training a zh-en TransformerBig model with OpenNMT-tf 2.22.0 on a corpus of 55M sentence pairs. The training parameters are batch_size 2048, effective_batch_size 25000, save_checkpoints_steps 1000, and average_last_checkpoints 6.
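For context, here is a minimal sketch of this setup with the OpenNMT-tf Python API. The data paths, vocabulary files, and model_dir are placeholders; only the values in the train section come from my actual configuration:

```python
import opennmt

# Placeholder paths; only the "train" values below match my real run.
config = {
    "model_dir": "run/zhen_big",
    "data": {
        "source_vocabulary": "zh.vocab",
        "target_vocabulary": "en.vocab",
        "train_features_file": "train.zh",
        "train_labels_file": "train.en",
    },
    "train": {
        "batch_size": 2048,
        "effective_batch_size": 25000,
        "save_checkpoints_steps": 1000,
        "average_last_checkpoints": 6,
    },
}

runner = opennmt.Runner(opennmt.models.TransformerBig(), config, auto_config=True)
runner.train()
```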
After 41K steps, I average the last checkpoints and evaluate both the averaged checkpoint and the last checkpoint on the WMT20 zh-en test set, using sacreBLEU to compute BLEU. The last checkpoint scores 30.1, but the averaged checkpoint only scores 15.7. BLEU drops that much just from averaging checkpoints! The behavior is almost the same on the WMT18 and WMT19 zh-en test sets.
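Roughly, this is how I average and score (a sketch rather than my exact script: the file names are placeholders, I assume detokenized outputs and references, and I average through Runner.average_checkpoints):

```python
import opennmt
import sacrebleu

# Minimal inference config; the paths are placeholders for my actual files.
config = {
    "model_dir": "run/zhen_big",
    "data": {
        "source_vocabulary": "zh.vocab",
        "target_vocabulary": "en.vocab",
    },
}
runner = opennmt.Runner(opennmt.models.TransformerBig(), config, auto_config=True)

# Average the most recent checkpoints into a separate directory.
avg_dir = runner.average_checkpoints("run/zhen_big/avg", max_count=6)

# Translate the WMT20 test set with the latest and the averaged checkpoints.
runner.infer("wmt20.zh", predictions_file="hyp.last.en")
runner.infer("wmt20.zh", predictions_file="hyp.avg.en", checkpoint_path=avg_dir)

# Corpus BLEU on detokenized output vs. the detokenized reference.
def bleu(pred_file, ref_file="wmt20.en"):
    with open(pred_file, encoding="utf-8") as p, open(ref_file, encoding="utf-8") as r:
        hyps = [line.rstrip("\n") for line in p]
        refs = [line.rstrip("\n") for line in r]
    return sacrebleu.corpus_bleu(hyps, [refs]).score

print("last checkpoint:", bleu("hyp.last.en"))  # 30.1 in my run
print("averaged:      ", bleu("hyp.avg.en"))    # 15.7 in my run
```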
You should be able to find out which parameter is causing the BLEU drop fairly easily. I'm pretty sure it's the coverage penalty, which typically does not work well with Transformer models.
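If that is the cause, disabling it in the params section of the configuration is enough; a sketch of that section (the other values are illustrative, only coverage_penalty is the relevant key):

```python
# "params" section of the OpenNMT-tf configuration (decoding options).
# Values are illustrative; the relevant key is coverage_penalty.
params = {
    "beam_width": 4,
    "length_penalty": 0.6,
    "coverage_penalty": 0.0,  # 0 disables the coverage penalty
}
```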