BLEU drops sharply after averaging checkpoints


I am training a zh-en TransformerBig model with OpenNMT-tf 2.22.0 on a corpus of 55M sentences. The batch size is 2048, effective_batch_size is 25000, save_checkpoints_steps is 1000 and average_last_checkpoints is 6.

After 41K steps, I averaged the checkpoints and evaluated both the averaged checkpoint and the last checkpoint on the WMT20 zh-en test set, using sacrebleu to compute BLEU. The last checkpoint scores 30.1 BLEU, but the averaged checkpoint scores only 15.7. That is a drop of about 14 BLEU after averaging! The results are almost the same on the WMT18 and WMT19 zh-en test sets.

Can anybody tell me the reason?
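For context, here is a minimal sketch of what checkpoint averaging is generally understood to do: take the element-wise mean of every variable over the last N checkpoints (here N = 6, which would be steps 36K through 41K if a checkpoint was saved every 1000 steps as configured). The function name and the dict-of-lists layout are made up for illustration; this is not the OpenNMT-tf implementation.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of variables across checkpoints.

    checkpoints: list of dicts mapping a variable name to a flat list of
    floats (a hypothetical stand-in for real model variables).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this variable from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


# Toy example with two "checkpoints" holding one variable each.
toy = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
print(average_checkpoints(toy))
```

Averaging of this kind only changes the weights. It does not touch decoding options, so the two checkpoints should be scored with identical inference settings for the comparison to be meaningful.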


I checked the config file used for the averaged checkpoint and found that I had set some parameters in it that are not set in the config file for the last checkpoint:

beam_width: 5
length_penalty: 0.2
coverage_penalty: 0.2
replace_unknown_target: false

Could this be the reason for the decrease?
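To make the two penalty options concrete, here is a rough sketch of how hypotheses would be rescored, assuming length_penalty and coverage_penalty follow the GNMT formulation (Wu et al., 2016). I have not verified that OpenNMT-tf 2.22.0 computes them exactly this way, and the function names, the attention layout, and the small clamp that avoids log(0) are assumptions for illustration.

```python
import math


def length_penalty(length, alpha):
    # GNMT length normalization: ((5 + |Y|) / 6) ** alpha
    return ((5.0 + length) / 6.0) ** alpha


def coverage_penalty(attention, beta, eps=1e-9):
    """GNMT-style coverage term: beta * sum_i log(min(sum_j p_ij, 1)).

    attention[j][i] is the attention weight that target step j puts on
    source token i. eps is an assumed clamp so log(0) cannot occur.
    """
    num_source = len(attention[0])
    total = 0.0
    for i in range(num_source):
        covered = sum(step[i] for step in attention)
        total += math.log(max(min(covered, 1.0), eps))
    return beta * total


def rescored(log_prob, length, attention, alpha, beta):
    # Hypothesis score used for ranking beams under the assumed formulation.
    return log_prob / length_penalty(length, alpha) + coverage_penalty(attention, beta)


# A source token that receives almost no attention is penalized heavily.
even = [[0.5, 0.5], [0.5, 0.5]]
skewed = [[0.99, 0.01], [0.99, 0.01]]
print(coverage_penalty(even, 0.2), coverage_penalty(skewed, 0.2))
```

Under this formulation, the coverage term assumes attention weights behave like alignments. If they do not, the term can push beam search toward worse hypotheses.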


The different configuration is most likely the cause of the different score.

Can you evaluate again using the same parameters?

Thank you for your quick reply!

I evaluated the averaged checkpoint with the same config used for the last checkpoint, and BLEU increased to 31.

Could you please tell me why the difference is so big?

You can easily find which parameter is causing the BLEU drop by changing them one at a time. I’m pretty sure it’s because of the coverage penalty, which typically does not work with Transformer models.
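One way to run that check is a small ablation: start from the config that gave the good score and add the four extra settings one at a time, evaluating each variant with sacrebleu. The helper below only builds the variants; its name is hypothetical, and nesting the options under a `params` key is an assumption about where they sit in the config, so adjust it to match your file.

```python
# The four settings that were only present in the averaged-checkpoint config.
suspect_params = {
    "beam_width": 5,
    "length_penalty": 0.2,
    "coverage_penalty": 0.2,
    "replace_unknown_target": False,
}


def one_at_a_time(overrides):
    """Yield (name, config_fragment) pairs, each setting a single override.

    Everything not listed in a fragment keeps the value from the baseline
    config, so any BLEU change can be attributed to that one setting.
    """
    for name, value in overrides.items():
        yield name, {"params": {name: value}}


for name, fragment in one_at_a_time(suspect_params):
    print(name, fragment)
```

If only the coverage_penalty variant reproduces the low score, that confirms the suspicion above.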

Thank you for your valuable tips!
I will try it later.