I am training a zh-en TransformerBig model with OpenNMT-tf 2.22.0 on a corpus of 55M sentence pairs. The training parameters are batch_size 2048, effective_batch_size 25000, save_checkpoints_steps 1000, and average_last_checkpoints 6.
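For context, here is a minimal sketch of this setup with the OpenNMT-tf Python API. The data paths, vocabulary files, and model_dir are placeholders; only the values in the train section come from my actual configuration:

```python
import opennmt

# Placeholder paths; only the "train" values below match my real run.
config = {
    "model_dir": "run/zhen_big",
    "data": {
        "source_vocabulary": "zh.vocab",
        "target_vocabulary": "en.vocab",
        "train_features_file": "train.zh",
        "train_labels_file": "train.en",
    },
    "train": {
        "batch_size": 2048,
        "effective_batch_size": 25000,
        "save_checkpoints_steps": 1000,
        "average_last_checkpoints": 6,
    },
}

runner = opennmt.Runner(opennmt.models.TransformerBig(), config, auto_config=True)
runner.train()
```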
After 41K steps, I average the last checkpoints and evaluate both the averaged checkpoint and the last checkpoint on the WMT20 zh-en test set, using sacreBLEU to compute BLEU. The last checkpoint scores 30.1, but the averaged checkpoint only scores 15.7. BLEU drops that much just from averaging checkpoints! The behavior is almost the same on the WMT18 and WMT19 zh-en test sets.
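Roughly, this is how I average and score (a sketch rather than my exact script: the file names are placeholders, I assume detokenized outputs and references, and I average through Runner.average_checkpoints):

```python
import opennmt
import sacrebleu

# Minimal inference config; the paths are placeholders for my actual files.
config = {
    "model_dir": "run/zhen_big",
    "data": {
        "source_vocabulary": "zh.vocab",
        "target_vocabulary": "en.vocab",
    },
}
runner = opennmt.Runner(opennmt.models.TransformerBig(), config, auto_config=True)

# Average the most recent checkpoints into a separate directory.
avg_dir = runner.average_checkpoints("run/zhen_big/avg", max_count=6)

# Translate the WMT20 test set with the latest and the averaged checkpoints.
runner.infer("wmt20.zh", predictions_file="hyp.last.en")
runner.infer("wmt20.zh", predictions_file="hyp.avg.en", checkpoint_path=avg_dir)

# Corpus BLEU on detokenized output vs. the detokenized reference.
def bleu(pred_file, ref_file="wmt20.en"):
    with open(pred_file, encoding="utf-8") as p, open(ref_file, encoding="utf-8") as r:
        hyps = [line.rstrip("\n") for line in p]
        refs = [line.rstrip("\n") for line in r]
    return sacrebleu.corpus_bleu(hyps, [refs]).score

print("last checkpoint:", bleu("hyp.last.en"))  # 30.1 in my run
print("averaged:      ", bleu("hyp.avg.en"))    # 15.7 in my run
```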
You should be able to find out which parameter is causing the BLEU drop fairly easily. I'm pretty sure it's the coverage penalty, which typically does not work well with Transformer models.
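If that is the cause, disabling it in the params section of the configuration is enough; a sketch of that section (the other values are illustrative, only coverage_penalty is the relevant key):

```python
# "params" section of the OpenNMT-tf configuration (decoding options).
# Values are illustrative; the relevant key is coverage_penalty.
params = {
    "beam_width": 4,
    "length_penalty": 0.6,
    "coverage_penalty": 0.0,  # 0 disables the coverage penalty
}
```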