"sample_buffer_size" is set, every time if the training is interrupted, the Bleu value will drop after the training continues

I’m using OpenNMT-tf. I have 500,000,000 English–Chinese sentence pairs and I set “sample_buffer_size” to 100,000,000. However, whenever training is interrupted and then resumed from the last checkpoint, the BLEU score drops by about 5 points, e.g. from 47 to 42. Any idea what causes this?

Thanks a lot.

@guillaumekln

In this case, the data is first split into chunks of 100,000,000 examples. The chunks are visited in random order and the examples are uniformly shuffled within each chunk.
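
For illustration, here is a minimal Python sketch of this two-level shuffling scheme. It is not OpenNMT-tf’s actual implementation, and `chunked_shuffle` and its arguments are hypothetical names used only for this example:

```python
import random

def chunked_shuffle(examples, chunk_size, seed=None):
    """Yield examples in a two-level shuffled order: chunks are visited
    in random order, and examples within each chunk are shuffled uniformly.
    Sketch only, not OpenNMT-tf code."""
    rng = random.Random(seed)
    # Split the dataset into consecutive chunks of at most chunk_size examples.
    chunks = [examples[i:i + chunk_size] for i in range(0, len(examples), chunk_size)]
    rng.shuffle(chunks)         # visit chunks in random order
    for chunk in chunks:
        rng.shuffle(chunk)      # uniform shuffle within the chunk
        yield from chunk

# Example: with chunk_size=3, examples only mix within their original chunk.
print(list(chunked_shuffle(list(range(10)), chunk_size=3, seed=0)))
```

The point of the sketch is that examples never cross chunk boundaries, so if the original file is ordered by domain, each chunk (and therefore the early part of training after a restart) can be dominated by a single domain.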

Depending on how your training file is generated, it can happen that the first loaded chunk is from a different domain or style than your test set. This can hurt the BLEU score in early iterations.

You can try one of the following:

  • Manually shuffle the training data on disk before starting the training (a sketch is shown after this list)
  • Do not concatenate the training files and use weighted datasets (see the documentation): in this mode, each batch can contain examples from different domains.
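
For the first option, here is a minimal sketch of shuffling a parallel corpus on disk while keeping source and target lines aligned. The function name and file names are hypothetical, and it loads both files into memory, so for a corpus of 500M lines you would shuffle shard by shard or use an external disk-based shuffle instead:

```python
import random

def shuffle_parallel_corpus(src_in, tgt_in, src_out, tgt_out, seed=1234):
    """Shuffle a parallel corpus while keeping source/target lines aligned.
    Loads both files into memory; for very large corpora, shuffle per shard."""
    with open(src_in, encoding="utf-8") as f:
        src_lines = f.readlines()
    with open(tgt_in, encoding="utf-8") as f:
        tgt_lines = f.readlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    with open(src_out, "w", encoding="utf-8") as fs, open(tgt_out, "w", encoding="utf-8") as ft:
        for src, tgt in pairs:
            fs.write(src)
            ft.write(tgt)

# Hypothetical file names:
# shuffle_parallel_corpus("train.en", "train.zh", "train.shuf.en", "train.shuf.zh")
```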

That’s already how we do it, so I still can’t explain this behavior. Anyway, thank you for your reply; I’ll check the data again.

Best regards.