I'm using OpenNMT-tf. My problem is that I have 500,000,000 lines of en-zh data and I set "sample_buffer_size" to 100,000,000. However, if training is interrupted and then resumed from the last checkpoint, the BLEU score drops by about 5 points (e.g. 47 -> 42). Any idea what causes this?
In this case, the data is first split into chunks of 100,000,000 examples. The chunks are visited in random order and the examples are uniformly shuffled within each chunk.
Depending on how your training file is generated, it can happen that the first loaded chunk is from a different domain or style than your test set. This can hurt the BLEU score in early iterations.
You can try one of the following:
- Manually shuffle the training data on disk before starting the training (see the sketch after this list).
- Do not concatenate the training files and instead use weighted datasets (see the documentation): in this mode, each batch can contain examples from different domains.
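For the first option, here is a minimal Python sketch of shuffling a parallel corpus on disk while keeping source and target lines aligned. The file names `train.en` and `train.zh` are placeholders for your own files. Note that this simple version loads all sentence pairs into memory, so for a corpus of 500M lines you would likely need a chunked or external shuffle instead; the sketch only illustrates the idea.

```python
import random

# Placeholder paths; replace with your actual training files.
src_path = "train.en"
tgt_path = "train.zh"

# Read the parallel corpus as aligned (source, target) pairs.
with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
    pairs = list(zip(f_src, f_tgt))

# Shuffle the pairs uniformly; a fixed seed makes the shuffle reproducible.
random.seed(1234)
random.shuffle(pairs)

# Write the shuffled corpus back to disk, keeping both sides in sync.
with open(src_path + ".shuf", "w", encoding="utf-8") as f_src, \
     open(tgt_path + ".shuf", "w", encoding="utf-8") as f_tgt:
    for src_line, tgt_line in pairs:
        f_src.write(src_line)
        f_tgt.write(tgt_line)
```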