If I’m correct, training data is shuffled in both onmt-py and onmt-tf. My question concerns resuming training: suppose you are training a model and, for some reason, you have to stop or the training process crashes. Onmt provides the option to continue training from the last checkpoint, which is nice. Ideally, though, training would then continue with data the model has not seen yet (or at least has seen less often than other items). Is this the case as currently implemented? If not, resuming may give unexpected results (e.g. some samples seen twice and others never). To be fair, I am not sure whether this is even possible, or how hard it would be to implement. I am interested in both the py and tf versions.
Hey there,
For reference, a similar question has already been discussed a bit for -py
here: "Train from" - choosing preprocess chunk
To make this transparent, we could probably store a dict like {<dataset_name>: <current_shard_number>}
in the checkpoint. We’d gladly accept a PR for such a feature if you feel like diving into it.
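To give a rough idea of what I mean, here is a minimal sketch. It is not the actual OpenNMT-py checkpoint code: it uses plain `pickle` instead of `torch.save`, and the names `save_checkpoint`, `load_shard_progress` and the `shard_progress` key are made up for illustration.

```python
import pickle


def save_checkpoint(path, model_state, shard_progress):
    """Write the model state and the per-dataset shard progress to one file."""
    checkpoint = {
        "model": model_state,
        # Hypothetical key holding {<dataset_name>: <current_shard_number>}
        "shard_progress": dict(shard_progress),
    }
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)


def load_shard_progress(path):
    """Read the shard progress back from a checkpoint file."""
    with open(path, "rb") as f:
        checkpoint = pickle.load(f)
    # Checkpoints written without the key fall back to an empty dict,
    # which would mean "start from the first shard" for every dataset.
    return checkpoint.get("shard_progress", {})
```

The fallback to an empty dict is there so that older checkpoints without the extra key could still be loaded.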
Hey @BramVanroy
I opened a PR doing approximately what I described before: https://github.com/OpenNMT/OpenNMT-py/pull/1826
Would be great if you could check out this branch and test it in your setup.
Cool! I don’t have a lot of time to test it at the moment, but I’ll put it on my calendar!
@BramVanroy from the GitHub PR:
You seem to track the latest shard that has been used during training and also save it in the checkpoints. When continuing training, the latest shard can then be used.
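As I understand it, resuming could then look roughly like the sketch below. This is only an illustration under my own assumptions, not the code from the PR: `resume_order` and the shard file names are hypothetical, and I assume the stored shard is replayed from its start, since it may not have been finished when training stopped.

```python
def resume_order(shard_paths, start_shard):
    """Return the shard paths rotated so that iteration begins at start_shard.

    start_shard is the shard number read back from the checkpoint.
    Shards before it are moved to the end, so they are only visited
    again once the remaining shards have been used.
    """
    if not shard_paths:
        return []
    start = start_shard % len(shard_paths)
    return shard_paths[start:] + shard_paths[:start]


# Hypothetical shard file names, for illustration only
shards = ["train.0.pt", "train.1.pt", "train.2.pt", "train.3.pt"]
print(resume_order(shards, 2))
# ['train.2.pt', 'train.3.pt', 'train.0.pt', 'train.1.pt']
```

Note that this only restores the position at shard granularity. Samples inside the resumed shard that were already seen before the interruption would be seen again.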
@francoishernandez Is this kind of tracking implemented with OpenNMT-tf?