Does continuing training start from unseen/less-seen data?

BramVanroy · June 18, 2020, 8:46pm

If I’m correct, training data is shuffled in both OpenNMT-py and OpenNMT-tf. My question concerns resuming training: suppose you are training a model and have to stop for some reason, or the training process crashes. OpenNMT provides the option to continue training from the last checkpoint, which is nice. Ideally, though, training would then continue with data the model has not seen before (or at least has seen less frequently than other items). Is that the case as currently implemented? If not, that may give unexpected results (e.g. some samples seen twice and others never). To be fair, I am not sure whether this is even possible, or how hard it would be to implement. I am interested in both the py and tf versions.
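To make the concern concrete, here is a minimal sketch in plain Python of what "continuing with unseen data" could look like. It is not how OpenNMT-py or OpenNMT-tf handle data loading; the function names `epoch_order` and `iterate_from` are hypothetical. The assumption is that the shuffle order is derived from a seed and the epoch number, so saving `(epoch, offset)` alongside a checkpoint is enough to pick up where training stopped.

```python
import random


def epoch_order(num_examples, seed, epoch):
    """Return a deterministic permutation of example indices for one epoch."""
    # Hypothetical scheme: mix the base seed with the epoch number so that
    # each epoch has its own order, reproducible after a restart.
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_examples))
    rng.shuffle(order)
    return order


def iterate_from(num_examples, seed, epoch=0, offset=0):
    """Yield (epoch, next_offset, example_index) starting at a saved position."""
    while True:
        order = epoch_order(num_examples, seed, epoch)
        for pos in range(offset, num_examples):
            # next_offset is the position to resume from if training stops here.
            yield epoch, pos + 1, order[pos]
        epoch, offset = epoch + 1, 0


# Simulate a crash after 7 of 10 examples, then resume from the saved position.
seen = []
stream = iterate_from(10, seed=1234)
for _ in range(7):
    saved_epoch, saved_offset, index = next(stream)
    seen.append(index)

resumed = iterate_from(10, seed=1234, epoch=saved_epoch, offset=saved_offset)
for _ in range(3):
    _, _, index = next(resumed)
    seen.append(index)
```

With this scheme, every example in the interrupted epoch ends up being seen exactly once across the two runs. Without the saved position, the resumed run would restart the shuffle and repeat some examples before reaching others.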

francoishernandez · June 19, 2020, 8:34am

Hey there,
For reference, a similar question has already been discussed a bit for -py here: "Train from" - choosing preprocess chunk

To make this transparent, we could probably store a dict of the form {<dataset_name>: <current_shard_number>} in the checkpoint. We’d gladly accept a PR for such a feature if you feel like diving into it.
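A rough sketch of that idea, for illustration only: this is not the actual OpenNMT-py checkpoint code, and the key `shard_positions` as well as the helpers `save_checkpoint`, `load_checkpoint`, and `shards_to_read` are hypothetical names. `pickle` stands in for whatever serialization the trainer really uses.

```python
import pickle


def save_checkpoint(path, model_state, step, shard_positions):
    """Save the model state together with {dataset_name: current_shard_number}."""
    checkpoint = {
        "model": model_state,
        "step": step,
        # Hypothetical key: which shard each dataset was reading at save time.
        "shard_positions": dict(shard_positions),
    }
    with open(path, "wb") as f:
        pickle.dump(checkpoint, f)


def load_checkpoint(path):
    """Load a checkpoint written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


def shards_to_read(num_shards, start_shard):
    """Return the cyclic shard order, beginning at the saved shard."""
    return [(start_shard + i) % num_shards for i in range(num_shards)]
```

On resume, the trainer would read `shard_positions` from the checkpoint and start each dataset at its saved shard, for example `shards_to_read(5, 3)` gives `[3, 4, 0, 1, 2]`. Note that the granularity here is the shard: examples from the beginning of the saved shard may be seen again, since only the shard number is stored and not the position within it.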

francoishernandez · July 20, 2020, 5:14pm

Hey @BramVanroy
I opened a PR doing approximately what I described before: https://github.com/OpenNMT/OpenNMT-py/pull/1826
It would be great if you could check out this branch and test it in your setup.

BramVanroy · July 23, 2020, 4:10pm

Cool! At the moment I don’t have a lot of time to test it, but I’ll put it on my calendar!

Nart · February 19, 2021, 9:36pm

@BramVanroy from the GitHub PR:

You seem to track the latest shard that has been used during training and also save it in the checkpoints. When continuing training, the latest shard can then be used.

@francoishernandez Is this kind of tracking implemented with OpenNMT-tf?

francoishernandez · February 22, 2021, 9:59am

Not sure about this. I’ll let @guillaumekln answer in your other topic.