Resuming a training can be useful in various contexts:
Continue an existing training with the same parameters for a few more epochs
Continue a training with new data for in-domain adaptation or incremental training
Change some settings between two runs
The first parameter that triggers a training resume is -train_from.
At this point, the parameters describing the topology are loaded from the model itself, which makes the following command-line options useless: layers, rnn_size, brnn, brnn_merge, input_feed.
[for developers: you may want to raise a warning / error if the command line includes train_from together with one of these options]
Then there are two options:
You want to keep the training settings: you need to set -continue. In this mode, the following parameters are loaded from the model checkpoint itself: start_epoch, start_iteration, learning_rate, learning_rate_decay, start_decay_at, optim, optim_state and curriculum.
You can also change the data file for incremental / in-domain adaptation, and set the end_epoch parameter (see the example command after this list).
[for developers: you may want to raise a warning / error if the command line includes these preloaded options]
If you want to change the training settings (learning rate, decay, …), you need to start a new training curve.
In this case, you need to explicitly set:
start_epoch ===> @guillaumekln [no check here against the last epoch run, right?]
start_iteration, learning_rate, learning_rate_decay, start_decay_at, optim, optim_state and curriculum
end_epoch
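For example (file names and values below are only placeholders), resuming a training on new, already preprocessed in-domain data while keeping the saved settings could look like this:

```
# Sketch of option 1: -continue keeps the optimization settings saved in the
# checkpoint; only the data file and the end epoch are given on the command line.
th train.lua \
  -train_from model_epoch13.t7 \
  -continue \
  -data indomain-train.t7 \
  -save_model model-indomain \
  -end_epoch 20
```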
I am unclear (but will EDIT when I know) about some other options, like:
max_batch_size
dropout
and a few other ones.
Thank you for this overview that needs to be maintained over time. Retraining can be ambiguous depending on the user’s goal: changing the training configuration vs. resuming a training that has stopped.
Yes, no check is done when -start_epoch is used. It is a way for the user to simulate the continuity of the training, but possibly with other options.
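For example, something along these lines (the model file name and values are placeholders) restarts the curve at epoch 14 with a different learning rate schedule:

```
# No -continue: the optimization settings come from the command line, and
# -start_epoch 14 only simulates the continuity of the epoch numbering.
th train.lua \
  -train_from model_epoch13.t7 \
  -start_epoch 14 \
  -end_epoch 20 \
  -optim sgd \
  -learning_rate 0.5 \
  -learning_rate_decay 0.7 \
  -start_decay_at 15 \
  -data data/demo-train.t7 \
  -save_model model
```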
Also, changing -max_batch_size is another use case of retraining. You could start one epoch with a large batch size to quickly converge to a not-so-bad solution, then reduce the batch size for a more fine-grained training. It can be changed with or without -continue.
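A sketch of that two-run setup (batch sizes and file names are only illustrative):

```
# First run: large batches for a fast, coarse convergence.
th train.lua -data data/demo-train.t7 -save_model model -max_batch_size 256 -end_epoch 5

# Second run: resume from the last checkpoint with smaller batches for a
# finer-grained training; -continue keeps the other saved settings.
th train.lua -train_from model_epoch5.t7 -continue -max_batch_size 32 \
  -data data/demo-train.t7 -save_model model -end_epoch 13
```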
I wonder why there are 2 different cases? I would have found it much more intuitive to always use the "continue" option, with all other options defined on the command line overriding the options loaded from the model. Is this not the case?
@guillaumekln: It looks like the link above does not work. BTW, I was wondering if there is a chance to re-use the OpenNMT pre-trained models. If there is a way, could you please let me know what the procedure is if I already have a tokenized and processed in-domain corpus? If not, what is the best way?
Note that I checked the Retraining documentation page, and except for the mention of training a model on new data (incremental adaptation), there are no hints on how to run this incremental adaptation training. Is it enough to just run something like the following?
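(File names below are placeholders; I am assuming the pre-trained model is distributed with its source and target *.dict vocabulary files, so that the in-domain corpus can be preprocessed with the same vocabularies.)

```
# Preprocess the in-domain corpus reusing the vocabularies of the pre-trained model.
th preprocess.lua \
  -train_src indomain-src-train.txt -train_tgt indomain-tgt-train.txt \
  -valid_src indomain-src-val.txt -valid_tgt indomain-tgt-val.txt \
  -src_vocab pretrained.src.dict -tgt_vocab pretrained.tgt.dict \
  -save_data indomain

# Continue training the pre-trained model on the new data
# (-continue could be added to also restore the saved optimization settings).
th train.lua -train_from pretrained_model.t7 \
  -data indomain-train.t7 -save_model indomain-model
```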