Resume training - various options

Resume training can be useful in various contexts:

  • Continue an existing training with the same parameters for a few more epochs
  • Continue a training on new data for in-domain adaptation or incremental training
  • Change some settings between two runs

The parameter that triggers a training resume is -train_from.

At this point, some parameters describing the model topology are loaded from the checkpoint itself, making the corresponding command-line options redundant: layers, rnn_size, brnn, brnn_merge, input_feed.

[for developers: you may want to raise a warning / error if the command line includes train_from and one of these options]
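As a minimal sketch, a resume from a checkpoint might look like this (the data file, model name and checkpoint file name are illustrative, not taken from a real run):

```shell
# Resume training from an existing checkpoint; topology options
# (layers, rnn_size, brnn, brnn_merge, input_feed) are read from
# the checkpoint itself, so they are not given on the command line.
th train.lua -data data/demo-train.t7 \
             -save_model demo-model \
             -train_from demo-model_epoch10_4.50.t7
```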

Then there are two options:

  1. If you want to keep the training settings, set “-continue”. In this mode, the following parameters are loaded from the model checkpoint itself: start_epoch, start_iteration, learning_rate, learning_rate_decay, start_decay_at, optim, optim_state and curriculum.
    You can change the data file for incremental / in-domain adaptation, and set the end_epoch parameter.

[for developers: you may want to raise a warning / error if the command line includes these preloaded options]
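A hedged sketch of this first mode, continuing the same training curve on new data (file and checkpoint names are illustrative):

```shell
# -continue restores optimizer state, learning rate, decay schedule
# and epoch counter from the checkpoint. Only the data file and the
# end epoch are changed here, e.g. for in-domain adaptation.
th train.lua -data data/indomain-train.t7 \
             -save_model demo-model-adapted \
             -train_from demo-model_epoch10_4.50.t7 \
             -continue \
             -end_epoch 16
```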

  2. If you want to change the training settings (learning rate, decay, …), you need to start a new training curve.
    In this case, you need to be explicit about:
    start_epoch ===> @guillaumekln [no check here versus the last epoch run, right?]
    start_iteration, learning_rate, learning_rate_decay, start_decay_at, optim, optim_state and curriculum
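A sketch of this second mode, with the optimization settings given explicitly instead of being restored (all values and file names are illustrative assumptions, not recommendations):

```shell
# Start a new training curve from the checkpoint: without -continue,
# the optimization settings must be set explicitly on the command line.
th train.lua -data data/demo-train.t7 \
             -save_model demo-model-finetune \
             -train_from demo-model_epoch10_4.50.t7 \
             -start_epoch 11 \
             -learning_rate 0.5 \
             -learning_rate_decay 0.7 \
             -start_decay_at 12 \
             -optim sgd
```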

I am unclear (but will EDIT this post when I know) about some other options, and a few other ones.

Thanks for this overview; it will need to be maintained over time. Retraining can be ambiguous depending on the user’s goal: changing the training configuration vs. resuming a training that has stopped.

Yes, no check is done when -start_epoch is used. It lets the user simulate the continuity of the training, but possibly with other options.

Also, changing -max_batch_size is another use case of retraining. You could start one epoch with a large batch size to quickly converge to a not-so-bad solution, then reduce the batch size for finer-grained training. It can be changed with or without -continue.
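This two-stage batch-size schedule could be sketched as follows (batch sizes, epoch counts and checkpoint names are illustrative assumptions):

```shell
# First run: large batches for fast, coarse convergence.
th train.lua -data data/demo-train.t7 -save_model demo-model \
             -max_batch_size 256 -end_epoch 1

# Second run: resume with smaller batches for finer-grained updates;
# -continue keeps the rest of the training settings from the checkpoint.
th train.lua -data data/demo-train.t7 -save_model demo-model \
             -train_from demo-model_epoch1_6.20.t7 \
             -continue -max_batch_size 32 -end_epoch 13
```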

Dropout is currently fixed inside the model.

Let’s add this to the guide.

I wonder why there are 2 different cases? I would have found it much more intuitive to always use the “continue” option and have all other options defined on the command line override the options loaded from the model. Is this not the case?

Due to torch.CmdLine design, we can’t determine whether an option was given on the command line or not.

I completed the guide with this:

@guillaumekln: Looks like the link above does not work. BTW, I was wondering if there is a chance to re-use the OpenNMT pre-trained models. If there is a way, could you please let me know what the procedure is if I already have a tokenized and processed in-domain corpus? If not, what is the best way?

Note that I checked the Retraining documentation page, and except for the mention of training a model on new data (incremental adaptation), there are no hints on how to run this incremental adaptation training. Is it enough to just run, say,

th train.lua -gpuid 1 -data data/INDOMAIN_CORPUS.t7 -save_model IN_DOMAIN_MODEL -save_every 1000 -train_from onmt_baseline_wmt15-all.en-de_epoch13_7.19_release.t7

Thank you.

  • The link above is now the Retraining page you referenced.
  • Pre-trained models can only be used for translation.
  • Incremental adaptation has been discussed many times on the forum, see for example:
