I would like to simplify the resume training configuration.
The principle would be to eliminate the -continue option.
When using -train_from, all settings would be loaded from the checkpoint, BUT any option explicitly given on the train.lua command line would override the corresponding checkpoint setting. If an option cannot be modified, then die with an error.
What do you think ?
The rationale behind this is that, at the moment, this is very confusing, and mistakes can happen when using -continue with some modified settings like learning_rate, …
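To make the proposal concrete, here is a minimal Python sketch of the merge rule (the real code is Lua; the IMMUTABLE set and the option names are illustrative assumptions, not the actual train.lua lists):

```python
# Illustrative sketch of the proposed -train_from merge rule.
# Hypothetical option names; the real train.lua lists differ.

# Options that describe the network itself and cannot change on resume.
IMMUTABLE = {"rnn_size", "layers", "word_vec_size"}

def resolve_options(checkpoint_opts, cmdline_opts):
    """Start from the checkpoint settings, then apply every option the
    user explicitly passed on the command line; die if an immutable
    option would be modified."""
    resolved = dict(checkpoint_opts)
    for name, value in cmdline_opts.items():
        if name in IMMUTABLE and checkpoint_opts.get(name) != value:
            raise SystemExit(
                "Error: option -%s cannot be changed when resuming" % name)
        resolved[name] = value
    return resolved
```

So, under this rule, resuming with only a new learning rate would keep everything else from the checkpoint, while changing a structural option would abort with an error.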
Mmh, I liked it at first but I have some concerns:
Doesn’t it make retraining (the current equivalent of -train_from without -continue) more error-prone instead? For example, if you want to start a new training from model_checkpoint.lua, you have to reset many options to their “default” values: -start_epoch, -start_iteration, -learning_rate, etc.
What about training states? When do you reuse the order of batches (when continuing an intermediate model), the non-SGD optimizer states, and the random generator states? Only if no options were changed?
Do you also copy the batch size and the GPU identifiers? More generally, are there exceptions?
I am not sure: -continue was initially meant for resuming an interrupted training, so you can relaunch the same command line with -continue; for that reason it ignores some of the command line (or config file) parameters.
-train_from is more generic, and there the command line is explicitly the reference.
Isn’t -train_from the generic approach? And couldn’t we keep -continue only for the resuming cases?
I agree with @dbl and @emartinezVic : both options are useful, but need perhaps clearer explanations.
-continue: take everything from the saved model; most parameters can be changed on the command line, like epochs, learning rate, fixed embeddings, …
-train_from: take everything from the command line; unspecified parameters get the ONMT default values, not the model’s saved values (except for a few mandatory ones, of course, like the net size or the embeddings).
In my own current experiments, not being really sure of what the -continue option does, I always use -train_from. What is really lacking in such usage cases is a full report of all used parameter values in the log file at startup! It’s the only way to be sure which values ONMT is using when running. This full report could also be an option.
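A minimal sketch of such a startup report, in Python for illustration (the source labels and option names are assumptions, not ONMT’s actual internals): every resolved option is printed together with where its value came from.

```python
# Illustrative sketch: dump every resolved option, and where its value
# came from, to the log at startup, so there is no doubt which values
# ONMT is actually using. Names here are hypothetical.

def log_resolved_options(defaults, checkpoint_opts, cmdline_opts, log=print):
    """Later sources win: defaults < checkpoint < command line."""
    resolved, sources = {}, {}
    for source, opts in (("default", defaults),
                         ("checkpoint", checkpoint_opts),
                         ("cmdline", cmdline_opts)):
        for name, value in opts.items():
            resolved[name], sources[name] = value, source
    for name in sorted(resolved):
        log("  -%s = %s (%s)" % (name, resolved[name], sources[name]))
    return resolved
```

Printing the origin of each value (default / checkpoint / cmdline) would also answer the "what is ONMT really doing with my parameters" question below.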
-continue and -train_from are not mutually exclusive: -continue is an additional option on top of -train_from.
Also, like you, I have mostly run without -continue, BUT note that in fact this way of doing things leads to wrong results, because the random generator states are not retrieved: each new run (without -continue) will therefore not reshuffle the batches or re-randomize the dropout noise.
Anyway, as long as everyone knows what they are doing, it’s fine with me.
Yes, of course. I just described 2 cases to highlight 2 different usage configurations: in fact, with and without the -continue option.
Perhaps a small code change would be to really have 2 different, mutually exclusive options (with an error when both are used):
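Something like this sketch (Python for illustration; train.lua is Lua and its real option handling differs):

```python
# Illustrative sketch: refuse to run when both resume modes are
# requested at once, as proposed above. Hypothetical option names.

def check_exclusive(opts):
    """Die with an error when -continue and -train_from are both given."""
    if opts.get("continue") and opts.get("train_from"):
        raise SystemExit(
            "Error: -continue and -train_from are mutually exclusive")
```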
I’m often a bit puzzled by what ONMT is really doing with my parameters, mixing all possible options from the model, the command line, and the default values. As you said, perhaps a prior need is to document all effects of both situations, in fine detail, somewhere on a doc page. As I said, the second need is to list all parameters really used by ONMT in the log at each start.
@vince62s reported a problem with the random number generator: today, it is saved and restored only when using -continue. This is a problem because, when not using -continue (for instance for an epoch-per-epoch training), the random generator state is always the same at the beginning and we lose the randomness of batches (https://github.com/OpenNMT/OpenNMT/issues/188). Note that the -seed option would then be ignored, so I propose a slightly more sophisticated rule: if the -seed value does not change, we keep the last state of the RNG when doing a -train_from; otherwise, we take the new value.
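A Python sketch of that rule (illustrative only; the real code is Lua and the names here are hypothetical):

```python
# Illustrative sketch of the proposed seed rule: reuse the checkpoint's
# RNG state only when the -seed value on the command line is unchanged.

def resolve_rng(saved_seed, saved_state, cmdline_seed):
    """Return the RNG state to restore, or None to reseed from scratch."""
    if cmdline_seed == saved_seed:
        return saved_state  # same seed: continue the saved random stream
    return None             # new seed: the user explicitly asked to reseed
```

So a plain relaunch keeps the shuffling stream going, while an explicitly different -seed wins over the checkpoint.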
-continue is a bit non-intuitive since, for instance, if we do -continue and -learning_rate X, the learning rate is ignored without warning, while some other changed parameters are taken into account. I propose at least a clear message saying that the command line parameter is ignored.
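For instance, something like this sketch (Python for illustration; the set of options that -continue ignores is a hypothetical placeholder, not the actual list):

```python
# Illustrative sketch: warn loudly about every command line option that
# -continue will ignore. The IGNORED_WITH_CONTINUE set is hypothetical.

IGNORED_WITH_CONTINUE = {"learning_rate", "start_epoch", "start_iteration"}

def warn_ignored(cmdline_opts, log=print):
    """Log a warning for each explicitly passed option that -continue
    overrides with the checkpoint value; return the ignored names."""
    ignored = []
    for name in sorted(cmdline_opts):
        if name in IGNORED_WITH_CONTINUE:
            log("Warning: -%s is ignored when using -continue" % name)
            ignored.append(name)
    return ignored
```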