Resume training alternatives


(Vincent Nguyen) #1

This topic has been discussed numerous times, especially in the thread “Resume training - various options”.

I would like to simplify the resume training configuration.

The principle would be to eliminate the “-continue” option.

When using “-train_from”, all settings would be loaded from the checkpoint, BUT any other option given on the train.lua command line would override the corresponding setting from the checkpoint. If an option cannot be modified, then die with an error.
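A minimal sketch of that rule, written in Python rather than the Lua of train.lua. The function name `resolve_options` and the `LOCKED` set are hypothetical, and which options are really non-modifiable is an assumption made only for illustration:

```python
# Assumption: options that cannot change once a model exists (illustrative only).
LOCKED = {"rnn_size", "layers", "word_vec_size"}


def resolve_options(checkpoint_opts, cmdline_opts):
    """Start from the checkpoint settings, then apply explicit command-line overrides."""
    resolved = dict(checkpoint_opts)
    for name, value in cmdline_opts.items():
        # A locked option may be repeated on the command line, but not changed.
        if name in LOCKED and checkpoint_opts.get(name) != value:
            raise ValueError(f"option -{name} cannot be modified when using -train_from")
        resolved[name] = value
    return resolved


checkpoint = {"rnn_size": 500, "learning_rate": 0.5, "start_epoch": 8}
print(resolve_options(checkpoint, {"learning_rate": 1.0}))
```

Only options that were explicitly given on the command line are passed as `cmdline_opts`, so unspecified options keep their checkpoint value.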

What do you think ?

The rationale is that the current behaviour is very confusing, and mistakes are easy to make when using -continue with modified settings such as learning_rate, …


(Guillaume Klein) #2

Mmh, I liked it at first but I have some concerns:

  • Doesn’t it make the retraining (current equivalent of -train_from without -continue) more error-prone instead? For example, if you want to start a new training from model_checkpoint.lua, you have to reset many options to their “default” value: -start_epoch, -start_iteration, -learning_rate, etc.
  • What about training states? When do you reuse the order of batches (continuing an intermediate model), the non-SGD optimizer states and the random generator states? Only if no options were changed?
  • Do you also copy the batch size, GPU identifiers? More generally, are there exceptions?

(jean.senellart) #3

I am not sure. -continue was initially meant for continuing an interrupted training, so that you can relaunch the same command line with -continue added. For that reason it ignores some of the command-line (or config file) parameters.

-train_from is more generic and explicitly the command line is the reference.

Isn’t -train_from the generic approach, with -continue kept only for the continuing case?


(Vincent Nguyen) #4

that’s my point, I think it’s better to have an All or Nothing for clarity.

Nothing = -continue = resume interrupted training in the exact same state where it was left

All = -train_from, you have to be specific with all parameters to be modified. The only new stuff that could be introduced is to allow retrieving states too (which is possible now)

For non-SGD, I think the beginning would be to make them work :slight_smile:


(Guillaume Klein) #5

Feedback from people with retraining experience would be appreciated. @dbl @emartinezVic @Etienne38.

@vince62s Non-SGD methods actually work in the sense they are numerically correct. If they don’t perform well on the NMT tasks that is another issue.


(David Landan) #6

I am happy with the current -train_from and -continue options. The first time I read them, it was a little unclear, but I think that’s more of a documentation than implementation issue. :wink:

In practice, I find the current options both sufficient and straightforward.


(Eva) #7

I see it the same way as @dbl.
I think it is a good idea to keep both options, -continue and -train_from, while clarifying their documentation, because it can be confusing on first reading.


(Etienne Monneret) #8

I agree with @dbl and @emartinezVic: both options are useful, but perhaps need clearer explanations.

continue : take all from the saved model, most parameters can be changed on the command line, like epochs, learning rate, fixed embeddings, …

train_from : take all from the command line; parameters that are not specified get the ONMT default values, not the values saved in the model (except for a few mandatory ones, of course, like the net size or the embeddings).

In my own current experiments, not being really sure of what the “continue” option is doing, I’m always using the “train_from” option. What is really lacking in such use cases: a full report of all parameter values in use, written to the log file at start! It’s the only way one can be sure of the values ONMT is using when running. This full report could also be optional.
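To make the idea of such a report concrete, here is a small sketch in Python (not ONMT code). The function `describe_options` and the layer names are hypothetical; the order of the layers is supplied by the caller, so the sketch does not assume which source ONMT actually gives priority to:

```python
def describe_options(layers):
    """Build one report line per option.

    layers: list of (source_name, options) pairs; later layers override earlier ones.
    """
    effective = {}
    for source, options in layers:
        for name, value in options.items():
            effective[name] = (value, source)
    return [
        f"{name} = {value!r} (from {source})"
        for name, (value, source) in sorted(effective.items())
    ]


report = describe_options([
    ("default", {"learning_rate": 1.0, "dropout": 0.3}),
    ("command line", {"learning_rate": 0.5}),
])
for line in report:
    print(line)
```

Printing the source next to each value shows at a glance whether a setting came from the model, the command line, or a default.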


(Vincent Nguyen) #9

@Etienne38
This is my point:

  1. continue and train_from are not mutually exclusive. “continue” is an additional option to train_from.

Also, like you, I mostly trained without continue, BUT note that this way of doing things actually leads to wrong results, because the random generator states are not retrieved, and therefore each new run (without continue) will not shuffle the batches or randomize the dropout noise differently from the previous one.
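The effect can be illustrated with Python's standard `random` module (an analogy only, not the Torch generator ONMT uses; the seed value and the helper `shuffled_batches` are arbitrary, hypothetical choices):

```python
import random


def shuffled_batches(rng, n=6):
    """Return a shuffled order of n batch indices drawn from the given generator."""
    order = list(range(n))
    rng.shuffle(order)
    return order


# Two runs that both restart from the same seed see exactly the same batch order.
run1 = shuffled_batches(random.Random(1234))
run2 = shuffled_batches(random.Random(1234))
print(run1 == run2)

# Saving the generator state and restoring it in the next run
# continues the random stream instead of replaying it.
rng = random.Random(1234)
first_epoch = shuffled_batches(rng)
saved_state = rng.getstate()  # what a checkpoint would need to store

resumed = random.Random()
resumed.setstate(saved_state)
second_epoch = shuffled_batches(resumed)
```

Here `second_epoch` is what the original generator would have produced next, which is the behaviour one gets only when the saved states are retrieved.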

Anyway as long as everyone knows what they are doing it’s fine to me.


(Etienne Monneret) #10

Yes. Of course. I just made 2 cases, to highlight 2 different usage configurations. In fact, with or without “continue” option.

Perhaps a small code tuning would be to really have 2 different exclusive options (with an error when both used):
-train_from MODEL
-continue MODEL
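As an illustration of what two exclusive options could look like, here is a sketch using Python's `argparse` (train.lua has its own Lua option parser, so this only shows the behaviour; `build_parser` and the file name `model.t7` are hypothetical):

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="train")
    # At most one of the two options may be given; argparse reports an error otherwise.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-train_from", metavar="MODEL",
                       help="start a new training from MODEL using command-line settings")
    group.add_argument("-continue", dest="continue_from", metavar="MODEL",
                       help="resume the interrupted training saved in MODEL")
    return parser


args = build_parser().parse_args(["-train_from", "model.t7"])
print(args.train_from, args.continue_from)
```

Passing both options makes `parse_args` exit with a "not allowed with argument" error instead of silently mixing the two modes.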

I’m often a bit puzzled by what ONMT is really doing with my parameters, mixing all possible options from the model, the command line, and the default values. As you said, perhaps the first need is to list all the effects of both situations, in fine detail, somewhere on a doc page. As I said, the second need is to list all the parameters really used by ONMT in the log at each start.


(David Landan) #11

A json or yaml file for each epoch listing all parameters for that epoch would be handy, and probably not too much effort. I’d very much like to see that.
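Something along these lines, as a rough Python sketch (the function name `dump_epoch_options` and the file naming scheme are hypothetical, not an existing ONMT feature):

```python
import json
from pathlib import Path


def dump_epoch_options(options, epoch, out_dir):
    """Write the options in effect for one epoch to a JSON file and return its path."""
    path = Path(out_dir) / f"options_epoch{epoch}.json"
    path.write_text(json.dumps(options, indent=2, sort_keys=True))
    return path
```

Calling it once at the start of every epoch, with the options actually in effect, would leave one small file per epoch that can be diffed against the previous one.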


(Guillaume Klein) #12

Interesting thread.

As a starter, the training is now dumping training options when using -log_level DEBUG.


(jean.senellart) #13

to summarize the issue:

  • @vince62s reported a problem with the random number generator: today, its state is saved and reused only when using -continue. This is a problem because, without -continue (for instance for an epoch-per-epoch training), the random generator state is always the same at the beginning and we lose the randomness of batches (https://github.com/OpenNMT/OpenNMT/issues/188). Note that always restoring the state would cause the -seed option to be ignored, so I propose a slightly more sophisticated rule: if the -seed value does not change, we keep the last state of the rng when doing a -train_from; otherwise we take the new value.

  • -continue is a bit unintuitive: for instance, if we use -continue and -learning_rate X, the learning rate is ignored without warning, whereas changes to some other parameters are taken into account. I propose at least a clear message saying that the command-line parameter is ignored.
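The two proposals above could be sketched as follows, in Python for readability (the function names, the return convention and the message wording are all hypothetical, not the actual train.lua implementation):

```python
def choose_rng_start(saved_seed, saved_rng_state, cmdline_seed):
    """Decide how to initialise the random generator when doing a -train_from.

    Returns ("state", saved_rng_state) to continue the saved random stream,
    or ("seed", cmdline_seed) to reseed from the command-line value.
    """
    if cmdline_seed == saved_seed and saved_rng_state is not None:
        return ("state", saved_rng_state)
    return ("seed", cmdline_seed)


def ignored_option_warnings(cmdline_opts, ignored_names):
    """Build one explicit message per command-line option that -continue will ignore."""
    return [
        f"WARNING: -{name} is ignored when using -continue"
        for name in sorted(cmdline_opts)
        if name in ignored_names
    ]


print(choose_rng_start(saved_seed=1234, saved_rng_state="<saved state>", cmdline_seed=1234))
print(ignored_option_warnings({"learning_rate": 0.5}, {"learning_rate"}))
```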

@vince62s, would it work for you?


(Vincent Nguyen) #14

works for me.
thanks


(jean.senellart) #15

See:

  • if the cmdline changes the seed compared to the previously saved model, then it has priority over the saved rng states
  • if the cmdline changes learning_rate or start_epoch, then it also has priority over the -continue option

So -continue is really meant to continue a previously interrupted training. And with the seed validation, we guarantee more stability from a given starting point, but still with randomness.
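Read as a rule, the second point could look like the following Python sketch (the names `OVERRIDES_CONTINUE` and `continue_options` are hypothetical, and the real logic lives in the Lua code):

```python
# Options whose command-line value wins over -continue, as listed above.
OVERRIDES_CONTINUE = ("learning_rate", "start_epoch")


def continue_options(checkpoint_opts, cmdline_opts):
    """With -continue, keep the checkpoint values unless the command line
    changes one of the overriding options."""
    resolved = dict(checkpoint_opts)
    for name in OVERRIDES_CONTINUE:
        if name in cmdline_opts and cmdline_opts[name] != checkpoint_opts.get(name):
            resolved[name] = cmdline_opts[name]
    return resolved


saved = {"learning_rate": 0.5, "start_epoch": 8, "max_batch_size": 64}
print(continue_options(saved, {"learning_rate": 0.25, "max_batch_size": 32}))
```

In this sketch any other command-line option, such as the batch size in the example, is left at its checkpoint value; whether ONMT treats each remaining option this way is not stated here.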