Changing the behaviour of the `end_epoch` option when used in combination with the `train_from` and `continue` options

Using the `train_from` and `continue` options allows training to continue from an existing model.
The latest epoch number is retrieved automatically, so there's no need to specify it with the `start_epoch` option. However, we do need to specify the final epoch until which we want to continue training, which is what the `end_epoch` option is for.
Now, say I want to train n more epochs: I first have to know the latest epoch number. It would be useful if, when the `continue` option is present, `end_epoch` were interpreted as "this many more epochs".

What do you think about this proposition?

In fact, reading the changelog, it seems that `end_epoch` used to be named `epochs`. I don't know which would be clearer: adding a new option that sets how many epochs to train from the start, or changing the behaviour of `end_epoch` when used with `continue`.

I think we should add a new option, `epochs`, to avoid confusion. If its value is > 0, it takes priority over `end_epoch`.
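To make the proposal concrete, here is a minimal sketch of how such a priority rule could resolve the final epoch. This is purely illustrative Python, not OpenNMT code (the actual implementation would live in the Lua option handling), and the function name is hypothetical:

```python
# Hypothetical sketch of the proposed option resolution (not OpenNMT code).
# `epochs`, if > 0, takes priority over `end_epoch`; combined with -continue,
# it means "this many more epochs" relative to the resumed epoch.
def resolve_end_epoch(epochs, end_epoch, start_epoch):
    """Return the final epoch number the training loop should reach."""
    if epochs > 0:
        # Relative count: train `epochs` more epochs from where we resume.
        return start_epoch + epochs - 1
    # Otherwise fall back to the absolute `end_epoch` value.
    return end_epoch

# Fresh training with -epochs 1 runs epoch 1 only.
assert resolve_end_epoch(epochs=1, end_epoch=13, start_epoch=1) == 1
# Resuming at epoch 5 with -continue -epochs 1 runs epoch 5 only.
assert resolve_end_epoch(epochs=1, end_epoch=13, start_epoch=5) == 5
# Without -epochs, the absolute end_epoch still applies.
assert resolve_end_epoch(epochs=0, end_epoch=13, start_epoch=5) == 13
```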

Training epoch by epoch would then look like this:

```
th train.lua [...] -epochs 1
th train.lua [...] -train_from model.t7 -continue -epochs 1
th train.lua [...] -train_from model.t7 -continue -epochs 1
```


I would like to take this opportunity to question the epoch concept again, and perhaps move to "iterations" or "steps" like most other projects do.
Just saying.


With data (or file) sampling, we actually changed the definition of an epoch from a pass over the whole dataset to a number of steps after which evaluation and learning rate updates are performed. So, in a way, we support both the epoch and the step worlds.
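As a trivial illustration of that equivalence (the numbers are made up, not project defaults), an "epoch" under sampling is just a fixed step budget:

```python
# Illustrative sketch (not OpenNMT code): with sampling, an "epoch" is simply
# a fixed number of steps, so step budgets and epoch counts are interchangeable.
def steps_to_epochs(total_steps, steps_per_epoch):
    """How many evaluation/LR-update cycles a given step budget covers."""
    return total_steps // steps_per_epoch

# e.g. a 50k-step budget, evaluating every 5k steps, corresponds to 10 "epochs".
assert steps_to_epochs(50000, 5000) == 10
```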

But I agree that steps only would be clearer and easier to manage. We don't plan to make a full switch in OpenNMT (Lua), but it could happen elsewhere. :wink: