How can I incrementally train my model on new data?

Dear folks,

I am relatively new to OpenNMT.
How can I incrementally train my model on new data?
Suppose I have trained on 10 million sentence pairs, resulting in, say, model0.pt.
Now I want to add 10 million new training sentence pairs, and I have already preprocessed them.

Now, with the command:

onmt_train -data data/newtraindata -save_model model1 -train_from model0.pt

will the new model1.pt be based on training on a total of 20 million sentence pairs?
In other words, is the -train_from parameter the right one for incremental training?
Can anyone guarantee that model1 will be the same as if I had trained from scratch on all 20 million pairs?

Hey @leokonst,

-train_from modelx.pt

is the right parameter for (periodically) updating your model weights. However, a completely new training set can wash out your old trained weights, so you should train on the combined training data to avoid catastrophic forgetting. By the way, iterative back-translation is a good approach for increasing translation quality.
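For example, a combined-data update could look like the sketch below. The file names and the data/combined prefix are placeholders, and the flags are from OpenNMT-py's legacy onmt_preprocess/onmt_train interface. Note that -train_from assumes the new dataset's vocabulary is compatible with model0.pt, so you may want to reuse the original vocabulary files via onmt_preprocess's -src_vocab and -tgt_vocab options.

# Concatenate the old and new corpora, then preprocess them together:
cat old.src new.src > combined.src
cat old.tgt new.tgt > combined.tgt
onmt_preprocess -train_src combined.src -train_tgt combined.tgt \
    -valid_src valid.src -valid_tgt valid.tgt -save_data data/combined

# Continue training from the existing checkpoint on the combined data:
onmt_train -data data/combined -save_model model1 -train_from model0.pt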

Can anyone guarantee that model1 will be the same as if I had trained from scratch on all 20 million pairs?

The two models will likely produce similar outputs, but you will never get exactly the same model weights, because a from-scratch run starts from new random initial weights and follows a different optimization trajectory.
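To see this concretely, here is a sketch (the -seed option of onmt_train and the value 42 are used only to illustrate that even fixing the random seed does not make the two runs equivalent):

# From-scratch run on the combined data: weights start from a fresh random initialization.
onmt_train -data data/combined -save_model model_scratch -seed 42

# Continued run: weights start from model0.pt, so the updates diverge from the first step on.
onmt_train -data data/combined -save_model model1 -train_from model0.pt -seed 42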

Thank you, Bachstelze, for your reply. You warn me that a new training set can wash out my old trained weights, and that I should train on the combined training data. Do you mean I have to train again on all 20 million pairs?