I’m trying to do incremental learning with the PyTorch version of OpenNMT for English-to-Chinese translation. Although many questions have been asked on this topic, most of them concern the Torch version or provide solutions based on it.
Here is some basic info:
Task: English to Chinese
OpenNMT version: PyTorch (OpenNMT-py)
Architecture: Transformer
Generic data size: 10 million pairs
In-domain data size: 4k pairs
Granularity: Subword units with BPE on both sides
Vocab size: 45571 for En, 32232 for Ch.
Here are the three scenarios for incremental learning:
Retraining a pre-trained model on NEW data with the SAME training options, for in-domain adaptation.
Retraining a pre-trained model on NEW data with DIFFERENT training options, for in-domain adaptation.
Continuing a stopped or completed training run on the SAME data with the SAME training options, for more epochs.
To my understanding, the Torch version of OpenNMT provides full retraining support through three options: -train_from, -continue, and -update_vocab, which address the model parameters, the training hyper-parameters, and the model vocabulary, respectively. With those, I could handle each scenario as follows (an example command for scenario 1 is sketched after the list):
-train_from + -continue + -update_vocab merge
-train_from + -update_vocab merge
-train_from + -continue
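For instance, as I understand the Lua/Torch documentation, a scenario 1 command would look roughly like this (all file names are placeholders; -update_vocab merge should combine the old vocabulary with the tokens found in the new data):

    th train.lua -data new-train.t7 -save_model adapted_model \
        -train_from generic_model_epoch13.t7 -continue -update_vocab merge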
However, in the PyTorch version of OpenNMT there seems to be only the -train_from option. So how can I implement incremental learning in these three scenarios in OpenNMT-py?
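For reference, the closest I can get in OpenNMT-py is roughly the following (paths and step counts are made up; depending on the release, the training length is set with -train_steps or -epochs):

    python train.py -data data/new -save_model models/adapted \
        -train_from models/generic_step_200000.pt -train_steps 210000

This covers reloading the model parameters, but I don’t see equivalents of -continue or -update_vocab.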
Thanks for replying @guillaumekln.
Since a model’s topology and vocabularies can’t be changed, what is the best practice for preprocessing new data in OpenNMT-py when incorporating it into an existing model? Should I use the old vocabulary (old.vocab.pt) with the -share_vocab option on the new data to generate new.training.pt and new.validation.pt files, along the lines of the sketch below?
P.S. All data was pre-processed with the same BPE model.
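In other words, my current guess is a preprocessing command like this (file names are placeholders, and I’m not sure this combination of -src_vocab and -share_vocab is correct):

    python preprocess.py \
        -train_src new.train.en.bpe -train_tgt new.train.zh.bpe \
        -valid_src new.valid.en.bpe -valid_tgt new.valid.zh.bpe \
        -src_vocab old.vocab.pt -share_vocab \
        -save_data new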
Does anyone have ideas on how to preprocess the data when incorporating new data into existing models in OpenNMT-py? I would appreciate it if you could share any comments, thoughts, or experience on this topic.
Thank you for your reply. I’m a bit confused: since preprocessing in OpenNMT-py generates only one vocab.pt file, does that mean -src_vocab and -tgt_vocab can both be pointed at that same file?
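That is, would something like the following be the intended usage (paths are placeholders)?

    python preprocess.py \
        -train_src new.train.en.bpe -train_tgt new.train.zh.bpe \
        -valid_src new.valid.en.bpe -valid_tgt new.valid.zh.bpe \
        -src_vocab old.vocab.pt -tgt_vocab old.vocab.pt \
        -save_data new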