Incremental learning (in-domain adaptation, retraining) in the PyTorch version of OpenNMT

I’m trying to do incremental learning with the PyTorch version of OpenNMT for English-to-Chinese translation. Although many questions have been asked on this topic, most of them concern the Torch (Lua) version or provide solutions based on it.

Here is some basic info:

  • Task: English to Chinese
  • OpenNMT version: PyTorch (OpenNMT-py)
  • Architecture: Transformer
  • Generic data size: 10 million pairs
  • In-domain data size: 4k pairs
  • Granularity: Subword units with BPE on both sides
  • Vocab size: 45,571 for English, 32,232 for Chinese

Here are the three scenarios for incremental learning:

  1. Retraining a pre-trained model on NEW data with SAME training options for in-domain adaptation.
  2. Retraining a pre-trained model on NEW data with DIFFERENT training options for in-domain adaptation.
  3. Continuing a stopped or completed training on the SAME data with the SAME training options for more epochs.

To my understanding, the Torch version of OpenNMT provides full retraining options: -train_from, -continue, and -update_vocab, which address the model parameters, the training hyper-parameters, and the model vocabulary, respectively. With them, I can cover each scenario (see the command sketch after this list) by using:

  1. -train_from + -continue + -update_vocab merge
  2. -train_from + -update_vocab merge
  3. -train_from + -continue
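
For example, scenario 1 with the Torch version would look roughly like the following sketch; the data and checkpoint file names are placeholders, not from an actual setup:

# retrain a generic checkpoint on new in-domain data, keeping the same options
# and merging the new vocabulary into the old one
th train.lua -data indomain-train.t7 -save_model indomain-model \
    -train_from generic-model.t7 -continue -update_vocab merge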

However, in the PyTorch version of OpenNMT there seems to be only the -train_from option. So how can I implement incremental learning for these three scenarios in OpenNMT-py?

Thank you,

Jeff

Hi,

OpenNMT-py does a “continue” by default. To change some learning parameters, see the option -reset_optim.
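
For example, a minimal sketch of the two cases; the checkpoint names, data prefixes, and step numbers below are placeholders, not real paths:

# Scenario 3: continue a stopped/completed training as-is (resuming is the default)
python train.py -data data/generic -save_model generic-model \
    -train_from generic-model_step_200000.pt

# Scenarios 1 and 2: retrain on new in-domain data and/or with new options;
# -reset_optim discards the saved optimizer state so the new options take effect
python train.py -data data/indomain -save_model indomain-model \
    -train_from generic-model_step_200000.pt -reset_optim all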

The vocabulary update is not supported though, you’ll have to do without.

Thanks for replying @guillaumekln.
Since a model’s topology and vocabularies can’t be changed, what is the best practice in OpenNMT-py for preprocessing new data when incorporating it into an existing model? Should I use the old vocabulary (old.vocab.pt) with the -share_vocab option on the new data to generate the new.training.pt and new.validation.pt files?
P.S. All data was pre-processed with the same BPE model.

Does anyone have ideas on how to preprocess the data when incorporating new data into existing models in OpenNMT-py? I would appreciate it if you could share any comments, thoughts, or experience on this topic.

Did you try the -src_vocab and -tgt_vocab options to reuse existing vocabularies when preprocessing new data?

Thank you for your reply. I’m a bit confused - since preprocessing in OpenNMT-py generates only one vocab.pt file, does that mean -src_vocab and -tgt_vocab can both be pointed to that same file?

@vince62s What is the recommended way to reuse existing vocabularies?

I have not tested it, but IIRC

run preprocess.py on the new data with -src_vocab oldvocab.pt -share_vocab
Let me know if this works; if not, we will fix it.

Hi Vincent! What about -tgt_vocab? The same oldvocab.pt and then -share_vocab?

In theory you should not need -tgt_vocab; it will be deduced from -share_vocab.

Thanks! I will try what you recommended and let you know.

@Jeff - Hi, did you try out what @vince62s recommended? Did it work?

I looked at the code again and my suggestion was wrong.

Currently it only supports -src_vocab and -tgt_vocab files that are plain-text files, one token per line. We need to add the option to also load .pt files.
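
For reference, once the old vocabularies are exported as plain-text files (one token per line), preprocessing the new in-domain data could look roughly like this; all file names below are placeholders:

# build new training shards while reusing the existing (text-format) vocabularies
python preprocess.py \
    -train_src indomain.train.en.bpe -train_tgt indomain.train.zh.bpe \
    -valid_src indomain.valid.en.bpe -valid_tgt indomain.valid.zh.bpe \
    -src_vocab old.src.vocab.txt -tgt_vocab old.tgt.vocab.txt \
    -save_data data/indomain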