OpenNMT Forum

Incremental learning (in-domain adaptation, retraining) in the PyTorch version of OpenNMT


(Jeff Wang) #1

I’m trying to do incremental learning with the PyTorch version of OpenNMT for English-to-Chinese translation. Many questions have already been asked on this topic, but most of them concern the Torch version, or their answers are based on it.

Here is some basic info:

  • Task: English to Chinese
  • OpenNMT version: PyTorch
  • Architecture: Transformer
  • Generic data size: 10 million pairs
  • In-domain data size: 4k pairs
  • Granularity: Subword units with BPE on both sides
  • Vocab size: 45571 for En, 32232 for Ch.

Here are the three scenarios for incremental learning:

  1. Retraining a pre-trained model on NEW data with SAME training options for in-domain adaptation.
  2. Retraining a pre-trained model on NEW data with DIFFERENT training options for in-domain adaptation.
  3. Continuing a stopped or complete training on SAME data with SAME training options for more epochs.

To my understanding, the Torch version of OpenNMT provides a full set of retraining options: -train_from, -continue, and -update_vocab, which address the model parameters, the training hyper-parameters, and the model vocabulary, respectively. With them, I can handle each scenario as follows:

  1. -train_from + -continue + -update_vocab merge
  2. -train_from + -update_vocab merge
  3. -train_from + -continue

However, the PyTorch version of OpenNMT seems to have only the -train_from option. So how can I implement incremental learning for these three scenarios in OpenNMT-py?

Thank you,


(Guillaume Klein) #2


OpenNMT-py does a “continue” by default. To change some learning parameters, see the option -reset_optim.

Updating the vocabulary is not supported though; you’ll have to do without.
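
As a rough sketch of how the three scenarios might map onto these options, the helper below assembles a train.py argument list. Only -train_from and -reset_optim come from this thread; the -data and -save_model flags, the value passed to -reset_optim, and all file names are assumptions to check against the output of train.py -h for your OpenNMT-py version.

```python
def build_train_args(data_prefix, checkpoint, save_model, reset_optim=None):
    """Assemble a hypothetical OpenNMT-py train.py command as a list.

    -train_from and -reset_optim are the options named in this thread.
    -data, -save_model and every file name here are illustrative assumptions.
    """
    args = [
        "python", "train.py",
        "-data", data_prefix,
        "-save_model", save_model,
        "-train_from", checkpoint,
    ]
    if reset_optim is not None:
        # Only added when the learning parameters should change.
        args += ["-reset_optim", reset_optim]
    return args


# Scenarios 1 and 3: same training options, so -train_from alone
# (OpenNMT-py continues by default).
same_options = build_train_args("data/indomain", "generic_step_N.pt", "indomain")

# Scenario 2: different training options. The value "all" is an assumption;
# the changed hyper-parameters would be appended to this list as well.
new_options = build_train_args(
    "data/indomain", "generic_step_N.pt", "indomain", reset_optim="all"
)

print(" ".join(same_options))
print(" ".join(new_options))
```

The list form can be passed to subprocess.run as-is, which avoids shell quoting problems with paths.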

(Jeff Wang) #3

Thanks for replying @guillaumekln.
Since a model’s topology and vocabularies can’t be changed, what is the best practice for preprocessing new data in OpenNMT-py when incorporating it into an existing model? Should I use the old vocabulary (with the -share_vocab option) when generating the files for the new data?
P.S. All data was preprocessed with the same BPE model.
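
Since the vocabulary cannot be extended, one quick sanity check is to measure how many BPE tokens in the new in-domain data fall outside the existing vocabulary. The following is a minimal sketch, assuming the vocabulary is available as a set of token strings and the data is already BPE-segmented and whitespace-tokenized; the function name and the sample tokens are hypothetical.

```python
def oov_rate(vocab, lines):
    """Return the fraction of tokens in `lines` that are missing from `vocab`.

    vocab: set of known token strings (e.g. BPE subwords of the old model).
    lines: iterable of whitespace-tokenized sentences from the new data.
    """
    total = 0
    unknown = 0
    for line in lines:
        for token in line.split():
            total += 1
            if token not in vocab:
                unknown += 1
    # Avoid a division by zero on empty input.
    return unknown / total if total else 0.0


# Toy example with made-up tokens: one of four tokens is unknown.
old_vocab = {"the", "cat", "s@@"}
new_data = ["the cat s@@ at"]
print(oov_rate(old_vocab, new_data))
```

When both corpora were segmented with the same BPE model, this rate would be expected to stay low, since unseen words are split into known subword units.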

(Jeff Wang) #4

Does anyone have ideas on how to preprocess the data when incorporating new data into existing models in OpenNMT-py? I would appreciate it if you could share any comments, thoughts, or experience on this topic.

(Guillaume Klein) #5

Did you try the -src_vocab and -tgt_vocab options to reuse existing vocabularies when preprocessing new data?

(Jeff Wang) #6

Thank you for your reply. I’m a bit confused: since preprocessing in OpenNMT-py generates only one file, does that mean -src_vocab and -tgt_vocab can both be set to that same file?

(Guillaume Klein) #7

@vince62s What is the recommended way to reuse existing vocabularies?

(Vincent Nguyen) #8

I have not tested it, but IIRC you should preprocess the new data with -src_vocab and -share_vocab.
Let me know if this works; if not, we will fix it.

(Jeff Wang) #9

Hi Vincent! What about -tgt_vocab? Should I set it to the same file and then add -share_vocab?

(Vincent Nguyen) #10

In theory you should not need -tgt_vocab; it will be deduced from -share_vocab.

(Jeff Wang) #11

Thanks! I will try what you recommended and let you know.


@Jeff - Hi, did you try out what @vince62s recommended? Did it work?

(Vincent Nguyen) #13

I looked at the code again and my suggestion was wrong.

Currently it only supports -src_vocab and -tgt_vocab files that are text files, with one word per line. We need to add the option to also load .pt files.
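
For reference, a vocabulary file in that plain-text format (one token per line) could be produced along these lines. This is only a sketch: the function names are hypothetical, the frequency-based ordering is an assumption, and whether the resulting token order matches the indices of an existing checkpoint is not something this thread settles.

```python
from collections import Counter


def build_vocab(lines, max_size=None):
    """Collect tokens from whitespace-tokenized lines, most frequent first."""
    counts = Counter(token for line in lines for token in line.split())
    # Sort by descending frequency, then alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [token for token, _ in ordered[:max_size]]


def vocab_to_text(tokens):
    """Format tokens as text with one token per line."""
    return "".join(token + "\n" for token in tokens)


def write_vocab(tokens, path):
    """Write the vocabulary to `path`, one token per line, as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(vocab_to_text(tokens))


# Toy example with made-up data.
sample_vocab = build_vocab(["a b a", "c a b"])
print(vocab_to_text(sample_vocab), end="")
```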