I’m trying to do incremental learning with the PyTorch version of OpenNMT for English-to-Chinese translation. Although many questions have been asked on this topic, most of them concern the Torch version or provide solutions based on it.
Here is some basic info:
Task: English to Chinese
OpenNMT version: PyTorch (OpenNMT-py)
Architecture: Transformer
Generic data size: 10 million pairs
In-domain data size: 4k pairs
Granularity: Subword units with BPE on both sides
Vocab size: 45571 for En, 32232 for Ch.
Here are the three scenarios for incremental learning:
Retraining a pre-trained model on NEW data with the SAME training options, for in-domain adaptation.
Retraining a pre-trained model on NEW data with DIFFERENT training options, for in-domain adaptation.
Continuing a stopped or completed training run on the SAME data with the SAME training options, for more epochs.
To my understanding, the Torch version of OpenNMT provides full retraining support through three options: -train_from, -continue, and -update_vocab, which address the model parameters, the training hyper-parameters, and the model vocabulary, respectively. With those, I could handle each scenario as follows (an example command for scenario 1 is sketched after the list):
-train_from + -continue + -update_vocab merge
-train_from + -update_vocab merge
-train_from + -continue
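For instance, as I understand the Lua/Torch documentation, a scenario 1 command would look roughly like this (all file names are placeholders; -update_vocab merge should combine the old vocabulary with the tokens found in the new data):

    th train.lua -data new-train.t7 -save_model adapted_model \
        -train_from generic_model_epoch13.t7 -continue -update_vocab merge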
However, in the PyTorch version of OpenNMT there seems to be only the -train_from option. So how can I implement incremental learning in these three scenarios in OpenNMT-py?
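For reference, the closest I can get in OpenNMT-py is roughly the following (paths and step counts are made up; depending on the release, the training length is set with -train_steps or -epochs):

    python train.py -data data/new -save_model models/adapted \
        -train_from models/generic_step_200000.pt -train_steps 210000

This covers reloading the model parameters, but I don’t see equivalents of -continue or -update_vocab.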
Thanks for replying @guillaumekln.
Since a model’s topology and vocabularies can’t be changed, what is the best practice for preprocessing new data in OpenNMT-py when incorporating it into an existing model? Should I use the old vocabulary (old.vocab.pt) with the -share_vocab option on the new data to generate new.training.pt and new.validation.pt files, along the lines of the sketch below?
P.S. All data was pre-processed with the same BPE model.
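In other words, my current guess is a preprocessing command like this (file names are placeholders, and I’m not sure this combination of -src_vocab and -share_vocab is correct):

    python preprocess.py \
        -train_src new.train.en.bpe -train_tgt new.train.zh.bpe \
        -valid_src new.valid.en.bpe -valid_tgt new.valid.zh.bpe \
        -src_vocab old.vocab.pt -share_vocab \
        -save_data new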
Does anyone have ideas on how to preprocess the data when incorporating new data into existing models in OpenNMT-py? I would appreciate it if you could share any comments, thoughts, or experience on this topic.
Thank you for your reply. I’m a bit confused: since preprocessing in OpenNMT-py generates only one vocab.pt file, does that mean -src_vocab and -tgt_vocab can both be pointed at that same file?
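That is, would something like the following be the intended usage (paths are placeholders)?

    python preprocess.py \
        -train_src new.train.en.bpe -train_tgt new.train.zh.bpe \
        -valid_src new.valid.en.bpe -valid_tgt new.valid.zh.bpe \
        -src_vocab old.vocab.pt -tgt_vocab old.vocab.pt \
        -save_data new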