I have trained a model using parallel corpora, and I have demo.train.pt, demo.valid.pt, demo.vocab.pt, and an NMT model.
Now I want to fine-tune it with domain-specific parallel corpora. I know I can use `-train_from`, but I don't know the details.
For example, should I preprocess the domain-specific corpora to get train.pt/valid.pt/vocab.pt, and then use train.pt/valid.pt/old-vocab.pt to train from the old model?
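To make the question concrete, this is the kind of command sequence I have in mind (just a sketch using OpenNMT-py's preprocess.py/train.py; the domain-* file names are placeholders, and I am not sure whether the old vocabulary has to be reused):

```bash
# Preprocess the in-domain corpus (placeholder file names).
python preprocess.py \
    -train_src domain-src-train.txt -train_tgt domain-tgt-train.txt \
    -valid_src domain-src-val.txt   -valid_tgt domain-tgt-val.txt \
    -save_data data/domain

# Continue training from the previously trained generic model.
python train.py -data data/domain -save_model domain-model \
    -train_from NMT-model.pt
```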
Hi Patrick,
This is described well in http://opennmt.net/OpenNMT/training/retraining/.
You can now use `-update_vocab` to merge the vocabulary from the "old model" with the new vocabulary. I'm actually doing this myself this morning.
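In my case the command looks roughly like this (a sketch only; the data and model paths are placeholders and the other training options are omitted):

```bash
# Retrain from the generic model while merging its vocabulary with the
# vocabulary of the newly preprocessed in-domain data (OpenNMT Lua).
th train.lua -data domain-train.t7 -save_model domain-model \
    -train_from generic-model.t7 -update_vocab merge
```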
Sorry, I use OpenNMT-py…
Terence, I am not sure I understand it completely, but if we use `-update_vocab merge` with train.lua while tokenizing the new corpus with a BPE model created from that new corpus, which BPE model should we use later to translate new files with the incremented model? Should we somehow merge the BPE model (for the generic model) and the BPE model based on the new corpus?
Wiktor,
The logic behind your question is correct, but I did not use BPE for these models, so that issue did not arise. I did, however, make sure the tokenization options were identical for both vocabs. Perhaps somebody who has done this with BPE models can answer this categorically.
Terence
Just to complete this subthread: I got an answer on the Gitter channel; the original BPE model should be used.
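In practice that means applying the codes learned for the generic model to the in-domain data before preprocessing, for example with subword-nmt (a sketch; bpe.codes and the file names are placeholders):

```bash
# Re-use the original BPE codes (learned for the generic model) on the
# in-domain corpus; do not learn new codes from the in-domain data alone.
subword-nmt apply-bpe -c bpe.codes < domain-train.src > domain-train.bpe.src
subword-nmt apply-bpe -c bpe.codes < domain-train.tgt > domain-train.bpe.tgt
```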
Thanks for following this up. I'm currently working through the same challenges with SentencePiece and TensorFlow. :-)