How to fine-tune a model using domain data?


(Patrick Liu) #1

I have trained a model using parallel corpora, and I have demo.train.pt, demo.valid.pt, demo.vocab.pt and an NMT model.
Now I want to fine-tune it on a domain-specific parallel corpus. I know I can use '-train_from', but I don't know the details.
For example, should I preprocess the domain-specific parallel corpus to get train.pt/valid.pt/vocab.pt, and then train from the old model using the new train.pt/valid.pt together with the old vocab.pt?
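Concretely, something like this is what I have in mind (the commands follow the OpenNMT-py quickstart as I understand it; the in-domain file names are placeholders, and reusing the original vocabulary through -src_vocab/-tgt_vocab is just my guess, so please correct me):

    # 1) Preprocess the in-domain parallel corpus. The vocabulary has to stay
    #    compatible with the one the old model was trained on, so I would pass
    #    the original vocabulary files here (my assumption, may need checking).
    python preprocess.py \
        -train_src domain-train.src -train_tgt domain-train.tgt \
        -valid_src domain-valid.src -valid_tgt domain-valid.tgt \
        -src_vocab demo.src.vocab -tgt_vocab demo.tgt.vocab \
        -save_data data/domain

    # 2) Continue training from the existing checkpoint with -train_from.
    python train.py -data data/domain -save_model domain-model \
        -train_from NMT-model.pt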


(Terence Lewis) #2

Hi Patrick,
This is described well in http://opennmt.net/OpenNMT/training/retraining/.
You can now use -update_vocab to merge the vocabulary from the “old model” with the new vocabulary. I’m actually doing this myself this morning :)
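For what it's worth, the command I'm running is roughly along these lines (model and data names are placeholders from my own setup, so adapt them to yours):

    # Continue training the generic model on the in-domain data,
    # merging the old model's vocabulary with the new one.
    th train.lua -data data/domain-train.t7 \
        -save_model domain-model \
        -train_from generic-model.t7 -update_vocab merge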


(Patrick Liu) #3

Sorry, I'm using OpenNMT-py (onmt.py)…


(Wiktor Stribiżew) #4

Terence, I am not sure I understand this completely. If we use '-update_vocab merge' with train.lua while tokenizing the new corpus with a BPE model built from that new corpus, which BPE model should we then use to translate new files with the incrementally trained model? Should we somehow merge the BPE model used for the generic model with the BPE model based on the new corpus?


(Terence Lewis) #5

Wiktor,
The logic behind your question is correct, but I did not use BPE for these models, so that issue did not arise. I did, however, make sure the tokenization options were identical for both vocabularies. Perhaps somebody who has done this with BPE models could answer this categorically.
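For example, I made sure to run exactly the same tokenization command over both the generic and the in-domain data, something along these lines (my own option choices, not a recommendation):

    # Identical tokenization options for the generic and the domain corpora.
    th tools/tokenize.lua -mode aggressive -joiner_annotate true -case_feature true \
        < domain-train.raw.en > domain-train.tok.en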
Terence