How to fine-tune a model using domain data?


(Patrick Liu) #1

I have trained a model using parallel corpora, and I have demo.train.pt, demo.valid.pt, demo.vocab.pt and an NMT model.
Now I want to fine-tune it on a domain-specific parallel corpus. I know I can use '-train_from', but I don't know the details.
For example, should I preprocess the domain-specific parallel corpus to get train.pt/valid.pt/vocab.pt, and then train from the old model using the new train.pt/valid.pt together with the old vocab.pt?
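Concretely, something like this is what I have in mind (the commands follow the OpenNMT-py quickstart as I understand it; the in-domain file names are placeholders, and reusing the original vocabulary through -src_vocab/-tgt_vocab is just my guess, so please correct me):

    # 1) Preprocess the in-domain parallel corpus. The vocabulary has to stay
    #    compatible with the one the old model was trained on, so I would pass
    #    the original vocabulary files here (my assumption, may need checking).
    python preprocess.py \
        -train_src domain-train.src -train_tgt domain-train.tgt \
        -valid_src domain-valid.src -valid_tgt domain-valid.tgt \
        -src_vocab demo.src.vocab -tgt_vocab demo.tgt.vocab \
        -save_data data/domain

    # 2) Continue training from the existing checkpoint with -train_from.
    python train.py -data data/domain -save_model domain-model \
        -train_from NMT-model.pt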


(Terence Lewis) #2

Hi Patrick,
This is described well in http://opennmt.net/OpenNMT/training/retraining/.
You can now use -update_vocab to merge the vocabulary from the “old model” with the new vocabulary. I’m actually doing this myself this morning :)
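For what it's worth, the command I'm running is roughly along these lines (model and data names are placeholders from my own setup, so adapt them to yours):

    # Continue training the generic model on the in-domain data,
    # merging the old model's vocabulary with the new one.
    th train.lua -data data/domain-train.t7 \
        -save_model domain-model \
        -train_from generic-model.t7 -update_vocab merge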


(Patrick Liu) #3

Sorry, I'm using OpenNMT-py (onmt.py)…


(Wiktor Stribiżew) #4

Terence, I am not sure I understand this completely. If we use '-update_vocab merge' with train.lua while tokenizing the new corpus with a BPE model built from that new corpus, which BPE model should we then use to translate new files with the incrementally trained model? Should we somehow merge the BPE model used for the generic model with the BPE model based on the new corpus?


(Terence Lewis) #5

Wiktor,
The logic behind your question is correct, but I did not use BPE for these models, so that issue did not arise. I did, however, make sure the tokenization options were identical for both vocabularies. Perhaps somebody who has done this with BPE models could answer this categorically.
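For example, I made sure to run exactly the same tokenization command over both the generic and the in-domain data, something along these lines (my own option choices, not a recommendation):

    # Identical tokenization options for the generic and the domain corpora.
    th tools/tokenize.lua -mode aggressive -joiner_annotate true -case_feature true \
        < domain-train.raw.en > domain-train.tok.en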
Terence