Incremental training - size of new training data and vocabulary updating

vito.mandorino · May 4, 2017, 9:53am

If I get new training data for a pretrained engine, I understand that one can perform incremental training by passing the pretrained model as the argument to the -train_from option.

In doing so, should the procedure differ depending on whether the new training data consists of just a few parallel segments, 1k, or 100k? If the new data consists of only a few parallel segments, should one mix it with part of the old training data?
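To make the mixing idea concrete, here is a minimal Python sketch of what I have in mind. It is only an illustration, not an OpenNMT feature: the function name `mix_corpora`, the `old_per_new` ratio, and the toy sentence pairs are all hypothetical, and I don't know what ratio would be appropriate in practice.

```python
import random

def mix_corpora(new_pairs, old_pairs, old_per_new=1.0, seed=0):
    """Return the new pairs plus a random sample of old pairs, shuffled.

    old_per_new is a hypothetical knob: how many old segments to sample
    for each new segment (capped at the size of the old corpus).
    """
    rng = random.Random(seed)
    n_old = min(len(old_pairs), int(round(old_per_new * len(new_pairs))))
    sampled_old = rng.sample(old_pairs, n_old)
    mixed = list(new_pairs) + sampled_old
    rng.shuffle(mixed)
    return mixed

# Toy data standing in for (source, target) parallel segments.
new_pairs = [("new src %d" % i, "new tgt %d" % i) for i in range(5)]
old_pairs = [("old src %d" % i, "old tgt %d" % i) for i in range(100)]

mixed = mix_corpora(new_pairs, old_pairs, old_per_new=4.0)
print(len(mixed))  # 5 new + 20 sampled old = 25
```

The mixed list would then be written back out as source and target files and used as the training data for the -train_from run.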

Also, is there a way to add the new words found in the new training data to the vocabulary? The documentation specifies that the vocabularies cannot be changed when training from an existing model; however, as far as I understand, the vocabulary plays a role only in the word embedding matrix and not in the actual topology of the network.
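What I imagine is something like the following Python sketch, which is purely conceptual and does not use any OpenNMT API: existing embedding rows are kept as they are and one freshly initialised row is appended per new word. The helper `extend_vocab`, the `scale` parameter, and the toy vocabulary are my own assumptions. I suppose that on the target side the output layer, whose size also depends on the vocabulary, would need the same kind of extension.

```python
import random

def extend_vocab(vocab, embeddings, new_words, scale=0.1, seed=0):
    """Append a randomly initialised embedding row for each unseen word.

    vocab maps word -> row index; embeddings is a list of rows (lists of
    floats). Existing rows are copied unchanged, so trained vectors survive.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    vocab = dict(vocab)                          # do not mutate the inputs
    embeddings = [list(row) for row in embeddings]
    for word in new_words:
        if word in vocab:
            continue                             # already known, keep its row
        vocab[word] = len(embeddings)            # next free index
        embeddings.append([rng.uniform(-scale, scale) for _ in range(dim)])
    return vocab, embeddings

# Toy pretrained vocabulary and 2-dimensional embeddings.
old_vocab = {"<unk>": 0, "the": 1, "cat": 2}
old_emb = [[0.0, 0.0], [0.5, -0.2], [0.1, 0.3]]

vocab, emb = extend_vocab(old_vocab, old_emb, ["cat", "dog", "bird"])
print(vocab["dog"], vocab["bird"], len(emb))
```

The new rows would start untrained, so they would only become useful after further training on data that contains the new words.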

cservan · May 4, 2017, 2:40pm

Hello Vito,
please take a look at this thread: Problem with incremental/in-domain training
I think it would answer your questions.
If not, feel free to ask again.

Cheers

vito.mandorino · May 4, 2017, 4:40pm

Thank you Christophe. I might be wrong, but I haven’t found specific remarks in that thread about the size of the additional training corpus. For updating the vocabulary, the BPE or Morfessor approach could indeed be a solution, and it is worth experimenting with, as suggested in the thread. I think the question of whether updating the vocabulary is feasible would still be an interesting one when word segmentation is not used.