I trained a model with a given vocabulary. Now I want to fine-tune this model on sentences that contain new vocabulary.
The output of my training run is the following:
[06/05/18 14:36:03 INFO] Using GPU(s): 1
[06/05/18 14:36:03 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0
[06/05/18 14:36:03 INFO] Training Sequence to Sequence with Attention model...
[06/05/18 14:36:03 INFO] Loading data from '/home/German/infreq-exp/it-domain/fine-tuning/bleu100/preprocessed-datasets/preprocessed-train.t7'...
[06/05/18 14:36:03 INFO] * vocabulary size: source = 9412; target = 10017
[06/05/18 14:36:03 INFO] * additional features: source = 0; target = 0
[06/05/18 14:36:03 INFO] * maximum sequence length: source = 50; target = 51
[06/05/18 14:36:03 INFO] * number of training sentences: 11352
[06/05/18 14:36:03 INFO] * number of batches: 591
[06/05/18 14:36:03 INFO] - source sequence lengths: equal
[06/05/18 14:36:03 INFO] - maximum size: 20
[06/05/18 14:36:03 INFO] - average size: 19.21
[06/05/18 14:36:03 INFO] - capacity: 100.00%
[06/05/18 14:36:03 INFO] Loading checkpoint '/home/German/infreq-exp/it-domain/fine-tuning/bleu100/models/_epoch7_19.78.t7'...
[06/05/18 14:36:05 WARNING] Cannot change dynamically option -tgt_word_vec_size. Ignoring.
[06/05/18 14:36:05 WARNING] Cannot change dynamically option -src_word_vec_size. Ignoring.
[06/05/18 14:36:05 INFO] Resuming training from epoch 8 at iteration 1...
[06/05/18 14:36:05 INFO] * new source dictionary size: 9412
[06/05/18 14:36:05 INFO] * new target dictionary size: 10017
[06/05/18 14:36:05 INFO] * old source dictionary size: 26403
[06/05/18 14:36:05 INFO] * old target dictionary size: 26989
[06/05/18 14:36:05 INFO] * Merging new / old dictionaries...
[06/05/18 14:36:05 INFO] Updating the state by the vocabularies of the new train-set...
[06/05/18 14:36:06 INFO] * Updated source dictionary size: 26826
[06/05/18 14:36:06 INFO] * Updated target dictionary size: 27366
[06/05/18 14:36:08 INFO] Preparing memory optimization...
[06/05/18 14:36:08 INFO] * sharing 66% of output/gradInput tensors memory between clones
[06/05/18 14:36:08 INFO] Restoring random number generator states...
[06/05/18 14:36:08 INFO] Start training from epoch 8 to 47...
[06/05/18 14:36:08 INFO]
/home/torch/install/bin/luajit: ./onmt/train/Optim.lua:277: bad argument #2 to 'add' (sizes do not match at /home/torch/extra/cutorch/lib/THC/generated/.../generic/THCTensorMathPointwise.cu:217)
stack traceback:
[C]: in function 'add'
./onmt/train/Optim.lua:277: in function 'adamStep'
./onmt/train/Optim.lua:147: in function 'prepareGrad'
./onmt/train/Trainer.lua:272: in function 'trainEpoch'
./onmt/train/Trainer.lua:439: in function 'train'
/home/torch/OpenNMT/train.lua:337: in function 'main'
/home/torch/OpenNMT/train.lua:342: in main chunk
[C]: in function 'dofile'
/home/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
You probably don't want to use -continue in this case. If you set it, the previous optimization states will be restored, but your new model no longer has the same number of parameters, which is why the add in adamStep fails with a size mismatch.
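For what it is worth, a retraining command along these lines keeps -train_from and the dictionary merge but drops -continue. The paths are taken from the log above, the -save_model path is a placeholder, -update_vocab merge is my assumption based on the "Merging new / old dictionaries" line, and exact option names may vary between OpenNMT-lua versions:

th train.lua -gpuid 1 \
    -data /home/German/infreq-exp/it-domain/fine-tuning/bleu100/preprocessed-datasets/preprocessed-train.t7 \
    -train_from /home/German/infreq-exp/it-domain/fine-tuning/bleu100/models/_epoch7_19.78.t7 \
    -update_vocab merge \
    -save_model /home/German/infreq-exp/it-domain/fine-tuning/bleu100/models/finetuned

Without -continue, a fresh optimizer state is created for the resized model, which avoids the size mismatch in adamStep.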
Hi Sasanita, did you use the new corpus for pre-processing to get the train.t7 file? How big is it? I am asking because I have a scenario with just a couple of thousand new sentences, and I am trying to run incremental training, but I am thinking of just using the generic data for training and the new corpus as validation data (pre-process and run for a couple of epochs). I am not sure it is a valid approach.
@guillaumekln, can you please share your thoughts about that? It is of interest since I could not find a way to 'tune' an ONMT generic model with a tiny/small amount of new data in an appropriate way.
You can take advantage of your new corpus if you mix it with some of your generic corpus. Ideally, you should mix it with related generic data: if your new corpus uses different terminology from your generic one, try not to include the portions of the generic corpus that contain the conflicting terminology.
Training should be much faster than training from scratch, so you can try various combinations of start and end epochs and see what works best.
I see your point, and it makes sense. However, I am not sure where the new corpus should go: training data, validation data, or both? When I have a new custom corpus of about 2000 sentences, I do not think I can afford to sacrifice 1K of them to a validation set; that is 50%, which is a lot.
Only use your new data for training. If you have some related sentences from your generic corpus that make sense to be used for validation, extract your validation set from there.
Otherwise, do not use a validation set at all and keep an eye on the perplexity during training.
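As a rough sketch with placeholder file names (and assuming a GNU shell and that your preprocess.lua version accepts these options), you can carve an aligned validation set out of the related generic data and preprocess only the new in-domain sentences as training data:

paste generic-related.src generic-related.tgt | shuf -n 1000 > valid.tsv
cut -f1 valid.tsv > valid.src
cut -f2 valid.tsv > valid.tgt
th preprocess.lua -train_src new-domain.src -train_tgt new-domain.tgt \
    -valid_src valid.src -valid_tgt valid.tgt -save_data new-domain

The paste/cut round trip keeps source and target lines paired while sampling. If no related generic data is available, leave the validation files out (where your version allows it) and just watch the training perplexity.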
I have been doing some retraining runs with a small number of sentence pairs and without a validation set. The perplexity has been below 3.00 from epoch 1. I am training for 26 epochs, and this enables the network to learn the new in-domain terminology, so that I get, for example, 'waste water drainage' instead of 'dirty water drainage'. With a few handy scripts, going from a memoQ project TMX to a specialised model can take around 90 minutes.
I've been doing some further retraining or 'specialization' of a baseline model in the past week based on a 6K sentence-pair project translation memory in the field of electrical engineering. I ran the training for a further 17 epochs (after the original default of 13) on the new data without a validation set. The process took around two hours from the time I got the project TM until the time I tested the new 'specialized' model. The translator has reported that all the technical terms are now being translated correctly.
So, you trained for a total of 30 epochs using a mix of your previous corpus with the 6K sentences and a merged vocab? If so, how many sentences did your new mixed corpus have? Isn't 30 epochs too many, or were you checking until you got the results you wanted (perplexity-wise or terminology-learned-wise)?
Hi Panosk,
No, I didn't do it that way. I had already trained a baseline model with a corpus of approx. 11M sentences for 13 epochs. I preprocessed the 'new' data in the usual way (I'm still using Lua) to create a training package (*.t7). I trained the existing model with the new training package and the -train_from option for 17 epochs. 17 epochs may be too many, but that was just a guesstimate on my part. This wasn't a research project but a practical response to a translator who was getting too many OOVs. I tested and found the 'new terminology' was being used in the translation and quickly made the new model available to the translator. Time was of the essence as the translator was facing a deadline on a huge project. Of course, in ideal circumstances I should experiment with proportions of 'old' and 'new'.
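Roughly, a command sequence like the following reproduces that workflow. The file names are placeholders, -update_vocab merge is my assumption for how the new terminology enters the dictionaries, the -valid_src/-valid_tgt files are just a small held-out slice (I did not really rely on a validation set), and the exact option spellings should be checked against your OpenNMT-lua version:

th preprocess.lua -train_src project-tm.src -train_tgt project-tm.tgt \
    -valid_src valid.src -valid_tgt valid.tgt -save_data project-tm
th train.lua -gpuid 1 -data project-tm-train.t7 \
    -train_from baseline-model_epoch13_X.XX.t7 \
    -update_vocab merge -end_epoch 17 -save_model specialized

Since -continue is not set, training on the new package starts again from epoch 1 and runs to -end_epoch 17, i.e. 17 more epochs on top of the existing weights.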