Retraining with new vocabulary: fine-tuning

I trained a model with a given vocabulary. Now I want to fine-tune this model on sentences containing new vocabulary.
My training command is the following:

th /home/torch/OpenNMT/train.lua -data ${path_}/preprocessed-datasets/preprocessed-train.t7 \
-train_from ${path_}/models/_epoch7_19.78.t7 \
-continue \
-rnn_size 512 \
-encoder_type rnn \
-rnn_type LSTM \
-end_epoch 47 \
-max_batch_size 20 \
-save_model ${path_}/models/ \
-layers 1 \
-dropout 0.2 \
-update_vocab merge \
-optim adam \
-learning_rate 0.0002 \
-learning_rate_decay 1.0 \
-src_word_vec_size 512  \
-tgt_word_vec_size 512  \
-gpuid 1

I get this error:

[06/05/18 14:36:03 INFO] Using GPU(s): 1
[06/05/18 14:36:03 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0
[06/05/18 14:36:03 INFO] Training Sequence to Sequence with Attention model…
[06/05/18 14:36:03 INFO] Loading data from '/home/German/infreq-exp/it-domain/fine-tuning/bleu100/preprocessed-datasets/preprocessed-train.t7'…
[06/05/18 14:36:03 INFO] * vocabulary size: source = 9412; target = 10017
[06/05/18 14:36:03 INFO] * additional features: source = 0; target = 0
[06/05/18 14:36:03 INFO] * maximum sequence length: source = 50; target = 51
[06/05/18 14:36:03 INFO] * number of training sentences: 11352
[06/05/18 14:36:03 INFO] * number of batches: 591
[06/05/18 14:36:03 INFO] - source sequence lengths: equal
[06/05/18 14:36:03 INFO] - maximum size: 20
[06/05/18 14:36:03 INFO] - average size: 19.21
[06/05/18 14:36:03 INFO] - capacity: 100.00%
[06/05/18 14:36:03 INFO] Loading checkpoint '/home/German/infreq-exp/it-domain/fine-tuning/bleu100/models/_epoch7_19.78.t7'…
[06/05/18 14:36:05 WARNING] Cannot change dynamically option -tgt_word_vec_size. Ignoring.
[06/05/18 14:36:05 WARNING] Cannot change dynamically option -src_word_vec_size. Ignoring.
[06/05/18 14:36:05 INFO] Resuming training from epoch 8 at iteration 1…
[06/05/18 14:36:05 INFO] * new source dictionary size: 9412
[06/05/18 14:36:05 INFO] * new target dictionary size: 10017
[06/05/18 14:36:05 INFO] * old source dictionary size: 26403
[06/05/18 14:36:05 INFO] * old target dictionary size: 26989
[06/05/18 14:36:05 INFO] * Merging new / old dictionaries…
[06/05/18 14:36:05 INFO] Updating the state by the vocabularies of the new train-set…
[06/05/18 14:36:06 INFO] * Updated source dictionary size: 26826
[06/05/18 14:36:06 INFO] * Updated target dictionary size: 27366
[06/05/18 14:36:08 INFO] Preparing memory optimization…
[06/05/18 14:36:08 INFO] * sharing 66% of output/gradInput tensors memory between clones
[06/05/18 14:36:08 INFO] Restoring random number generator states…
[06/05/18 14:36:08 INFO] Start training from epoch 8 to 47…
[06/05/18 14:36:08 INFO]
/home/torch/install/bin/luajit: ./onmt/train/Optim.lua:277: bad argument #2 to 'add' (sizes do not match at /home/torch/extra/cutorch/lib/THC/generated/…/generic/THCTensorMathPointwise.cu:217)
stack traceback:
[C]: in function 'add'
./onmt/train/Optim.lua:277: in function 'adamStep'
./onmt/train/Optim.lua:147: in function 'prepareGrad'
./onmt/train/Trainer.lua:272: in function 'trainEpoch'
./onmt/train/Trainer.lua:439: in function 'train'
/home/torch/OpenNMT/train.lua:337: in function 'main'
/home/torch/OpenNMT/train.lua:342: in main chunk
[C]: in function ā€˜dofileā€™
/home/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

Any idea what I am doing wrong?

You probably don't want to use -continue in this case. If you set it, the saved optimizer states are restored, but after the vocabulary merge your model no longer has the same number of parameters, which is why the Adam update fails with a size mismatch.
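For illustration, a minimal sketch of the retraining command above with -continue dropped, keeping only the options relevant here (the architecture options such as -rnn_size and -layers are read from the checkpoint when -train_from is used, as the warnings in the log suggest):

th /home/torch/OpenNMT/train.lua -data ${path_}/preprocessed-datasets/preprocessed-train.t7 \
-train_from ${path_}/models/_epoch7_19.78.t7 \
-save_model ${path_}/models/ \
-update_vocab merge \
-optim adam \
-learning_rate 0.0002 \
-end_epoch 47 \
-gpuid 1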


Now it works. Thanks a lot.

Hi Sasanita, did you use the new corpus for preprocessing to get the train.t7 file? How big is it? I am asking because I have a scenario with just a couple of thousand new sentences, and I am trying to run incremental training, but I am thinking of just using the generic training data and the new corpus as validation data (to preprocess and run for a couple of epochs). I am not sure it is a valid approach.

@guillaumekln, can you please share your thoughts about that? It is of interest since I could not find a way to "tune" an ONMT generic model with a tiny/small amount of new data in an appropriate way.

Hi @wiktor.stribizew,

You can take advantage of your new corpus if you mix it with some of your generic corpus. Ideally, mix it with related generic data: if your new corpus uses different terminology than your generic corpus, avoid including the portions of the generic corpus that contain the conflicting terminology.
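As an illustration, assuming hypothetical file names like new-domain.src/new-domain.tgt for the in-domain corpus and generic-subset.src/generic-subset.tgt for the selected slice of the generic corpus, the mixing itself can be as simple as:

cat new-domain.src generic-subset.src > mixed-train.src
cat new-domain.tgt generic-subset.tgt > mixed-train.tgt

# optionally shuffle source and target together so batches mix both corpora
paste mixed-train.src mixed-train.tgt | shuf > mixed-train.tsv
cut -f1 mixed-train.tsv > mixed-train.src
cut -f2 mixed-train.tsv > mixed-train.tgt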

Training should be much faster than training from scratch, so you can try various combinations of start and end epochs and see what works best.


I see your point, and it makes sense. However, I am not sure where the new corpus should go: training data, validation data, or both? With a new custom corpus of about 2,000 sentences, I do not think I can afford to sacrifice 1,000 of them to a validation chunk; that is 50%, which is a lot.

Only use your new data for training. If some related sentences from your generic corpus make sense as validation data, extract your validation set from there.
Otherwise, do not use a validation set at all and keep an eye on the perplexity during training.
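A rough sketch of that setup with preprocess.lua, assuming hypothetical file names (new-domain.src/new-domain.tgt for the new data, and a small validation slice pulled from the generic corpus as suggested above):

head -n 500 generic.src > valid.src
head -n 500 generic.tgt > valid.tgt

th /home/torch/OpenNMT/preprocess.lua -train_src new-domain.src -train_tgt new-domain.tgt \
-valid_src valid.src -valid_tgt valid.tgt \
-save_data ${path_}/preprocessed-datasets/preprocessed

The resulting preprocessed-train.t7 is what -data points to in the train.lua command.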

I have been doing some retraining runs with a small number of sentence pairs and without a validation set. The perplexity has been below 3.00 from epoch 1. I am training for 26 epochs, and this enables the network to learn the new in-domain terminology so that I get, for example, "waste water drainage" instead of "dirty water drainage". With a few handy scripts, going from a memoQ project TMX to a specialised model takes around 90 minutes.

I've been doing some further retraining or "specialization" of a baseline model in the past week, based on a 6K sentence-pair project translation memory in the field of electrical engineering. I ran the training for a further 17 epochs (after the original default of 13) on the new data without a validation set. The process took around two hours from the time I got the project TM until the time I tested the new "specialized" model. The translator has reported that all the technical terms are now being translated correctly.


Hi Terence,

So, you trained for a total of 30 epochs using a mix of your previous corpus with the 6K sentences and a merged vocab? If so, how many sentences did your new mixed corpus have? Isn't 30 epochs too many, or were you checking until you got the results you wanted (perplexity-wise or terminology-wise)?

Hi Panosk,
No, I didn't do it that way. I had already trained a baseline model with a corpus of approx. 11M sentences for 13 epochs. I preprocessed the "new" data in the usual way (I'm still using Lua) to create a training package (*.t7). I then trained the existing model on the new training package with the -train_from option for 17 epochs. 17 epochs may be too many, but that was just a guesstimate on my part. This wasn't a research project but a practical response to a translator who was getting too many OOVs. I tested, found the "new terminology" was being used in the translation, and quickly made the new model available to the translator. Time was of the essence as the translator was facing a deadline on a huge project. Of course, in ideal circumstances I should experiment with proportions of "old" and "new".
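For anyone trying to reproduce this kind of specialization run, a rough sketch of the two steps with the Lua scripts, using hypothetical file and model names (the flags are the ones from the commands earlier in the thread; the epoch count is just the figure mentioned above):

th /home/torch/OpenNMT/preprocess.lua -train_src project-tm.src -train_tgt project-tm.tgt \
-valid_src valid.src -valid_tgt valid.tgt \
-save_data project-tm

th /home/torch/OpenNMT/train.lua -data project-tm-train.t7 \
-train_from baseline_epoch13.t7 \
-update_vocab merge \
-end_epoch 17 \
-gpuid 1

Note that -continue is left out, as discussed earlier in the thread, so the optimizer state is reinitialized for the resized model.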