Retraining with new vocabulary: fine-tuning

(Zuzanna Parcheta) #1

I trained a model with a given vocabulary. Now I want to fine-tune this model on sentences that contain new vocabulary.
My training command is the following:

th /home/torch/OpenNMT/train.lua -data ${path_}/preprocessed-datasets/preprocessed-train.t7 \
-train_from ${path_}/models/_epoch7_19.78.t7 \
-continue \
-rnn_size 512 \
-encoder_type rnn \
-rnn_type LSTM \
-end_epoch 47 \
-max_batch_size 20 \
-save_model ${path_}/models/ \
-layers 1 \
-dropout 0.2 \
-update_vocab merge \
-optim adam \
-learning_rate 0.0002 \
-learning_rate_decay 1.0 \
-src_word_vec_size 512  \
-tgt_word_vec_size 512  \
-gpuid 1

I get this error:

[06/05/18 14:36:03 INFO] Using GPU(s): 1
[06/05/18 14:36:03 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0
[06/05/18 14:36:03 INFO] Training Sequence to Sequence with Attention model…
[06/05/18 14:36:03 INFO] Loading data from ‘/home/German/infreq-exp/it-domain/fine-tuning/bleu100/preprocessed-datasets/preprocessed-train.t7’…
[06/05/18 14:36:03 INFO] * vocabulary size: source = 9412; target = 10017
[06/05/18 14:36:03 INFO] * additional features: source = 0; target = 0
[06/05/18 14:36:03 INFO] * maximum sequence length: source = 50; target = 51
[06/05/18 14:36:03 INFO] * number of training sentences: 11352
[06/05/18 14:36:03 INFO] * number of batches: 591
[06/05/18 14:36:03 INFO] - source sequence lengths: equal
[06/05/18 14:36:03 INFO] - maximum size: 20
[06/05/18 14:36:03 INFO] - average size: 19.21
[06/05/18 14:36:03 INFO] - capacity: 100.00%
[06/05/18 14:36:03 INFO] Loading checkpoint ‘/home/German/infreq-exp/it-domain/fine-tuning/bleu100/models/_epoch7_19.78.t7’…
[06/05/18 14:36:05 WARNING] Cannot change dynamically option -tgt_word_vec_size. Ignoring.
[06/05/18 14:36:05 WARNING] Cannot change dynamically option -src_word_vec_size. Ignoring.
[06/05/18 14:36:05 INFO] Resuming training from epoch 8 at iteration 1…
[06/05/18 14:36:05 INFO] * new source dictionary size: 9412
[06/05/18 14:36:05 INFO] * new target dictionary size: 10017
[06/05/18 14:36:05 INFO] * old source dictionary size: 26403
[06/05/18 14:36:05 INFO] * old target dictionary size: 26989
[06/05/18 14:36:05 INFO] * Merging new / old dictionaries…
[06/05/18 14:36:05 INFO] Updating the state by the vocabularies of the new train-set…
[06/05/18 14:36:06 INFO] * Updated source dictionary size: 26826
[06/05/18 14:36:06 INFO] * Updated target dictionary size: 27366
[06/05/18 14:36:08 INFO] Preparing memory optimization…
[06/05/18 14:36:08 INFO] * sharing 66% of output/gradInput tensors memory between clones
[06/05/18 14:36:08 INFO] Restoring random number generator states…
[06/05/18 14:36:08 INFO] Start training from epoch 8 to 47…
[06/05/18 14:36:08 INFO]
/home/torch/install/bin/luajit: ./onmt/train/Optim.lua:277: bad argument #2 to ‘add’ (sizes do not match at /home/torch/extra/cutorch/lib/THC/generated/…/generic/
stack traceback:
[C]: in function ‘add’
./onmt/train/Optim.lua:277: in function ‘adamStep’
./onmt/train/Optim.lua:147: in function ‘prepareGrad’
./onmt/train/Trainer.lua:272: in function ‘trainEpoch’
./onmt/train/Trainer.lua:439: in function ‘train’
/home/torch/OpenNMT/train.lua:337: in function ‘main’
/home/torch/OpenNMT/train.lua:342: in main chunk
[C]: in function ‘dofile’
/home/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

Any idea what I am doing wrong?

(Guillaume Klein) #2

You probably don’t want to use -continue in this case. If you set it, the previous optimization states are restored, but after the vocabulary merge your new model no longer has the same number of parameters, so the sizes do not match.
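
To illustrate the mismatch, here is a toy Python sketch. It is not OpenNMT's Optim.lua code; it only mimics the shape check that fails inside the Adam step. The vocabulary sizes come from the log above, the embedding width of 512 comes from the command (the checkpoint's real value may differ), and the function names are made up for the example.

```python
# Toy illustration only: mimics the size check, not OpenNMT's actual optimizer.

EMB_DIM = 512          # -src_word_vec_size from the command (assumed to match the checkpoint)
OLD_VOCAB = 26403      # old source dictionary size, from the log
MERGED_VOCAB = 26826   # updated source dictionary size, from the log

saved_moment = (OLD_VOCAB, EMB_DIM)       # Adam moment buffer stored in the checkpoint
new_embedding = (MERGED_VOCAB, EMB_DIM)   # embedding table after -update_vocab merge


def adam_step(grad_shape, moment_shape):
    """Fail when the stored moment buffer does not match the gradient."""
    if grad_shape != moment_shape:
        raise ValueError(
            f"sizes do not match: gradient {grad_shape} vs. moment {moment_shape}"
        )


def try_step(restore_optimizer_state):
    """Run one step with either the restored or a freshly created optimizer state."""
    moment = saved_moment if restore_optimizer_state else new_embedding
    try:
        adam_step(new_embedding, moment)
        return "ok"
    except ValueError as err:
        return f"error: {err}"


print(try_step(restore_optimizer_state=True))    # comparable to using -continue
print(try_step(restore_optimizer_state=False))   # comparable to dropping -continue
```

With the restored state the gradient has 26826 rows while the saved buffer has 26403, which is the kind of mismatch the 'add' call reports.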

(Zuzanna Parcheta) #3

Now it works. Thanks a lot.

(Wiktor Stribiżew) #4

Hi Sasanita, did you pre-process the new corpus to get the train.t7 file? How big is it? I am asking because I have a scenario with just a couple of thousand new sentences and I am trying to run an incremental training. I am thinking of just using the generic training data, with the new corpus as validation data (to pre-process and run for a couple of epochs). I am not sure this is a valid approach.

@guillaumekln, can you please share your thoughts about that? This matters to me because I could not find an appropriate way to “tune” a generic ONMT model with a small amount of new data.

(Panos Kanavos) #5

Hi @wiktor.stribizew,

You can take advantage of your new corpus if you mix it with some of your generic corpus. Ideally, you should mix it with generic data that is somewhat related. That is, if your new corpus uses different terminology from your generic corpus, avoid mixing in the portions of the generic corpus that contain the conflicting terminology.

Training should be much faster than training from scratch, so you can try various combinations of start and end epochs and see what works best.
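
The mixing described above could be sketched in Python as follows. This is only one possible way to do it: the function name, the `generic_ratio` knob, and the substring-based filter for conflicting terminology are all assumptions made for the example, and the sentence pairs are placeholders.

```python
import random


def mix_corpora(in_domain, generic, generic_ratio=1.0, conflicting_terms=(), seed=0):
    """Combine all in-domain pairs with a random sample of generic pairs.

    in_domain, generic: lists of (source, target) sentence pairs.
    generic_ratio: how many generic pairs to add per in-domain pair (hypothetical knob).
    conflicting_terms: generic pairs whose target contains any of these are skipped.
    """
    rng = random.Random(seed)
    usable = [
        pair for pair in generic
        if not any(term in pair[1] for term in conflicting_terms)
    ]
    n_generic = min(len(usable), int(len(in_domain) * generic_ratio))
    mixed = list(in_domain) + rng.sample(usable, n_generic)
    rng.shuffle(mixed)
    return mixed


# Placeholder data, only to show the call.
in_domain = [(f"new src {i}", f"new tgt {i}") for i in range(4)]
generic = [(f"gen src {i}", f"gen tgt {i}") for i in range(100)]

mixed = mix_corpora(in_domain, generic, generic_ratio=2.0, conflicting_terms=("tgt 7",))
print(len(mixed))  # 4 in-domain pairs + 8 sampled generic pairs
```

The mixed list would then be written out and pre-processed like any other training set; the right ratio is something to find by experiment.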

(Wiktor Stribiżew) #6

I see your point, and it makes sense. However, I am not sure where the new corpus should go: training data, validation data, or both? With a new custom corpus of about 2000 sentences, I do not think I can afford to sacrifice 1K sentences for validation; that is 50%, which is a lot.

(Panos Kanavos) #7

Only use your new data for training. If your generic corpus contains related sentences that make sense as validation data, extract your validation set from there.
Otherwise, do not use a validation set at all and keep an eye on the perplexity during training.
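
For reference, perplexity is the exponential of the average negative log-likelihood per target token. A minimal Python sketch of that definition, plus a simple plateau check for watching it across epochs, might look like this. The `should_stop` helper and its thresholds are hypothetical and not part of OpenNMT.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def should_stop(history, patience=3, min_delta=0.01):
    """True when the last `patience` epochs improved on the earlier best by less than min_delta."""
    if len(history) <= patience:
        return False
    best_before = min(history[:-patience])
    return min(history[-patience:]) > best_before - min_delta


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 10))

# Made-up per-epoch perplexities, only to show the plateau check.
print(should_stop([5.0, 3.2, 2.9, 2.9, 2.9, 2.9]))
```

Keep in mind that training perplexity is measured on data the model has already seen, so a low value alone does not show that the model generalises.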

(Terence Lewis) #8

I have been running some retraining jobs with a small number of sentence pairs and without a validation set. The perplexity has been below 3.00 from epoch 1. I am training for 26 epochs, and this enables the network to learn the new in-domain terminology so that I get, for example, “waste water drainage” instead of “dirty water drainage”. With a few handy scripts, going from a memoQ project TMX to a specialised model can take around 90 minutes.
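
As a rough idea of what the first of those scripts might do (this is not the script mentioned above, just a sketch), the following Python pulls sentence pairs out of a TMX file using only the standard library. It assumes the usual TMX layout of `tu` / `tuv` / `seg` elements and exact language-code matches; the function name and the sample content are made up.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespaced key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) pairs for every translation unit that has both languages."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text that follows inline tags.
                segments[lang] = "".join(seg.itertext()).strip()
        src = segments.get(src_lang.lower())
        tgt = segments.get(tgt_lang.lower())
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Made-up sample, only to show the call.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu>
  <tuv xml:lang="de"><seg>Hallo Welt</seg></tuv>
  <tuv xml:lang="en"><seg>Hello world</seg></tuv>
</tu>
</body></tmx>"""

print(tmx_to_pairs(SAMPLE, "de", "en"))
```

Real exports often carry regional codes such as en-GB, so the language arguments would need to match whatever the file actually uses.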