Problem with incremental/in-domain training

tel34 · March 23, 2017, 10:19am

Hi,
I’ve started experimenting with incremental training to add some more colloquial data to my predominantly EuroParl-based model. I’ve followed the guidelines provided here and I’ve reproduced my command line and the error messages below. Is there a command line option which would take care of this? Am I missing something simple?
Terence

#!/bin/ksh
th train.lua -data /home/tel34/OpenNMT/data/dutch_data/eng2ned_alpha-train.t7 -save_model /home/tel34/OpenNMT/data/dutch_data/model \
  -train_from /home/tel34/OpenNMT/data/dutch_data/model_epoch13_1.76.t7 -continue -end_epoch 18 -gpuid 1
echo "Done!"
exit 0

tel34@Joshua:~/OpenNMT$ ~/OpenNMT/continue_training.sh
[03/23/17 10:01:52 INFO] Using GPU(s): 1
[03/23/17 10:01:52 INFO] Loading checkpoint '/home/tel34/OpenNMT/data/dutch_data/model_epoch13_1.76.t7'...
[03/23/17 10:01:55 INFO] Resuming training from epoch 14 at iteration 1...
[03/23/17 10:01:55 INFO] Training Sequence to Sequence with Attention model
[03/23/17 10:01:55 INFO] Loading data from '/home/tel34/OpenNMT/data/dutch_data/eng2ned_alpha-train.t7'...
[03/23/17 10:02:49 INFO]  * vocabulary size: source = 41551; target = 47388
[03/23/17 10:02:49 INFO]  * additional features: source = 0; target = 0
[03/23/17 10:02:49 INFO]  * maximum sequence length: source = 50; target = 51
[03/23/17 10:02:49 INFO]  * number of training sentences: 30426
[03/23/17 10:02:49 INFO]  * maximum batch size: 64
[03/23/17 10:02:49 INFO] Building model...
[03/23/17 10:02:49 INFO] Initializing parameters...
[03/23/17 10:02:51 INFO]  * number of parameters: 84814004
[03/23/17 10:02:51 INFO] Preparing memory optimization...
/home/tel34/torch/install/bin/luajit: /home/tel34/torch/install/share/lua/5.1/nn/THNN.lua:110: weight tensor should be defined either for all 50004 classes or no classes but got weight tensor of shape: [47388] at /home/tel34/torch/extra/cunn/lib/THCUNN/generic/ClassNLLCriterion.cu:44
stack traceback:
        [C]: in function 'v'
        /home/tel34/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'ClassNLLCriterion_updateOutput'
        ...l34/torch/install/share/lua/5.1/nn/ClassNLLCriterion.lua:44: in function 'updateOutput'
        ...l34/torch/install/share/lua/5.1/nn/ParallelCriterion.lua:23: in function 'forward'
        ./onmt/modules/Decoder.lua:359: in function 'backward'
        ./onmt/Seq2Seq.lua:111: in function 'trainNetwork'
        ./onmt/utils/Memory.lua:41: in function 'optimize'
        ./onmt/train/Trainer.lua:80: in function 'closure'
        ./onmt/utils/Parallel.lua:79: in function 'launch'
        ./onmt/train/Trainer.lua:62: in function 'train'
        train.lua:129: in function 'main'
        train.lua:134: in main chunk
        [C]: in function 'dofile'
        ...el34/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: at 0x00406670

Etienne38 · March 23, 2017, 10:23am

It’s impossible to change the vocab used to train a model.

emartinezVic · March 23, 2017, 10:42am

Hi Terence!
I had the same problem a few days ago.
Then I realized that, as @Etienne38 says, it’s impossible to change the vocab used to train a model.

In practice, that means there is a vocabulary restriction when continuing/resuming a training, so you should pre-process your new data using the model dictionaries.

By doing something like this:
th preprocess.lua -train_src src_new_data -train_tgt tgt_new_data -valid_src valid_src_new_data -valid_tgt valid_tgt_new_data -src_vocab yourModel_src.dict -tgt_vocab yourModel_tgt.dict -save_data yourNewData

This gives you yourNewData-train.t7: your new data pre-processed with the vocabulary you used to train your model.

However, don’t forget that this does not introduce any new vocabulary to the model; it only lets the model learn the semantics of your new data.
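To illustrate why reusing the model dictionaries matters, here is a minimal Python sketch. It is not OpenNMT code: the `UNK` symbol, the toy word list and the function names are all made up for illustration. Words in the new data that are missing from the model dictionary are mapped to an unknown token, so the vocabulary size the model was built with never changes.

```python
UNK = "<unk>"  # hypothetical unknown-word symbol, for illustration only


def build_index(vocab_words):
    """Map each word of a fixed dictionary to an integer id."""
    return {word: idx for idx, word in enumerate(vocab_words)}


def encode(sentence, index):
    """Encode a sentence with a fixed dictionary; unseen words become UNK."""
    unk_id = index[UNK]
    return [index.get(tok, unk_id) for tok in sentence.split()]


# Dictionary the (toy) model was trained with.
model_index = build_index([UNK, "the", "house", "is", "big"])

# The new, more colloquial data contains a word the model has never seen.
ids = encode("the house is ginormous", model_index)
print(ids)               # every id stays inside the model's vocabulary
print(len(model_index))  # the dictionary size is unchanged by the new data
```

If the dictionary were rebuilt from the new data instead, its size would generally differ from the one the model's output layer expects, which is the kind of mismatch reported in the error at the top of the thread.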

good luck!
Eva

cservan · March 23, 2017, 11:07am

Hello All,
The retraining process for domain adaptation (also called the “specialization” process) has some restrictions.
For instance, you can’t change (or extend) the vocabulary of the model. This comes from the neural network model itself.
But there are some tricks.
The easiest way to handle the out-of-vocabulary (OOV) problem is to use subword units (BPE, Morfessor, etc.). This approach splits OOV words into subwords that belong to the model’s vocabulary. In this way, you can retrain your model with an in-domain corpus processed into subword units, which has the same vocabulary as the original training corpus.
Note that both the original training corpus and the retraining corpus have to be processed with the same subword model.
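As a rough illustration of the idea, the sketch below splits an unseen word into pieces that are already in a subword vocabulary. Please note that this is neither BPE nor Morfessor: it is a simple greedy longest-match split, and the toy vocabulary and the `segment` function are assumptions made up for the example.

```python
def segment(word, subwords):
    """Greedy longest-match split of a word into known subword units.

    Falls back to single characters, which are assumed here to always
    be part of the vocabulary.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until a unit matches.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in subwords or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces


# Toy subword vocabulary (hypothetical).
subword_vocab = {"klim", "aat", "top", "conferentie"}

# "klimaattop" is not in the vocabulary as a whole word,
# but it can be covered by known units.
print(segment("klimaattop", subword_vocab))
```

Because every piece belongs to the existing vocabulary, the in-domain corpus can be fed to the model without changing its dictionaries.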

Cheers,
Christophe

References:

Morfessor: http://www.aclweb.org/anthology/E14-2006 (paper); https://github.com/aalto-speech/morfessor (code)
BPE: http://www.aclweb.org/anthology/P16-1162 (paper); https://github.com/rsennrich/subword-nmt (code)

tel34 · March 23, 2017, 1:12pm

Thanks Eva and Etienne38. It now carries on training nicely with the new data! Perhaps it’s worth adding this to the documentation for newcomers (like me) to neural MT.

tel34 · March 23, 2017, 3:39pm

This has been an interesting experience. I continued training for a further 8 epochs using the vocabularies of the existing model, which was giving me really good English2Dutch translations (getting better scores than Moses in a recent comparison). The resulting new model has delivered disastrous output: just the initial word of a test sentence repeated over and over!
I’m wondering whether the reason could be that the new data comprised only 30,000 sentences - everything else was done by the book.
I will be investigating further.

dbl · March 23, 2017, 3:44pm

Hi Terence,

My approach is to do my general training for 13 epochs, then use a test set from my domain specific data to determine my “best” model to launch another 13 epochs of domain-specific training. Test again to find the best model from the domain-adapted set of models.

Do you have an idea about what percentage of your domain-specific data is OOV with respect to the baseline model?
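One simple way to get that figure is to count the running tokens of the in-domain data that are missing from the baseline vocabulary. A minimal sketch, assuming whitespace-tokenized text and a vocabulary held as a Python set (the sample sentences and names are made up):

```python
def oov_rate(sentences, vocab):
    """Return the percentage of running tokens not covered by vocab."""
    tokens = [tok for sent in sentences for tok in sent.split()]
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocab)
    return 100.0 * unknown / len(tokens)


# Toy baseline vocabulary and toy in-domain sentences (hypothetical).
baseline_vocab = {"i", "do", "not", "know", "that"}
in_domain = ["i dunno", "i do not know that"]

print(f"{oov_rate(in_domain, baseline_vocab):.1f}% OOV")
```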

Etienne38 · March 23, 2017, 3:48pm

What kind of learning rate curve did you use?

tel34 · March 23, 2017, 5:42pm

I didn’t specify a learning rate as an option. The displayed learning rate progression was:
Epoch 14 : 0.0312
Epoch 15 : 0.0156
Epoch 16 : 0.0078
Epoch 17 : 0.0039
Epoch 18 : 0.0020
Epoch 19 : 0.0010
Epoch 20 : 0.0005
I’m afraid I don’t know enough yet for these numbers to be meaningful to me.
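For what it’s worth, the numbers above are consistent with a learning rate that is halved after every epoch. Here is a small sketch that reproduces them; the starting value 0.03125 (1/32) is inferred from the rounded 0.0312, so treat it as an assumption.

```python
def halving_schedule(first_epoch, first_lr, n_epochs, decay=0.5):
    """Return (epoch, learning rate) pairs, multiplying by decay each epoch."""
    schedule = []
    lr = first_lr
    for epoch in range(first_epoch, first_epoch + n_epochs):
        schedule.append((epoch, lr))
        lr *= decay
    return schedule


# Assumed starting point: epoch 14 at 1/32, as inferred from the log above.
schedule = halving_schedule(first_epoch=14, first_lr=0.03125, n_epochs=7)
for epoch, lr in schedule:
    print(f"Epoch {epoch} : {lr:.4f}")
```

By epoch 20 the rate is 64 times smaller than at epoch 14, so each update changes the weights only very slightly.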

Etienne38 · March 23, 2017, 5:52pm

For me, these LRs should imply only very small tunings of the model. Strange that you damaged your model in such a visible way.

Perhaps, like me in different experiments (I never tried to specialize a pre-built generic model; I always built a model from scratch, mixing an in-domain data set with a larger generic set), you have come to the point where you may need my w2v-coupling procedure (a bit hard to get without a built-in w2v implementation).

dbl · March 23, 2017, 5:56pm

Ah… try using the general model as a launching point (-train_from), but don’t “-continue” from there, if that makes sense…

tel34 · March 23, 2017, 6:05pm

Hi David,
I had already noted your approach and was intending to try it out when the time comes to add some “real” domain-specific data to build a specialist model.
This exercise merely involved adding some short, colloquial sentences to counter the formality of most of EuroParl. I didn’t make any note of the OOV relationship.

emartinezVic · March 24, 2017, 8:25am

Hi there!
Terence, it is strange that your model changed so much, and for the worse. Sometimes this repetition phenomenon has to do with the number of epochs or even with the encoder architecture (for me, the biencoder and training for some more epochs managed to control this phenomenon). There is a post here that discusses it:

http://forum.opennmt.net/t/some-strange-translation-errors-is-it-a-bug/277?u=emartinezvic

As @dbl says, maybe you should use only “train_from” instead of “train_from” plus “continue”.
I am not quite sure about the difference here, but I think that the “continue” option is for restarting a training from a checkpoint and “train_from” is for specializing a pre-trained model.

I can tell you that I observed improvements each time I specialized a model by means of the “train_from” option with default settings (just adding -train_from with the path of the pre-trained model to my train.lua command line, with -data pointing to the new data).

Regarding OOVs, as @cservan said before, the most common way to deal with them is to use BPE or Morfessor segmentation into subword units.

I hope this can help you

tel34 · March 24, 2017, 9:39am

Thanks, Eva. As I basically wanted to increase the model’s ability to deal with colloquial material I’m currently training a new general model with a greater proportion of colloquial data. However, I will need to train more in-domain in the near future and this will be an interesting challenge.
Currently I’m avoiding most OOVs via the phrase-table option, making use of a very large single-word dictionary (350K words) which I have from an old rule-based system.

cservan · March 24, 2017, 9:44am

Hello,
@emartinezVic, the “continue” option requires the “train_from” option.
As far as I know, “train_from” can use either a checkpoint model or an epoch model, as both are models…
The “continue” option lets you continue a training run according to the training options stored in the model.
For instance, if your training process crashed (for any reason), you can restart it in this way:

th train.lua -train_from myLastCheckPointOrModel -continue

This also means that the “continue” option is not relevant for the specialization process.

dbl · March 24, 2017, 10:26am

Hi @cservan,

The difference (and @guillaumekln or @srush, please correct me if I’m wrong) is that if you use -train_from alone, you are starting a “new” training from the weights, embeddings, and vocab of the model (effectively resetting start_decay_at and learning_rate_decay). If you also use -continue, you are continuing that training, even if you’re doing it with new data (and picking up the decay stuff from your first run).
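To make the difference concrete, here is a toy Python model of the two cases as I describe them above. It is not OpenNMT’s implementation: the `learning_rates` function is hypothetical, and the values (initial rate 1.0, decay 0.5, decay starting after epoch 9) are assumptions chosen so that the continued run matches the rates Terence reported.

```python
def learning_rates(start_epoch, end_epoch, initial_lr, decay, start_decay_at):
    """Toy schedule: keep initial_lr up to and including start_decay_at,
    then multiply by decay once per later epoch."""
    rates = {}
    for epoch in range(start_epoch, end_epoch + 1):
        n_decays = max(0, epoch - start_decay_at)
        rates[epoch] = initial_lr * decay ** n_decays
    return rates


# Illustrative values only: initial_lr=1.0, decay=0.5, start_decay_at=9.
# With -continue: the epoch counter and decay state carry on from the first run.
continued = learning_rates(14, 18, 1.0, 0.5, 9)

# With -train_from alone: a fresh run on the new data, starting at epoch 1.
fresh = learning_rates(1, 5, 1.0, 0.5, 9)

print(continued[14], fresh[1])
```

Under these assumptions the continued run starts the new data at an already heavily decayed rate, while the fresh run starts again from the full initial rate.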

emartinezVic · March 24, 2017, 11:56am

Thank you both, @cservan and @dbl, for your explanations!

It is clearer to me now how train_from works.

cservan · March 24, 2017, 4:57pm

You’re welcome @emartinezVic

vince62s · July 20, 2017, 12:36pm

@cservan
In this paper: https://arxiv.org/pdf/1612.06141.pdf
what was the learning rate used for the additional epochs (for in-domain data)?
Thanks

cservan · July 20, 2017, 1:04pm

Hello Vincent,
The learning rate is fixed to 1, then a decay of 0.7 is applied.