Problem with incremental/in-domain training

What kind of learning rate curve did you use?

I didn’t specify a learning rate as an option. The displayed learning rate progression was:
Epoch 14 : 0.0312
Epoch 15 : 0.0156
Epoch 16 : 0.0078
Epoch 17 : 0.0039
Epoch 18 : 0.0020
Epoch 19 : 0.0010
Epoch 20 : 0.0005
I’m afraid I don’t know enough yet for these numbers to be meaningful to me.

For me, these learning rates should imply only very small adjustments to the model. It’s strange that you damaged your model in such a visible way.

Perhaps, like me in other experiments (I never tried to specialize a pre-built generic model; I always built a model from scratch, mixing an in-domain data set with a larger generic set), you will come to the point where you may need my w2v-coupling procedure (a bit hard to get working without a built-in w2v implementation).

:wink:

Ah… try using the general model as a launching point (-train_from), but don’t “-continue” from there, if that makes sense…

Hi David,
I had already noted your approach and was intending to try it out when the time comes to add some “real” domain-specific data to build a specialist model.
This exercise merely involved adding some short, colloquial sentences to counter the formality of most of EuroParl. I didn’t make any note of the OOV relationship.

Hi there!
Terence, it looks very strange that you managed to change your model that much, and for the worse. Sometimes this repetition phenomenon has to do with the number of epochs or even with the encoder architecture (for me, the biencoder and training for a few more epochs managed to control it). There is a post here with a discussion about that:

http://forum.opennmt.net/t/some-strange-translation-errors-is-it-a-bug/277?u=emartinezvic

As @dbl says, maybe you should only use “train_from” instead of “train_from” together with “continue”.
I am not quite sure about the difference here, but I think that the “continue” option is to restart a training from a checkpoint and “train_from” is to specialize a pre-trained model.

I can tell you that I observed improvements each time I specialized a model by means of the “train_from” option with default settings (just adding the -train_from new_data.t7_path to my train.lua command line).

Regarding OOVs, as @cservan said before, the most common way to deal with them is to use BPE or Morfessor segmentation into subword units.
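
For example, with the external subword-nmt tool (this is only a sketch; the file names and merge count are placeholders), the segmentation could look like:

subword-nmt learn-bpe -s 32000 < train.tok.src > bpe.codes
subword-nmt apply-bpe -c bpe.codes < train.tok.src > train.bpe.src
subword-nmt apply-bpe -c bpe.codes < test.tok.src > test.bpe.src

The BPE-segmented files are then what you feed to preprocess.lua and translate.lua instead of the plain tokenized text.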

I hope this can help you :slight_smile:

Thanks, Eva. As I basically wanted to increase the model’s ability to deal with colloquial material, I’m currently training a new general model with a greater proportion of colloquial data. However, I will need to train more in-domain in the near future and this will be an interesting challenge.
Currently I’m avoiding most OOVs via the phrase-table option, making use of a very large single-word dictionary (350K words) which I have from an old rule-based system.
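
In case it is useful to anyone: that back-off lookup is simply the -phrase_table option of translate.lua, which as far as I know expects one entry per line in a source|||target format. A hypothetical invocation, with placeholder file names, would be:

th translate.lua -model general_model_epoch13.t7 -src input.tok.txt -output output.txt -phrase_table big_dictionary.txt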

Hello,
@emartinezVic, the option “train_from” is needed by the option “continue”.
As far as I know, “train_from” can use either a checkpoint model or an epoch model, as both are models…
The “continue” option makes it possible to continue a training according to the training options stored in the model.
For instance, if your training process crashed (for any reason), you can restart it in this way:

th train.lua -train_from myLastCheckPointOrModel -continue

This also means that the “continue” option is not relevant for the specialization process.
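
A specialization run, by contrast, would be launched as a fresh training from the generic model without “continue”, for instance (the file names are only placeholders):

th train.lua -data in_domain-train.t7 -save_model specialized_model -train_from myGenericModel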

Hi @cservan,

The difference (and @guillaumekln or @srush, please correct me if I’m wrong) is that if you use -train_from alone, you are starting a “new” training from the weights, embeddings, and vocab of the model (effectively resetting start_decay_at and learning_rate_decay). If you also use -continue, you are continuing that training, even if you’re doing it with new data (and picking up the decay settings from your first run).

Thank you both, @cservan and @dbl, for your explanations!

It is clearer to me now how train_from works :blush:

You’re welcome @emartinezVic :wink:

@cservan
in this paper: https://arxiv.org/pdf/1612.06141.pdf
what was the learning rate used for the additional epochs (for the in-domain data)?
Thanks

Hello Vincent,
the learning rate is fixed to 1, then a decay of 0.7 is applied.

FYI: http://www.statmt.org/wmt17/pdf/WMT13.pdf
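
In terms of train.lua options, that corresponds to something like the following sketch (the file names are placeholders, the remaining options being left untouched):

th train.lua -data in_domain-train.t7 -train_from myGenericModel -learning_rate 1 -learning_rate_decay 0.7 -save_model specialized_model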

I thought I would report, in case it’s useful to others, that I’ve been specialising some models over the last few days. I have been launching with -train_from from Epoch 13 of my general model, using -update_vocab. It has generally taken a further 13 epochs to get the model to change its beliefs and translate with the “new terminology” contained in my additional training material. An example would be “aanbestedingsdocument”, which was previously translated (not wholly incorrectly) as “tender document” and with the “retrained” model is now translated with the preferred translation “procurement document”. The successful retraining session, which took 52,000 “new” sentences, lasted just under 2 hours on a machine with a GTX 1080Ti GPU.
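
For reference, my launch command was essentially of this shape (the file names here are placeholders rather than my actual paths):

th train.lua -data new_material-train.t7 -save_model retrained_model -train_from general_model_epoch13.t7 -update_vocab merge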

Hi Terence,
this sounds great. Out of curiosity, what learning rate did you use during the additional 13 epochs?
Cheers.

Hi Vincent,
Been away from my machines! I’ve just got back, looked at the training log and see:
Epoch 14: LR = 0.08751, decaying to
Epoch 26: LR = 0.00050
Retraining had started from Epoch 1 (with LR = 1.0), submitting only the new material, and a decay of 0.65 had started at Epoch 4. I used -train_from without -continue but was surprised to see training start from Epoch 1. However, I am pleased with the result, as I now have an effective way of getting post-edited translations from memoQ into the model in a reasonably short time.
Cheers,
Terence

Hello Terence, and thank you for your interesting report on the specialization issue.
I would like to ask:
(1) What option did you use for -update_vocab, i.e. merge or replace?
(2) Are there any single-word sentences in your “new” sentences?
(3) After the retraining with the new data, did you still use a phrase_table (which you mentioned in a previous post) when you tested translations?
Thanks in advance!

Hello Anna,

  1. I used the “merge” option, but I suppose that if you want to train a highly specialised model, “replace” would be the logical option. If you are thinking of specialising with a few thousand sentences, this should be a very quick process, so you could see what gives you your desired outcome.
  2. There have been no single-word sentences: I have tended to specialise with “project tmx” files from real jobs, and these have generally been technical documents.
  3. I am still using the back-off phrase_table. That is always consulted for Unknown Tokens after the inference process, so if the specialisation has worked well there will be few unknown tokens and it will hardly be consulted. Logically I could omit it :slight_smile:

With all these issues it’s a question of seeing what works best for you.

Terence

Hi Terence,

Thanks for your reply.
Another thing I would like to clarify is this: for re-training an existing engine with new data that is small compared to the existing engine’s training set (e.g. 1-2K sentences compared to 400K), we first run the preprocess.lua command, which produces new (small) *dict files and the new data model. From what I have read, while retraining we have to use the same *dict files that the existing model was trained with, and use the -update_vocab option.
So what action do we have to take, practically? In the retrain command the *dict files are not mentioned, but in the log file I notice that the small ones are actually the ones taken into account. I hope I did not confuse you…
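
Practically, the two steps I am referring to look roughly like this (all file names are placeholders):

th preprocess.lua -train_src new.src -train_tgt new.tgt -valid_src val.src -valid_tgt val.tgt -save_data new_data
th train.lua -data new_data-train.t7 -train_from existing_model.t7 -update_vocab merge -save_model retrained_model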
Anna