Incremental vocab

Here is something that could be tested:

  1. For a given training set T1, build a vocab with N reserved unknown-word placeholders UNK1, UNK2, UNK3… (for example, 10K reserved slots in 50K-word dicts).
  2. Train a model M1 with T1. Of course, the UNKn tokens won’t be used in this training, and will never occur while translating, so this should not be a problem.
  3. When a new training set T2 arrives, find the unknown words having more than P occurrences in it (to avoid training on a single sentence, though of course P=1 could also be tested), and rename some UNKn entries to these unknown words in the original dicts.
  4. Retrain M1 with T2 to obtain a model M2 with these new known words.

Does this have a good chance of working?

PS: of course, in step 4, it’s possible to train with T1+T2
PS: in step 3, the P threshold can also be applied to T1+T2, causing some words previously ignored in step 1 to be taken into account in step 4
PS: in step 3, P=1 would be especially interesting if T2 is an in-domain set, where all words are potentially important
PS: in step 4, it’s possible to train with T1 + the subset of T2 where the new vocab occurs (interesting if T1 is a large generic set and T2 a small in-domain set)
PS: this lightweight approach is certainly equivalent to a heavier post-hoc refactoring of the input/output network structure
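The four steps above can be sketched with a small vocab helper. This is a minimal sketch under my own assumptions, not OpenNMT’s actual tooling: the `build_vocab` and `assign_reserved` names, and the `UNKn` token format, are hypothetical.

```python
import re
from collections import Counter

UNK_PATTERN = re.compile(r"^UNK\d+$")

def build_vocab(corpus, size, n_reserved):
    """Step 1: build a vocab of `size` entries, keeping `n_reserved`
    slots as UNK1..UNKn placeholders for future unknown words."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    kept = [w for w, _ in counts.most_common(size - n_reserved)]
    return kept + [f"UNK{i}" for i in range(1, n_reserved + 1)]

def assign_reserved(vocab, new_corpus, min_count):
    """Step 3: rename UNKn placeholders to unknown words occurring
    at least `min_count` times (the P threshold) in the new set."""
    known = set(vocab)
    counts = Counter(tok for sent in new_corpus for tok in sent.split())
    new_words = [w for w, c in counts.most_common()
                 if c >= min_count and w not in known]
    vocab = list(vocab)
    # reuse the placeholder slots in place, so all other indices are unchanged
    slots = [i for i, w in enumerate(vocab) if UNK_PATTERN.match(w)]
    for i, w in zip(slots, new_words):
        vocab[i] = w
    return vocab
```

Since all pre-existing word indices are unchanged, M1 can then be retrained on T2 (or T1+T2) without any change to the network structure.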


I have a good opportunity to test such a procedure. I have an in-domain data set D with 184K sentences, FR->DE. These data are quite particular:

  • a lot of domain-specific (technical) words. Some sentences, or parts of sentences, are ordinary formulations, while others are quite specific.
  • many sentences are fully uppercased, with or without diacritics. Some of them are present in several copies with different forms.

Due to the small size of the set, I need to train with generic sentences to get a general vocab and general formulations. The in-domain data are certainly not sufficient to get a good model. But:

  • domain-specific words are OOV for the generic model
  • words without diacritics are also OOV, even when their well-formed variants are in the generic model
  • fully uppercased sentences are confusing for the feature part of a training done with a generic model

First of all, I took 2K sentences V from D for the validation set. To avoid any kind of overlap, I built the training set T from D by removing all sentences similar to those of V once lowercased, stripped of diacritics, and with all numbers replaced by a single “8”. T then contains 130K sentences. I finally took 2K sentences from T to build a checking set C.
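The similarity key used for this filtering can be sketched as follows. The `norm_key` and `filter_overlap` names are my own, and the exact rules are my reading of the description (lowercase, strip diacritics, collapse every number to a single “8”):

```python
import re
import unicodedata

def norm_key(sentence):
    """Normalization key used to detect near-duplicate sentences."""
    s = sentence.lower()
    # strip diacritics: decompose, then drop combining marks
    s = unicodedata.normalize("NFD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    # collapse any run of digits into a single "8"
    s = re.sub(r"\d+", "8", s)
    return s

def filter_overlap(train, valid):
    """Keep only training sentences whose normalized key
    does not appear among the validation keys."""
    valid_keys = {norm_key(s) for s in valid}
    return [s for s in train if norm_key(s) not in valid_keys]
```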

1) mixing with 2M Europarl set E

I mixed 5 x T with E, so that the in-domain data amount to around 30% of the Europarl data. I made 5 epochs at LR=1, then 10 epochs, decreasing it by a 0.7 factor at each epoch.
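As a quick check of what this schedule yields per epoch (a sketch only; the function and parameter names are mine):

```python
def lr_schedule(start_lr=1.0, flat_epochs=5, decay_epochs=10, decay=0.7):
    """Learning rate per epoch: `flat_epochs` epochs at `start_lr`,
    then multiplied by `decay` at each of `decay_epochs` epochs."""
    lrs = [start_lr] * flat_epochs
    lr = start_lr
    for _ in range(decay_epochs):
        lr *= decay
        lrs.append(lr)
    return lrs
```

The final epoch thus runs at about 1 x 0.7^10 ≈ 0.028.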

50K words in the vocabs = all possible in-domain words + a complement from the Europarl vocab (highest occurrences first).


Here is the PPL learning curve:

Here is the CHECK BLEU curve, ending with BLEU=89.36 on in-domain set C

Here is the VALID BLEU curve, ending with BLEU=35.7 on in-domain set V

2) Europarl alone

I built a checking set Ce of 2K sentences from Europarl training set E.

50K words in the vocabs = 40K words from the Europarl vocab + 10K UNDEFn placeholder words.

Same learning strategy, with the 2M-sentence E as training set.


Here is the PPL learning curve (of course, the in-domain valid PPL is huge!):

Here is the CHECK BLEU curve, ending with BLEU=25.33 on Europarl set Ce

Here is the VALID BLEU curve, ending with an (awful!) BLEU=4.04 on in-domain set V (QED can be negative in such a situation):

3) updating Europarl model with in-domain retraining

50K words in the vocabs = update of the step 2 vocabs, where the 10K UNDEFn words were replaced by the in-domain vocab (highest occurrences first).
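Such a vocab update can be sketched as a rewrite of the dict files. This is a sketch only: I assume a dict with one `word index` pair per line, and `update_dict` is my own helper, not an existing tool.

```python
import re
from collections import Counter

def update_dict(dict_lines, in_domain_corpus, placeholder=r"UNDEF\d+$"):
    """Replace UNDEFn placeholder entries in a dict (one 'word index'
    pair per line) with the most frequent in-domain words not already
    in the dict, keeping every index unchanged."""
    pat = re.compile(placeholder)
    known = {ln.split()[0] for ln in dict_lines}
    counts = Counter(tok for sent in in_domain_corpus for tok in sent.split())
    # highest occurrences first, skipping words already known
    new_words = iter(w for w, _ in counts.most_common() if w not in known)
    out = []
    for ln in dict_lines:
        word, idx = ln.split()
        if pat.match(word):
            word = next(new_words, word)  # keep placeholder if exhausted
        out.append(f"{word} {idx}")
    return out
```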

Same learning strategy, but starting from the step 2 model, with the 130K-sentence T as training set.


Of course, the Europarl training is supposed to be done only once, for many subsequent retrainings…

Here is the PPL learning curve:

Here is the CHECK BLEU curve, ending with BLEU=71.16 on in-domain set C

Here is the VALID BLEU curve, ending with BLEU=34.0 on in-domain set V

34.0 is quite near the 35.7 obtained at step 1. Certainly something better could be done, for example by mixing a few copies of T together (as is the case in step 1) or increasing the number of epochs. Also, more than 10K UNDEFn words, replaced later by more in-domain words, could be used…

This 34.0 was obtained in 73 minutes, compared to 57 hours for the 35.7…

Just a first experiment…

Bad news… perhaps…

The retraining of step 3 quite heavily damages the generic capability of the model obtained in step 2: the CHECK BLEU falls from 25.33 to 10.86 on Europarl set Ce.

Possibly this could be avoided by using some Europarl data somewhere in step 3. A few epochs with set E? A part of set E mixed with set T?

The key question is: is this damage to the generic capability also harmful for the in-domain translation? In fact, I obtained a 34.0 BLEU on validation data carefully filtered to avoid any overlap with the training data…

PS: when looking at the Ce translations, I find that many details were changed. But a lot of the generic vocab is still there, with a lot of not-so-bad sub-parts. So my guess is that the generic part of the model was sufficiently preserved to do a quite good job on the generic parts and generic vocab of the in-domain sentences…

PS: another idea is, in step 3, to use the mixed model of step 1 rather than the generic-only model of step 2 when a new in-domain set arrives (thus training in step 1 with some UNDEFn words). I guess it would bring the best of both worlds, keeping a very fast in-domain retraining along with the vocab growth.


A few months ago, I tried this approach too, because re-training a model takes a lot of time, and the effect is very good. I think that once a model is trained into a domain, it stays there; like people, it is very difficult to redirect, as it would need to re-learn too many things.

Netxiao, can you be more precise in your explanations? I’m not sure I understand; as I read it, some parts seem ambiguous.

my steps:
build the dicts, adding 30K unk tags as yf_unk1, yf_unk2 … yf_unk30000.

  1. train a generic model.
  2. train a domain model from the generic model.
  3. when a new domain corpus arrives, use a script to map the unknown words onto yf_unk1…yf_unk30000, then re-train.

this is my simple idea :grinning:


As a baseline comparison, the model from mixed method 1 obtains a CHECK BLEU of 22.77 on the Europarl Ce set. A bit less than method 2 with 25.33, but far more than method 3 with 10.86.

It seems quite evident that a good incremental method is:

  1. make a first heavy training, mixing generic and in-domain data, to get a first model M1. In the vocab, reserve some UNDEFn words that will be used neither in this training nor in translations with model M1.
  2. when a new in-domain training set arrives, replace the UNDEFn words with the new vocab, and try a fast retraining of model M1 with the in-domain data only, to get a new model M2.

This way, the first model M1, already trained on both generic and specific data, shouldn’t be too much damaged by the retraining, because there should be very few things to change.