In-domain training

Here is a quantified experiment with this w2v-coupling method (with fixed embeddings, since ONMT currently won’t update the embedding layers in the second phase):

Stable training enables several possibilities that were previously very hard, or even impossible, without running into NMT’s bad over-fitting behaviour:

  1. long training, which could produce much more refined models than an early-stopping procedure
  2. mixing N copies of a small in-domain data set with 1 copy of a large generic data set, so that each epoch on the mixed data set is equivalent to 1 epoch on the generic data set and N epochs on the specific one. With N low, the in-domain data are under-represented relative to the generic data; with N high, each short training is equivalent to a long training on the in-domain data alone
  3. using a much larger network that can handle numerous long and complex formulations more finely
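Point 2 of the list above can be sketched in a few lines of Python. This is only an illustration of the N-copies-plus-shuffle scheme, with toy placeholder corpora instead of real files:

```python
import random

def build_mixed_corpus(in_domain, generic, n_copies, seed=1234):
    """Return one copy of `generic` plus `n_copies` copies of `in_domain`,
    shuffled, so one pass over the mix = 1 generic epoch + N in-domain epochs."""
    mixed = list(generic) + list(in_domain) * n_copies
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy (source, target) pairs; real data would be read from parallel files.
generic = [("g-src %d" % i, "g-tgt %d" % i) for i in range(100)]
food = [("f-src %d" % i, "f-tgt %d" % i) for i in range(5)]
mix = build_mixed_corpus(food, generic, n_copies=10)
```

Each in-domain pair now appears 10 times in the shuffled mix, alongside a single copy of the generic data.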

I did this test on a FR->EN “food & cooking” data set:

  • 59319 sentences in the training set
  • 14829 sentences in the validation set

Of course, all duplicate pairs were removed.

10 copies of this training set were mixed with 2007723 Europarl sentences. The whole in-domain set then amounts to almost 30% of the Europarl set. On the curve below, 8 epochs are equivalent to 80 epochs on the F&C training set.
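As a quick sanity check on these figures (a trivial sketch):

```python
in_domain, europarl, copies = 59319, 2007723, 10

oversampled = copies * in_domain        # 593190 in-domain lines per mixed epoch
share = oversampled / europarl          # ~0.295, i.e. almost 30% of Europarl
equivalent_epochs = 8 * copies          # 8 mixed epochs ~ 80 F&C epochs
```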

ONMT config:

  • vector size 200
  • 2 layers of size 1000

Here is the current training curve (I will update it as further epochs are done):

Keep in mind that the validation PPL is not meaningful here, as said here:

The BLEU evaluation is not perfect either, but here it is, computed on the validation set:
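For what it’s worth, a corpus-level BLEU like this one can be sketched in pure Python. This is a simplified stand-in (single reference per hypothesis, uniform 4-gram weights, and a floor-smoothing assumption to avoid log(0)), not a replacement for standard scripts such as multi-bleu.perl:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams found in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU, one reference per hypothesis, uniform weights."""
    hyp_len = sum(len(h) for h in hypotheses)
    ref_len = sum(len(r) for r in references)
    log_precisions = []
    for n in range(1, max_n + 1):
        matched = total = 0
        for hyp, ref in zip(hypotheses, references):
            hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
            matched += sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
            total += max(len(hyp) - n + 1, 0)
        # floor smoothing: avoids log(0) when an n-gram order has no match
        log_precisions.append(math.log(max(matched, 1e-9) / max(total, 1)))
    brevity = min(1.0, math.exp(1.0 - ref_len / hyp_len)) if hyp_len else 0.0
    return brevity * math.exp(sum(log_precisions) / max_n)
```

A hypothesis identical to its reference scores 1.0; partial overlaps score strictly between 0 and 1.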

Here are some translation samples (note that there are also errors in the original data):
SRC: Mettez au frigo en attente , mais ne servez pas trop froids .
REF: Refrigerate until needed , but do not serve too cold .
ONMT: Keep in the fridge , but not serve too cold .

SRC: Servir en entrée ou en accompagnement d’ une viande grillée.variante :
REF: Serve as an appetizer or as a side dish for grilled meat.variation :
ONMT: Serve as an appetizer or as a side dish for grilled meat :

SRC: Commencez par éplucher les pommes de terre puis coupez les selon la forme souhaitée .
REF: Start by peeling the potatoes and cutting them into the desired shape .
ONMT: Start by peeling the potatoes and cut them into the desired form .

SRC: saumons destinés à la fabrication de pâté ou de pâte à tartiner
REF: salmon for manufacture into pastes or spreads
ONMT: salmon destined for the manufacture of pâté or batter

SRC: Cuire au four pendant 12 minutes ou jusqu ’ à ce que les asperges soient cuites , mais toujours croquantes .
REF: Bake for 12 minutes , or until asparagus is cooked but still crisp .
ONMT: Bake for 12 minutes or until asparagus is cooked , but still crisp .

SRC: 1 kg de myrtilles sauvages , 400 g de sucre , 1/4 de citron non traité , avec sa peau .
REF: 1 kg fresh blueberries 150 ml water 350 g granulated sugar ¼ unwaxed lemon with peel on
ONMT: 1 kg of wild blueberries , 400 g granulated sugar a knob of butter ( optional )

SRC: Le vent sèche le dessus de l’ eau très concentrée en sel ( qui deviendras plus tard du sel de mer ) et donne des cristaux très fins : la fleur de sel .
REF: The wind dries the surface of the water which has a high salt concentration ( this will later become sea salt ) producing thin flaky crystals .
ONMT: The dry wind of very concentrated water in salt ( who later comes to sea salt ) and gives very thin crystal : the flower flower .

SRC: L’ activité d’ une antitoxine ou d’ un antisérum doit être déterminée par une méthode acceptable et , lorsqu ’ il y a lieu , l’ unité d’ activité doit être l’ unité internationale .
REF: The potency of an antitoxin or antiserum shall be determined by an acceptable method and where applicable the unit of potency shall be the International Unit .
ONMT: The potency of antitoxin or antiserum shall be determined by an acceptable method and , the unit of potency must be the International Unit .

SRC: Après les avoir bien blanchies , on fait confire tout doucement des lanières d’ écorce de pamplemousse .
REF: After soakiing , strips of grapefruit skin are cooked very slowly to conservethem in sugar .
ONMT: After having blanched well , you can cook it with strips of grapefruit peel .

SRC: 1 heure Mettez cette pâte en forme de galette , entourez la de film étirable et mettez au frigo pour 1 heure ou 2 .
REF: 1 hour Form the dough into a flat cake and wrap in stretch plastic film . Put in the fridge for 1 or 2 hours .
ONMT: 1 hour Put this dough into a flat cake and wrap in stretch plastic film . Refrigerate for 1 or 2 hours .



Maybe it would be good to compare to a baseline, i.e. plain training on the same corpus, with the same network size and the same training parameters, letting ONMT deal with the embeddings.

All my previous attempts to train on such a mixed corpus with a standard configuration failed (with poor results). If you know someone who succeeded, I would be glad to learn about it.

I haven’t tried with data this diverse yet, but I have had some success in a similar vein, and I didn’t need to use w2v or fiddle with embeddings.

  1. Combine your Europarl data with (a single copy of) your food data.
  2. I like large vocabularies, so I take anything with frequency > 2 (or 3, depending on the dataset).
  3. Use your food validation set & train for 13 epochs.
  4. Use your food test set to determine which of the 13 epochs gives best performance (it won’t be great).
  5. Now use that model as a launching point and train 13 epochs with only food data.
  6. Again, use test set to select best model.
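The checkpoint selection in steps 4 and 6 boils down to an argmax over per-epoch test scores. A trivial sketch (the BLEU numbers are made up for illustration):

```python
def best_epoch(scores):
    """scores: {epoch: BLEU on the in-domain test set}; highest wins."""
    return max(scores, key=scores.get)

# Illustrative stage-1 (mixed-data) scores; real values would come from
# decoding the test set with each of the 13 checkpoints.
stage1_bleu = {1: 21.3, 5: 27.8, 9: 30.1, 13: 29.4}
start_from = best_epoch(stage1_bleu)  # continue in-domain training from here
```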

If you care to try this method, I’d be keen to see how it stacks up to your approach. 🙂


What kind of BLEU did you obtain?

As I mentioned, slightly different data, so not comparable, but the best general was around 30, while the best adapted was 37.

Furthermore, I’m not convinced that BLEU is the best metric, especially how it’s used by nearly everyone in both industry and academia today. (When is the last time you saw a paper that used BLEU with multiple reference segments per hypothesis?)


Of course! I would be glad to get another evaluation tool… as (nearly) said in my first post above.

Just as a comparison, a Moses model trained on the whole data set (59.3k + 14.8k) plus Europarl is evaluated at a BLEU of 31.45. At this time, the ONMT model is at epoch 8 with a BLEU of 47.01, and still growing.


Let me take the problem from another side…

I have been experimenting with NMT (via ONMT) for several weeks. NONE of the networks I built was able to sustain a long training without overfitting; this is a fact widely acknowledged in NMT publications. The networks I built with w2v embeddings ARE able to run long and very long trainings in a completely stable way.

My original goal was to LET the network change the embeddings AFTER a first w2v-coupling run. In that case, what is the main difference between a random initialization of the network and embeddings, and a w2v-coupling initialization? In principle, it’s just another way to start with some network and some embeddings, whatever they are…
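To make the comparison concrete, here is a hedged sketch of the two initializations side by side. `init_embeddings` is a hypothetical helper in pure Python; a real setup would serialize such a table into the tensor format that ONMT’s pre-trained-embedding options expect:

```python
import random

def init_embeddings(vocab, dim, pretrained=None, seed=0):
    """One vector per word: copied from `pretrained` when available
    (the w2v-coupling start), otherwise drawn uniformly at random."""
    rng = random.Random(seed)
    table = {}
    for word in vocab:
        if pretrained is not None and word in pretrained:
            table[word] = list(pretrained[word])   # w2v-coupling init
        else:
            table[word] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return table
```

With `pretrained=None` every word gets a random vector (the standard start); with a w2v table, known words start from trained vectors and only OOV words fall back to random.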

The bad thing now is that, after the first coupling run (done with the fixed-embeddings option), ONMT doesn’t want to forget about this fixed option, as said here:

I have to live with it until this is changed in the code…

Unless I am wrong, embedding pre-training is on the roadmap. But the best way to prove the gain is to make a fair comparison with a baseline. Results can be great, with high BLEU scores, but you always need a baseline to compare against.

In your scenario, the baseline would be: Europarl + 10x the 59K in-domain data as the training set.
Run it for 8 epochs with the same parameters and show PPL + BLEU.
Then we can compare with your scenario.


You might also consider building separate Moses models and doing a linear interpolation of them (interpolation of LMs is fairly straightforward; interpolation of phrase tables can be messier); it’s the best SMT analogy I can think of to what we’re both trying to achieve with the “domain adaptation” of a general NMT model.
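The LM side of that interpolation amounts to p(w|h) = lam * p_in(w|h) + (1 - lam) * p_gen(w|h), with the mixture weight typically tuned on held-out in-domain data. A minimal sketch under that assumption (function names are illustrative, and a real setup would use the Moses tooling):

```python
import math

def interpolate(p_in, p_gen, lam):
    """p(w|h) = lam * p_in(w|h) + (1 - lam) * p_gen(w|h)."""
    return lam * p_in + (1.0 - lam) * p_gen

def pick_lambda(dev_events, lams=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """dev_events: (p_in, p_gen) pairs for each held-out in-domain token.
    Return the mixture weight with the highest log-likelihood
    (equivalently, the lowest perplexity) on that held-out data."""
    def loglik(lam):
        return sum(math.log(interpolate(pi, pg, lam)) for pi, pg in dev_events)
    return max(lams, key=loglik)
```

If the in-domain LM consistently assigns higher probability to the held-out tokens, tuning pushes the weight toward the in-domain model, which is exactly the adaptation effect being discussed.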

Also, counts matter for Moses, so it’s probably best to multiply the in-domain data if you take the single-model (vs interpolation) approach.

I really enjoy exchanging about what could be done to make a clean, incontestable proof. It’s really interesting!

This said, keep in mind that I’m neither publishing nor trying to convince you. My own opinion is settled, and I have a procedure (except for the fixed option that ONMT doesn’t want to forget).

My only goal here is to share my experiment and my point of view with you, argue a bit about it, and have this nice open discussion. It’s up to everyone to be convinced or not, and to run their own tests or not… clean or not.


I’ll certainly be trying out this approach. It has been one of the issues on my mind as I need to be able to offer customers the possibility of having a custom-built engine.


@tel34, don’t hesitate to try the w2v-coupling procedure on standard training cases. It’s worth it…
…of course, nothing obliges you to believe me.


You are right. Training Moses with 10x60k+Europarl gives me a BLEU of 39.98.
Good !

@dbl, @vince62s, thanks to your exchanges, I looked more closely at how Moses was evaluated. I discovered that the test set used to compute its BLEU contains 1) a large portion of sentences really different from those in the training set, and 2) a lot of irrelevant short lines and various kinds of errors. So, I’m currently doing a finer cross-comparison.