Multilingual model


I’ve been training lots of models (ENG to lots of languages), but now I need to do the other way around for all of those languages.

What would be the best approach?

  1. One Multilingual model for all of them where I use a token before each segment to represent the input source?

  2. Many multilingual models?

And what is the best way to build the validation set for a multilingual model? (5,000 segments split among all languages?)

Best regards,

Hi Samuel,

I am sure you are aware of mBART. In the relevant paper, the authors tried to answer similar questions.

All the best,

Using a token at the beginning seems like a straightforward way to do multilingual models. In my experience it’s been difficult to train an OpenNMT model with multiple types of functionality.

<to-es> Hello World -> Hola Mundo
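For illustration, here is a minimal Python sketch of how the token could be prepended during preprocessing. The helper name `tag_source` and the `to` prefix are made up for this example; they are not part of OpenNMT.

```python
def tag_source(segment: str, lang: str, prefix: str = "to") -> str:
    """Prepend a language token such as <to-es> to a source segment.

    `prefix` and the token format are illustrative assumptions; any
    string works as long as it is used consistently in training and
    at inference time.
    """
    return f"<{prefix}-{lang}> {segment.strip()}"


if __name__ == "__main__":
    examples = [("Hello World", "es"), ("Hello World", "fr")]
    for text, lang in examples:
        print(tag_source(text, lang))
```

For the reverse direction (many languages into English), the same idea applies with a token naming the input language instead, e.g. `tag_source(text, "es", prefix="from")`. Either way, the token probably needs to survive tokenization as a single unit, so it is worth checking how your subword model treats it.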


Hello Yasmin,

Indeed, I have seen that before, but I wasn’t sure whether anyone had tried it with OpenNMT.

Argo seems to have answered that part of the question! But I might try it anyway, since the data required for both approaches is the same.

As for the validation set, I will try increasing it to 10k segments, sampled randomly from each per-language validation set (using 20 languages or so).
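A rough sketch of that sampling step, assuming each per-language validation set is already loaded as a list of (source, target) pairs. The function name `build_validation_set` and the equal-quota rule are my own assumptions, not something OpenNMT provides.

```python
import random


def build_validation_set(per_language, total=10_000, seed=0):
    """Draw an equal share of pairs from each language's validation set.

    per_language: dict mapping a language code to a list of
    (source, target) pairs. If a language has fewer pairs than its
    quota, all of its pairs are used, so the result can be smaller
    than `total`.
    """
    rng = random.Random(seed)  # fixed seed keeps the set reproducible
    quota = total // len(per_language)
    mixed = []
    for lang in sorted(per_language):
        pairs = per_language[lang]
        mixed.extend(rng.sample(pairs, min(quota, len(pairs))))
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    toy = {
        "es": [(f"es src {i}", f"en tgt {i}") for i in range(50)],
        "fr": [(f"fr src {i}", f"en tgt {i}") for i in range(30)],
    }
    print(len(build_validation_set(toy, total=40)))
```

With 20 languages and 10k segments, this gives 500 segments per language. Whether an equal split is better than a split proportional to the training data is an open choice here.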

Hi Samuel!

Another point to take into consideration is dataset weights, if the data size is not the same for each language. In this case, you have to apply over-sampling. See the discussion here.
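As a rough illustration of what such weights could look like, the sketch below computes a repeat factor per language so that smaller corpora are seen more often. The rule used (largest size divided by corpus size, with a cap) is only one simple assumption, not the method from the linked discussion, and `oversampling_factors` is a hypothetical helper.

```python
def oversampling_factors(sizes, cap=10):
    """Return an integer repeat factor per language.

    sizes: dict mapping a language code to its number of training
    segments. The largest corpus gets factor 1; smaller corpora get
    round(largest / size), limited to `cap` so that tiny corpora are
    not repeated so often that the model memorizes them.
    """
    largest = max(sizes.values())
    return {
        lang: min(cap, max(1, round(largest / n)))
        for lang, n in sizes.items()
    }


if __name__ == "__main__":
    sizes = {"fr": 1_000_000, "is": 100_000, "mt": 20_000}
    print(oversampling_factors(sizes))
```

The resulting factors could then be used as per-dataset weights in whatever form your training setup expects.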

I also collected some notes here:

This would be a very interesting experiment; please keep us posted.

All the best,

I will read all this and give feedback when I have done my testing!

It might take a few months, as it’s not my immediate priority, but I like to figure out my next steps ahead of time, so I can think them over and improve them even before I start coding :smile:

Hello Yasmin,

Your documentation is really well made! So much that I thought it was worth an additional comment :+1:

Best regards,

Very exciting experiments there, Samuel.

I just thought of something that I don’t believe I have seen any paper about…

What if I structure my sources like this:

Source1 <tag> Source2 <tag> Source3 = Target

  • Where each Source represents a different language.
  • <tag> is just a separator so the model knows when it is looking at a different source.
  • In all segments, each source needs to remain in exactly the same position and not get mixed up.

Of course, within the same segment all sources would be translations of the corresponding target.

Through data augmentation I would generate all combinations and leave some sources blank, so that the model can work even if only one of the sources is provided. See below:
Source1 <tag> <tag> Source3 = Target
Source1 <tag> Source2 <tag> = Target
<tag> Source2 <tag> Source3 = Target
<tag> <tag> Source3 = Target
<tag> Source2 <tag> = Target
Source1 <tag> <tag> = Target
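The augmentation above could be generated with something like the following sketch. The function name `multi_source_variants` is made up for this example, and the literal `<tag>` separator is taken from the layout above.

```python
from itertools import product


def multi_source_variants(sources, tag="<tag>"):
    """Return every variant with at least one source kept.

    Each source keeps its fixed slot; a dropped source leaves its slot
    empty, so the separators still mark the position. Whitespace is
    normalized to single spaces, including inside the sources.
    """
    variants = []
    for mask in product([True, False], repeat=len(sources)):
        if not any(mask):
            continue  # skip the variant with no source at all
        slots = [src if keep else "" for src, keep in zip(sources, mask)]
        joined = f" {tag} ".join(slots)
        variants.append(" ".join(joined.split()))
    return variants


if __name__ == "__main__":
    for variant in multi_source_variants(["Source1", "Source2", "Source3"]):
        print(variant, "= Target")
```

With three sources this yields 7 variants (the full one plus the six listed above). The count grows as 2^n - 1, so with many source languages it would probably be necessary to sample a subset of the combinations.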

I believe that by doing so, when multiple sources are provided, the model would get extra information and become more “accurate”. This would be especially true for feminine/masculine forms and plurals, but also for words whose meaning varies with context.

Any thoughts?

Hi @SamuelLacombe ,

A possible issue I see in this approach is the excessive length of the source sentences which could make training unmanageable.