Hello,
It’s been a while, but I have some news.
So I managed to create a multilingual model that supports n * (n - 1) language pairs. In my case, n = 61, so my model supports 3,660 language pairs.
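To make the pair count concrete, here is a small sketch (the language codes below are placeholders, not my actual list) of how every ordered source/target pair is enumerated:

```python
from itertools import permutations

# Placeholder language codes; my actual setup has 61 languages, so the
# number of ordered (source, target) pairs is n * (n - 1) = 61 * 60 = 3660.
langs = ["en", "fr", "de", "es"]

pairs = list(permutations(langs, 2))  # every ordered (source, target) pair
print(len(pairs))                     # 12 here; 3660 with 61 languages
```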
My data corpora range in size from medium to small to tiny.
My original data source is English, and all the other languages are translated from the English, so they all share the same source.
I generated all possible sentence alignments between the various language pairs, and I assigned the weights so that all pairs appear in equal quantities during training.
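For clarity, here is a minimal sketch of that equal-quantity weighting idea (not my actual training config; the pair sizes are made up), where each pair gets a sampling weight inversely proportional to its corpus size:

```python
# Hypothetical sentence-pair counts per language pair.
pair_sizes = {
    ("en", "fr"): 500_000,
    ("en", "sw"): 20_000,
    ("fr", "sw"): 20_000,
}

# Inverse-size weights: after weighted sampling, every pair is seen in
# roughly equal quantity during training, regardless of corpus size.
total = sum(1.0 / n for n in pair_sizes.values())
weights = {pair: (1.0 / n) / total for pair, n in pair_sizes.items()}

for pair, w in weights.items():
    print(pair, round(w, 3))
```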
After getting the results, I picked 6 language pairs spanning the range: the most data, the least data, and amounts in between. For each one, I trained a single-pair model with the exact same data that was used in the multilingual model (same training/validation/testing files).
Note: the multilingual model was a big Transformer and the single-pair models are regular Transformers.
I used SentencePiece with about the same vocabulary size as multilingual BERT.
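Roughly, the shared vocabulary could be built like this (a sketch only: the file names, vocab_size, and other options are placeholders rather than my exact configuration; multilingual BERT's vocabulary is around 120k pieces):

```python
import sentencepiece as spm

# Train one SentencePiece model on text from all languages mixed together,
# so that subwords are shared across languages.
spm.SentencePieceTrainer.train(
    input="all_languages_combined.txt",  # placeholder: concatenation of all languages
    model_prefix="multi61",
    vocab_size=120000,                   # roughly the multilingual BERT range
    character_coverage=0.9995,           # keep rare characters from small languages
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="multi61.model")
print(sp.encode("This is a test.", out_type=str))
```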
Results:
The multilingual model was really good for all the pairs I tried.
On the validation set, the language pairs with the most data performed about 6 BLEU points lower than the corresponding single-pair models. The middle ones were on par, and the ones with the least data sometimes performed 20 BLEU points above the single-pair models.
My current conclusion is that I should try giving each pair a weight proportional to its number of sentences (see the sketch below). I should also reduce the number of tokens in my SentencePiece configuration, since I noticed that most words were tokenized as a single token. But I still want the tokenization to be shared across languages so that the model can build on those similarities.
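As a sketch of that alternative weighting (same made-up pair sizes as in the earlier example), the weight would simply follow the corpus size instead of compensating for it:

```python
# Hypothetical sentence-pair counts per language pair.
pair_sizes = {
    ("en", "fr"): 500_000,
    ("en", "sw"): 20_000,
    ("fr", "sw"): 20_000,
}

# Proportional weights: pairs with more data are sampled more often, so the
# high-resource pairs are no longer starved relative to their single models.
total = sum(pair_sizes.values())
weights = {pair: n / total for pair, n in pair_sizes.items()}

for pair, w in weights.items():
    print(pair, round(w, 3))
```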
I will keep updating as I work my way through this. If anyone has suggestions, make sure to let me know!
Annex:
This graph shows the BLEU scores for my language pair with the most data (multi vs. single model). About 2,000 sentences were used for testing.
This graph shows a mid-size language pair (multi vs. single model). About 2,000 sentences were used for testing.
This graph shows one of the smallest language pairs (multi vs. single model). About 2,000 sentences were used for testing.