Using a token at the beginning seems like a straightforward way to do multilingual models. In my experience it’s been difficult to train an OpenNMT model with multiple types of functionality.
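For reference, this is roughly what the token approach looks like in practice (just a sketch; the file names and token format here are made up):

```python
# Rough sketch: prepend a target-language token to every source line, so a
# single model can be trained on several language pairs at once.
def tag_source_lines(src_path, out_path, tgt_lang):
    token = f"<2{tgt_lang}>"  # e.g. "<2fr>" asks the model for French output
    with open(src_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            out.write(f"{token} {line.lstrip()}")

# Hypothetical usage:
# tag_source_lines("train.en", "train.en.tagged", "fr")
```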
Another point to take into consideration is dataset weights, if the data size is not the same for each language. In this case, you have to apply over-sampling. See the discussion here.
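To illustrate the over-sampling idea, here is a rough sketch of the principle (not OpenNMT code, just the idea written out by hand): repeat examples from the smaller corpora so that each language contributes roughly the same number of segments per epoch. If I recall correctly, OpenNMT-py can also handle this through per-corpus weights in the training config.

```python
import random

def oversample(corpora):
    """corpora: dict mapping a corpus name to a list of (src, tgt) pairs."""
    # Bring every corpus up to the size of the largest one.
    target_size = max(len(pairs) for pairs in corpora.values())
    balanced = {}
    for name, pairs in corpora.items():
        repeats, remainder = divmod(target_size, len(pairs))
        sampled = pairs * repeats + random.sample(pairs, remainder)
        random.shuffle(sampled)
        balanced[name] = sampled
    return balanced
```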
I also collected some notes here:
This would be a very interesting experiment; please keep us posted.
I will read all this and give feedback when I have done my testing!
It might take a few months since it's not my immediate priority, but I like to figure out my next steps ahead of time, so I can think them over and refine them even before I start coding.
I just thought about something which I don't believe I've seen any paper about…
What if I structure my sources like this:
Source1 <tag> Source2 <tag> Source3 = Target
Where each Source represents a different language.
<tag> is just a separator so the model knows when it is looking at a different source.
In all the segments, each source needs to stay in exactly the same slot and not get mixed up.
Of course, within the same segment all sources would be translations of the corresponding target.
Through data augmentation I would generate all combinations, leaving some sources blank, so that the model can still work even if only one of the sources is provided. See the examples below (and the sketch after them):
Source1 <tag><tag> Source3 = Target
Source1 <tag> Source2 <tag> = Target
<tag> Source2 <tag> Source3 = Target
<tag><tag> Source3 = Target
<tag> Source2 <tag> = Target
Source1 <tag><tag> = Target
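Something like this is what I have in mind for generating the combinations (just a sketch, the function name is made up):

```python
from itertools import product

def make_multisource_examples(sources, target, sep="<tag>"):
    """sources: parallel source strings (one per language); target: target string.
    Emit every combination of present/blank sources, keeping each source in its
    fixed slot so the model always knows which language sits where."""
    examples = []
    for mask in product([True, False], repeat=len(sources)):
        if not any(mask):  # skip the case where every source is blank
            continue
        slots = [src if keep else "" for src, keep in zip(sources, mask)]
        examples.append((f" {sep} ".join(slots).strip(), target))
    return examples

# Example with three sources aligned to one target:
# make_multisource_examples(["The cat", "Le chat", "Die Katze"], "El gato")
```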
I believe that by doing so, when I provide multiple sources, the model would get extra information and become more "accurate". This would be true especially for feminine/masculine forms and plurals, but also for words whose meaning varies with context.