I just thought of something that I don’t believe I have seen any paper about…
What if I structure my sources like this:
Source1 <tag> Source2 <tag> Source3 = Target
Where each Source represents a different language.
<tag> is just a separator so the model knows when we are looking at a different source.
In every segment, each source needs to stay in exactly the same position, so the languages never get mixed up.
Of course, within a segment every source would be a translation of the corresponding target.
Through data augmentation I would generate all combinations, leaving some of the sources blank, so that the model still works even when only one of the sources is provided. See below:
Source1 <tag><tag> Source3 = Target
Source1 <tag> Source2 <tag> = Target
<tag> Source2 <tag> Source3 = Target
<tag><tag> Source3 = Target
<tag> Source2 <tag> = Target
Source1 <tag><tag> = Target
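A rough sketch of how I might generate those combinations in Python. This assumes three sources and the literal string <tag> as the separator; the function name augment is made up for illustration and is not from any toolkit:

```python
from itertools import product

TAG = "<tag>"  # separator token; the exact string is an assumption


def augment(sources, target, tag=TAG):
    """Return (source_line, target) pairs for every non-empty subset of
    sources, keeping each source in its fixed slot."""
    examples = []
    # Each mask says which slots are kept (True) and which are blanked (False).
    for mask in product([True, False], repeat=len(sources)):
        if not any(mask):
            continue  # skip the variant where every source is blank
        slots = [s if keep else "" for s, keep in zip(sources, mask)]
        # Pad kept sources with spaces; blank slots leave the tags adjacent.
        line = tag.join(f" {s} " if s else "" for s in slots).strip()
        examples.append((line, target))
    return examples


if __name__ == "__main__":
    for src, tgt in augment(["Source1", "Source2", "Source3"], "Target"):
        print(f"{src} = {tgt}")
```

With three sources this gives 2^3 - 1 = 7 variants per segment, so the training data grows accordingly.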
I believe that, when I provide multiple sources, the model would get extra information and so become more “accurate”. This would be especially true for feminine/masculine and plural forms, but also for words whose meaning varies with context.