OpenNMT Forum

CCMatrix: A billion-scale bitext data set for training translation models

Wanted to bring this to attention if people have not seen it:

P.S. To clarify, I am not affiliated with FB and did not contribute to this work. I just thought fellow OpenNMT users would find it useful.

We are working on a similar task.

We will actually post some results on a multilingual model covering 5 languages (20 pairs). Interestingly, it does better than pivoting and slightly worse than regular per-pair training.
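For anyone wondering where 20 comes from: it matches the count of translation directions if every ordered (source, target) combination of 5 languages is treated as its own pair. A quick sketch, where the language codes are just placeholders since the post does not say which five languages were used:

```python
from itertools import permutations

# Hypothetical language codes; the actual five languages are not stated in the thread.
languages = ["en", "fr", "de", "es", "it"]

# Treat each ordered (source, target) combination as a separate translation direction.
pairs = list(permutations(languages, 2))

print(len(pairs))  # 20, i.e. 5 * 4 directions
```

Counting unordered pairs instead would give 10, so the 20 presumably counts both directions of each pair.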

The quality isn’t that high, given the “technical error” of the system from Facebook. The pairs can’t be of high quality because the sentences were never meant to be translations of each other, in contrast to transcribed TED Talks or translated documents, which are what parallel text miners should focus on. I think the research result only shows that the compared systems can be regarded as low-data training.