Training data for Live System

Could you please tell us which training data, and how much of it, was used for the Dutch-English pair in the Live System? Some of the output is very, very good :-).

We’re glad to hear it!
The training data was about 2.1M sentences, drawn mostly from the Europarl, JRC-Acquis, EAC, Tatoeba and TED Talks corpora.

Hi @arebollo-systran!

I’m impressed by how well the live demo works for Polish=>English translation. In my opinion it currently outperforms Google Translate (which is still SMT for that language pair) and is on par with Microsoft Translator.

I’m curious whether the live demo uses any closed-source changes in the preprocess/train/translate code, or whether it is a vanilla OpenNMT installation trained on publicly available parallel corpora.

For Polish-English there are few parallel corpora. How many sentences are you using for training that language pair?

Thanks for your great work!


Hi @arebollo-systran
Do you have any comparative data between your SMT engines and your “pure” NMT engine? Since you (probably) used the same training data, that would be interesting.

Hello Bartek,
Thank you for your encouragement!
For Polish>English the corpus was around 1.7 million sentences, mostly coming from publicly available sources.
As for the code, we do use some in-house preprocessing and translation layers.

Hi Terence,
Actually, the data used often differs between our SMT and NMT systems (selection, cleaning process, tokenization), which means that we don’t really have engines that are entirely comparable.
But indeed, it needs to be done…

I would like to know the specifications of the corpus used to train the Persian-English translation model: its size, the number of sentences, the source of the Persian sentences, and any other useful information. I would also like to know whether you could share the Persian-English corpus with me. I am an Assistant Professor and would use it only for educational purposes. I can give you more information if needed. Many thanks.