Training data for Live System

tel34 · January 30, 2017, 10:28am

Hi,
Is it possible to know, please, which training data, and how much, was used for the Dutch-English pair in the Live System? Some of the output is very very good :-).
Terence

arebollo-systran · January 31, 2017, 10:23am

Hello,
We’re glad to hear it!
The training data was about 2.1M sentences, mostly from the Europarl, JRC-Acquis, EAC, Tatoeba and TED Talks corpora.
Anabel

bartek · February 2, 2017, 3:39pm

Hi @arebollo-systran!

I’m impressed by how well the live demo works for Polish=>English translation. In my opinion it currently outperforms Google Translate (which is still SMT for that language pair) and is on par with Microsoft Translator.

I’m curious whether the live demo uses any closed-source changes in the preprocess/train/translate code, or whether it is a vanilla OpenNMT installation trained on publicly available parallel corpora.

For Polish-English there are few parallel corpora. How many sentences are you using for training that language pair?

Thanks for your great work!

Bartek

tel34 · February 21, 2017, 10:32am

Hi @arebollo-systran
Do you have any comparative data between your SMT engines and your “pure” NMT engine? Seeing that you (probably) used the same training data, that would be interesting.
Thanks,
Terence

arebollo-systran · February 24, 2017, 4:50pm

Hello Bartek,
Thank you for your encouragement!
For Polish>English the corpus was around 1.7 million sentences, mostly coming from publicly available sources.
As for the code, we do use some in-house preprocessing and translation layers.
Thanks,
Anabel

arebollo-systran · February 24, 2017, 5:01pm

Hi Terence,
Actually, the data used is often different between our SMT and NMT systems (selection, cleaning process, tokenization), which means that we don’t really have engines that are entirely comparable.
But indeed, it needs to be done…
Anabel

nrazavi · March 10, 2017, 5:17pm

I would like to know the specifications of the corpus used to train Persian-English translation, such as its size, the number of sentences, the source of the Persian sentences and any other useful information. Also, I would like to know if it is possible for you to share the Persian-English corpus with me. I am an Assistant Professor and I want to use it for educational purposes only. I can give you more information if needed. Many thanks.