Getting training data for other examples

jesmitty · August 23, 2017, 3:17pm

The OpenNMT site describes an example set of training data for German. It has you run wget "https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz". How do you get training data for other languages? I can’t do a wget "https://s3.amazonaws.com/opennmt-trainingdata/*.tgz". I want to translate from French, Italian, Portuguese and Spanish for publications with limited and custom vocabularies.

huache · August 24, 2017, 3:32am

You can try the European Parliament Proceedings Parallel Corpus (Europarl) and the Open Parallel Corpus (OPUS).

Training data may be the hardest problem when you want to build an MT system for production. In most situations we have to prepare it all by ourselves.
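To make "prepare it ourselves" a bit more concrete, here is a minimal sketch of a first cleaning pass over a line-aligned corpus, where line N of the source file translates line N of the target file. The function name `clean_pairs` and the thresholds `max_ratio` and `max_tokens` are assumptions chosen for illustration, not anything OpenNMT prescribes; tune them for your own language pair.

```python
from typing import Iterable, Iterator, Tuple


def clean_pairs(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    max_ratio: float = 3.0,  # assumed threshold on the length ratio
    max_tokens: int = 100,   # assumed cap on sentence length
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs that pass basic sanity checks.

    Note: zip() stops at the shorter input, so check beforehand that
    both sides have the same number of lines.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        # Whitespace token counts are a rough proxy for sentence length.
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src > max_tokens or n_tgt > max_tokens:
            continue
        # A very lopsided length ratio often signals a misaligned pair.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        yield src, tgt


# Small in-memory example; with real data you would pass two open files.
french = ["Bonjour le monde", "", "Oui"]
english = ["Hello world", "Empty source", "Yes this is a very long mismatched target"]
kept = list(clean_pairs(french, english))
```

Only the first pair survives here: the second has an empty source, and the third has a one-word source against an eight-word target. Filters like this are crude, so it is worth inspecting a sample of what gets dropped before trusting them.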

Good luck!

tel34 · August 25, 2017, 8:18am

We have to prepare it all by ourselves in most situations
This is exactly what I’m doing to build an Indonesian<>English corpus. The open parallel corpus is great for R&D purposes, but you can’t have a “production translation” reading like a character in an American movie :-). I’m assembling a corpus from many different sources, and even hand-crafting sentences from standard grammars and primers with exercises and answers.
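For anyone assembling a corpus from several sources like this, a rough sketch of the merge step might look as follows. It assumes each source has already been loaded as a list of (source, target) sentence pairs; the helper names `merge_sources`, `split_train_valid` and `write_parallel` are hypothetical, and the case-insensitive duplicate check is just one possible choice.

```python
import random
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def merge_sources(sources: Iterable[Iterable[Pair]]) -> List[Pair]:
    """Concatenate several sources, dropping duplicate pairs.

    Duplicates are detected case-insensitively after trimming whitespace
    (an assumption; stricter or looser matching may suit your data).
    The first occurrence of each pair is kept.
    """
    seen = set()
    merged: List[Pair] = []
    for source in sources:
        for src, tgt in source:
            src, tgt = src.strip(), tgt.strip()
            key = (src.lower(), tgt.lower())
            if key in seen:
                continue
            seen.add(key)
            merged.append((src, tgt))
    return merged


def split_train_valid(
    pairs: List[Pair], valid_size: int, seed: int = 1234
) -> Tuple[List[Pair], List[Pair]]:
    """Shuffle with a fixed seed and hold out valid_size pairs."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[valid_size:], shuffled[:valid_size]


def write_parallel(pairs: Iterable[Pair], src_path: str, tgt_path: str) -> None:
    """Write pairs as two line-aligned UTF-8 text files."""
    with open(src_path, "w", encoding="utf-8") as f_src, \
         open(tgt_path, "w", encoding="utf-8") as f_tgt:
        for src, tgt in pairs:
            f_src.write(src + "\n")
            f_tgt.write(tgt + "\n")


# Two tiny made-up sources with one overlapping pair.
primer = [("Selamat pagi", "Good morning"), ("Terima kasih", "Thank you")]
subtitles = [("selamat pagi ", "good morning"), ("Apa kabar?", "How are you?")]

merged = merge_sources([primer, subtitles])
train, valid = split_train_valid(merged, valid_size=1)
```

From here, `write_parallel(train, "src-train.txt", "tgt-train.txt")` would produce a pair of aligned files; those file names are placeholders, so use whatever your preprocessing step expects. Fixing the seed keeps the held-out set stable as you add more sources, as long as the merged list itself does not change order.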