The OpenNMT site describes an example set of training data for German. It has you run wget “https://s3.amazonaws.com/opennmt-trainingdata/wmt15-de-en.tgz”. How do you get training data for other languages? I can’t run wget “https://s3.amazonaws.com/opennmt-trainingdata/*.tgz”. I want to translate from French, Italian, Portuguese and Spanish for publications with limited and custom vocabularies.
You can try the European Parliament Proceedings Parallel Corpus (Europarl) and the Open Parallel Corpus (OPUS).
Training data may be the hardest problem when you want to build an MT system for production. In most situations we have to prepare it all by ourselves.
Good luck!
We have to prepare it all by ourselves in most situations
This is exactly what I’m doing to build an Indonesian<>English corpus. The open parallel corpus is great for R&D purposes but you can’t have a “production translation” reading like a character in an American movie :-). I’m assembling a corpus from many different sources, and even hand crafting sentences from standard grammars and primers with exercises & answers.
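In case it helps anyone assembling a corpus the same way, here is a minimal sketch of merging sentence pairs from several sources into one aligned source/target file pair. The function names (`read_parallel`, `merge_corpora`, `write_parallel`), the file names in the usage note, and the `max_ratio` length filter are my own illustrative assumptions, not anything OpenNMT provides. It only assumes each source is a pair of line-aligned plain-text files.

```python
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def read_parallel(src_path: str, tgt_path: str) -> Iterator[Pair]:
    """Yield sentence pairs from two line-aligned UTF-8 text files."""
    with open(src_path, encoding="utf-8") as src, \
         open(tgt_path, encoding="utf-8") as tgt:
        for s, t in zip(src, tgt):
            yield s.strip(), t.strip()


def merge_corpora(sources: Iterable[Iterable[Pair]],
                  max_ratio: float = 3.0) -> List[Pair]:
    """Merge several sources, dropping empty, lopsided and duplicate pairs.

    max_ratio is a hypothetical heuristic: pairs whose word counts differ
    by more than this factor are treated as probable misalignments.
    """
    seen = set()
    merged: List[Pair] = []
    for pairs in sources:
        for s, t in pairs:
            # Collapse runs of whitespace so near-identical lines compare equal.
            s, t = " ".join(s.split()), " ".join(t.split())
            if not s or not t:
                continue
            n_src, n_tgt = len(s.split()), len(t.split())
            if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
                continue
            key = (s.lower(), t.lower())  # case-insensitive duplicate check
            if key in seen:
                continue
            seen.add(key)
            merged.append((s, t))
    return merged


def write_parallel(pairs: Iterable[Pair], src_out: str, tgt_out: str) -> None:
    """Write pairs back out as two line-aligned files, one sentence per line."""
    with open(src_out, "w", encoding="utf-8") as src, \
         open(tgt_out, "w", encoding="utf-8") as tgt:
        for s, t in pairs:
            src.write(s + "\n")
            tgt.write(t + "\n")
```

Usage would look something like `write_parallel(merge_corpora([read_parallel("a.id", "a.en"), hand_written_pairs]), "train.id", "train.en")`, where the file names and `hand_written_pairs` are placeholders for your own data. The length-ratio filter is deliberately crude, so tune or remove it for your language pair.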