Where do you get training data?


(Jeremy) #1

How do I get data to train OpenNMT?
I mainly need English and French, but I have never used AI before and I don't really know where or how to start.

I downloaded multiple files from http://opus.nlpl.eu but they come in xml.gz.tmp format and I don't know how to process them.

Thanks for your help folks!


(Some Aditya Mandal) #2

Hi,

You need to get a publicly available parallel corpus that is sentence-aligned, normalized and tokenized (you can do this yourself if you want).

You can use the corpora that were used for WMT-2017 (http://www.statmt.org/wmt17/), the Europarl corpus (http://www.statmt.org/europarl/), or WMT’14 (http://www.statmt.org/wmt14/translation-task.html).

For training, src-train.txt should contain the training data in the source language and tgt-train.txt the training data in the target language. For a decent NMT model, this data should have at least 2.5 million sentence pairs.

The validation texts, src-valid.txt and tgt-valid.txt, should typically consist of around 500 to 3,000 sentences. These validation texts are used for evaluating the convergence of the model. You can create validation texts on your own if they are not available in your corpus; just make sure the sentences are sentence-aligned, normalized and tokenized.

And the test data could be something around 2,500 sentences.
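To make the file layout concrete, here is a minimal sketch of splitting an already aligned, tokenized corpus into the train/valid/test files named above. The corpus file names (europarl.en / europarl.fr) and the split sizes are placeholders; adapt them to your own data.

```python
# Minimal sketch: split a sentence-aligned parallel corpus into the
# src-/tgt- train, valid and test files described above.
# "europarl.en" / "europarl.fr" are placeholder file names for your corpus.

VALID_SIZE = 3000   # validation set (500-3,000 sentences is typical)
TEST_SIZE = 2500    # held-out test set

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

src = read_lines("europarl.en")   # source side (already normalized/tokenized)
tgt = read_lines("europarl.fr")   # target side, line-aligned with the source
assert len(src) == len(tgt), "corpus sides must be sentence-aligned"

# In practice you would shuffle the sentence pairs (with a fixed seed)
# before splitting, so the test set does not come from a single document.
splits = {
    "test":  (0, TEST_SIZE),
    "valid": (TEST_SIZE, TEST_SIZE + VALID_SIZE),
    "train": (TEST_SIZE + VALID_SIZE, len(src)),
}

for name, (start, end) in splits.items():
    with open(f"src-{name}.txt", "w", encoding="utf-8") as fs, \
         open(f"tgt-{name}.txt", "w", encoding="utf-8") as ft:
        fs.write("\n".join(src[start:end]) + "\n")
        ft.write("\n".join(tgt[start:end]) + "\n")
```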


(Jeremy) #3

Thanks, I get it now :smiley: !

If I need words about food and medical terms, can I just add more sentences at the end, like "Sliced potatoes", "Lightly seasoned peanuts", etc., to help train for this more specifically?


(Bachstelze) #4

As someaditya already said, you need a bilingual or multilingual sentence-aligned corpus to train your model. If there is no data available for your domain, you can handle your case as low-resource translation and use monolingual training or decoding methods:

http://www.statmt.org/wmt17/pdf/WMT15.pdf

http://www.nlpr.ia.ac.cn/cip/ZhangPublications/emnlp2016-jjzhang.pdf

Recent research favors monolingual back-translation to improve the bilingual model during training.

In the end you could just build a synthetic, pseudo corpus relying on existing NMT systems and invest as much as you can in reducing the errors in the machine translations.
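A rough sketch of what building such a pseudo corpus by back-translation could look like, assuming you already have a target-to-source model of some kind. The `back_translate` function is purely a placeholder for whatever system you use, and the file paths are assumptions.

```python
# Sketch of building a synthetic (pseudo) parallel corpus via back-translation.
# Assumes an existing target->source model; `back_translate` is a placeholder
# for however you invoke it (a translation script, an API, etc.).

def back_translate(target_sentence):
    # Hypothetical hook: run your existing tgt->src NMT system on one sentence
    # and return the machine-translated source side.
    raise NotImplementedError("plug in your reverse-direction model here")

def build_synthetic_corpus(mono_tgt_path, out_src_path, out_tgt_path):
    """Pair monolingual target sentences with their back-translations."""
    with open(mono_tgt_path, encoding="utf-8") as f_in, \
         open(out_src_path, "w", encoding="utf-8") as f_src, \
         open(out_tgt_path, "w", encoding="utf-8") as f_tgt:
        for line in f_in:
            tgt_sentence = line.strip()
            if not tgt_sentence:
                continue
            src_sentence = back_translate(tgt_sentence)  # synthetic source side
            f_src.write(src_sentence + "\n")
            f_tgt.write(tgt_sentence + "\n")

# The synthetic pair files can then be concatenated with the real parallel data
# (src-train.txt / tgt-train.txt) before training the forward src->tgt model.
```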


(Bachstelze) #5

By the way, the xml.gz.tmp format looks like a temporary archive. Did you lose the connection while downloading?
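If the download did actually finish, you can check whether the file is a complete gzip archive and then simply rename it to drop the .tmp suffix. A small sketch, with the file name as a placeholder:

```python
# Sketch: check whether a ".xml.gz.tmp" file is actually a complete gzip archive.
# "en-fr.xml.gz.tmp" is a placeholder for the file you downloaded from OPUS.
import gzip

path = "en-fr.xml.gz.tmp"
try:
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for i, line in enumerate(f):
            print(line.rstrip())
            if i >= 10:          # just peek at the first few lines
                break
    print("Looks like a valid gzip file; rename it to drop the .tmp suffix.")
except (OSError, EOFError) as err:
    print(f"Probably an incomplete download: {err}")
```

Also note that OPUS usually offers the same corpora in a Moses-style plain-text format (one aligned sentence per line), which is easier to feed into training than the XML files.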


(Some Aditya Mandal) #6

Yes you can, but you have to add it in both languages. You can even create a new corpus as re-training data. For instance, I used TED Talks transcripts to create a new corpus for Hindi-English.

In your case, you need to find a corpus that is in the food domain, or at least related to it. My research has shown that even if you re-train with a closely associated domain corpus, you can obtain favourable in-domain results. One idea: try to scrape bilingual data from food-related sites and apps like Yelp or Zomato and create a small corpus on your own.
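To illustrate "add it in both languages": the in-domain sentences have to be appended to the source and target training files in the same order, so the files stay line-aligned. A minimal sketch, where the in-domain file names are assumptions:

```python
# Minimal sketch: append in-domain sentence pairs to BOTH sides of the
# training data so src-train.txt and tgt-train.txt stay line-aligned.
# "food-domain.en" / "food-domain.fr" are assumed in-domain files you collected.

def append_parallel(extra_src, extra_tgt,
                    train_src="src-train.txt", train_tgt="tgt-train.txt"):
    with open(extra_src, encoding="utf-8") as f:
        src_lines = [l.rstrip("\n") for l in f if l.strip()]
    with open(extra_tgt, encoding="utf-8") as f:
        tgt_lines = [l.rstrip("\n") for l in f if l.strip()]
    assert len(src_lines) == len(tgt_lines), "in-domain data must stay sentence-aligned"
    with open(train_src, "a", encoding="utf-8") as f:
        f.write("\n".join(src_lines) + "\n")
    with open(train_tgt, "a", encoding="utf-8") as f:
        f.write("\n".join(tgt_lines) + "\n")

append_parallel("food-domain.en", "food-domain.fr")
```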