Where do I get training data?

How do I get data to train OpenNMT?
I mainly need English and French, but I have never used AI before and I don't really know where or how to start.

I downloaded multiple files from http://opus.nlpl.eu, but they come in xml.gz.tmp format and I don't know how to process them.

Thanks for your help folks!

Hi,

You need to get a publicly available parallel corpus that is sentence-aligned, normalized and tokenized (you can also do this yourself if you want).
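If your corpus is raw text, normalizing and tokenizing it can be done with a few lines of Python. This is only a sketch under my own assumptions: sacremoses is one common tool for this (it is not mentioned above), and raw.en / tok.en are placeholder file names.

# normalize and tokenize a raw English file line by line (sketch)
from sacremoses import MosesPunctNormalizer, MosesTokenizer

normalizer = MosesPunctNormalizer(lang="en")
tokenizer = MosesTokenizer(lang="en")

with open("raw.en", encoding="utf-8") as fin, open("tok.en", "w", encoding="utf-8") as fout:
    for line in fin:
        line = normalizer.normalize(line.strip())
        fout.write(tokenizer.tokenize(line, return_str=True) + "\n")

You would run the same thing on the French side with lang="fr" so both files stay consistent.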

You can use the corpora that were used for WMT 2017 (http://www.statmt.org/wmt17/), the Europarl corpus (http://www.statmt.org/europarl/), or WMT'14 (http://www.statmt.org/wmt14/translation-task.html).

For training, src-train.txt should contain the training data in the source language, and tgt-train.txt the training data in the target language. For a decent NMT model, this data should have at least 2.5M sentence pairs.

The validation texts, src-valid.txt and tgt-valid.txt, should typically consist of around 500 to 3,000 sentences. They are used to evaluate the convergence of the model. You can create the validation texts yourself if they are not available in your corpus; just make sure the sentences are parallel-aligned, normalized and tokenized.

And the test data could be around 2,500 sentences.
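If your corpus comes as two big aligned files, one way to carve out the splits described above could look like this. It is only a sketch: corpus.en / corpus.fr and the split sizes are placeholders, not anything prescribed by OpenNMT.

# split one aligned en/fr corpus into train, valid and test files (sketch)
import random

with open("corpus.en", encoding="utf-8") as f_en, open("corpus.fr", encoding="utf-8") as f_fr:
    pairs = list(zip(f_en.read().splitlines(), f_fr.read().splitlines()))

random.seed(0)
random.shuffle(pairs)

# 3,000 validation pairs, 2,500 test pairs, the rest for training
valid, test, train = pairs[:3000], pairs[3000:5500], pairs[5500:]

for name, subset in (("train", train), ("valid", valid), ("test", test)):
    with open(f"src-{name}.txt", "w", encoding="utf-8") as f_src, \
         open(f"tgt-{name}.txt", "w", encoding="utf-8") as f_tgt:
        for en, fr in subset:
            f_src.write(en + "\n")
            f_tgt.write(fr + "\n")

Shuffling before splitting keeps the validation and test sets representative of the whole corpus instead of just its last few documents.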


Thanks, I get it now :smiley:!

If I need vocabulary about food and medical terms, can I just add more sentences at the end, like
"Sliced potatoes", "Lightly seasoned peanuts", etc., to help train for this more specifically?

As someaditya already said, you need a bilingual or multilingual sentence-aligned corpus to train your model. If there is no data available for your domain, you can handle your case as low-resource translation and use monolingual training or decoding methods:

http://www.statmt.org/wmt17/pdf/WMT15.pdf

http://www.nlpr.ia.ac.cn/cip/ZhangPublications/emnlp2016-jjzhang.pdf

Recent research favors monolingual back-translation to improve the bilingual model during training.

In the end, you could just build a synthetic pseudo-corpus relying on existing NMT systems and invest as much as you can in reducing the errors in the machine translations.
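For illustration, a back-translation-style pseudo-corpus could be assembled along these lines. This is only a sketch: translate_fr_to_en is a placeholder for whatever existing NMT system you use, and mono.fr is a placeholder name for a monolingual French file.

# build a synthetic en-fr pseudo-corpus by back-translating monolingual French text (sketch)

def translate_fr_to_en(sentence):
    # placeholder: call an existing NMT system here (a pre-trained model or an API)
    raise NotImplementedError

with open("mono.fr", encoding="utf-8") as f_fr, \
     open("synthetic-src-train.txt", "w", encoding="utf-8") as f_src, \
     open("synthetic-tgt-train.txt", "w", encoding="utf-8") as f_tgt:
    for fr in f_fr:
        fr = fr.strip()
        if not fr:
            continue
        en = translate_fr_to_en(fr)   # machine-translated (noisy) source side
        f_src.write(en + "\n")        # synthetic English source
        f_tgt.write(fr + "\n")        # genuine French target

The key point is that the target side stays human-written; only the source side is synthetic.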

By the way, the xml.gz.tmp format looks like a temporary archive. Did you lose your connection while downloading?


Yes, you can, but you have to add it in both languages. You can even create a new corpus for re-training. For example, I used TED Talks transcripts to create a new corpus for Hindi-English.

In your case, you need to find a corpus that is about the food domain, or at least related to it. My research has shown that even if you re-train with a closely associated domain corpus, you can obtain favourable in-domain results. One idea: try to scrape bilingual data from food-related apps like Yelp or Zomato and create a small corpus on your own, keeping the two languages aligned as sketched below.
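Whatever you collect, the important part is adding it to both sides at the same line positions. A minimal sketch, assuming the src-train.txt / tgt-train.txt names from above and a couple of illustrative English-French pairs:

# append in-domain sentence pairs to both the source and the target training files (sketch)
food_pairs = [
    ("Sliced potatoes", "Pommes de terre en tranches"),
    ("Lightly seasoned peanuts", "Cacahuètes légèrement assaisonnées"),
]

with open("src-train.txt", "a", encoding="utf-8") as f_src, \
     open("tgt-train.txt", "a", encoding="utf-8") as f_tgt:
    for en, fr in food_pairs:
        f_src.write(en + "\n")   # English side
        f_tgt.write(fr + "\n")   # French side, on the same line number as its English sentence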

There is a great package for getting aligned texts from OPUS: https://pypi.org/project/opustools-pkg/ . It first offers to download the .gz files, then parses them and produces two line-aligned files if you choose the Moses format. Here is an example command:

opus_read -d ParaCrawl -s en -t ru \
  -rd ~/corpus/paracrawl/ \
  -S 1 -T 1 -wm moses -ln -w c.clean.en c.clean.ru

It requires a lot of resources to process large files. For example, it fails if you attempt to download and process two datasets at a time. In my case, it also failed to process ParaCrawl while a model was training on the GPU.
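Once opus_read has written the two Moses-format files, it is worth a quick sanity check that they are still line-aligned. A small sketch, reusing the output names from the example command above:

# check that the two Moses-format output files have the same number of lines (sketch)
def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

src_lines = count_lines("c.clean.en")
tgt_lines = count_lines("c.clean.ru")
print(f"source: {src_lines} lines, target: {tgt_lines} lines")
assert src_lines == tgt_lines, "the files are not line-aligned"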


When you go to OPUS and search for English-French resources, you will see a table with all the resources in this language pair.
To download the parallel data, go for the links in the “Moses” column, where you will find the parallel aligned data.
If you prefer raw text, go for the links under the “raw” column and download the files for both languages.
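The Moses links give you a zip archive with one plain-text file per language. A quick way to peek at the first few sentence pairs after downloading one (a sketch only; the archive and member names below follow the usual OPUS naming but may differ for your corpus):

# extract a Moses-format zip from OPUS and print the first few aligned pairs (sketch)
import itertools
import zipfile

with zipfile.ZipFile("en-fr.txt.zip") as zf:   # example name for a Europarl Moses download
    zf.extractall("europarl-en-fr")

with open("europarl-en-fr/Europarl.en-fr.en", encoding="utf-8") as f_en, \
     open("europarl-en-fr/Europarl.en-fr.fr", encoding="utf-8") as f_fr:
    for en, fr in itertools.islice(zip(f_en, f_fr), 5):
        print(en.strip(), "|||", fr.strip())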
