Quickstart - Should the output from the quickstart guide be quite this inaccurate?

I have followed the nice quickstart guide successfully. The training created 13 demo-model .pt files. I have then tried to follow the next step in the quickstart guide (step 3), which is to translate the text in the sample file.

It says in the quickstart guide that the output will be very inaccurate, due to the training set being so small, but I was wondering if it should be quite this inaccurate? E.g. pretty much every single word is completely wrong.

For example:

SENT 2543: (u'Teams', u'and', u'individual', u'competitors', u'will', u'be', u'pitted', u'against', u'one', u'another', u'in', u'the', u'disciplines', u'of', u'tossing', u'the', u'caber', u',', u'horseshoe', u'tossing', u'and', u'bucket', u'carrying', u'.')
PRED 2543: In der Nähe des Europäischen Parlaments , die in der Nähe des Europäischen Parlaments .
PRED SCORE: -24.4164

Thanks for your time, and just to stress, this is a new area for me.

Hi!

I think so; I get a similar result:

SENT 2543: Teams and individual competitors will be pitted against one another in
the disciplines of tossing the caber , horseshoe tossing and bucket carrying .
PRED 2543: Mit der Nähe des Hotels und der Insel und der Stadt , die in der
Nähe des Hotels in der Nähe der Welt .

PRED SCORE: -59.86

But your first step is correct. This path (from source to translation) is often much trickier in other popular solutions.

If you dive a little bit into the forum you will be able to achieve much more “normal” results.

Hope this helps
have a nice day
miguel canals

Thanks for the reply. That is a relief; I thought I had done something wrong!

A couple of questions, if you don’t mind:

  1. Is there anything in particular you could point me towards in the forum so that I can learn more and work towards decent translations?

  2. Also, the training in the quickstart creates many different output files. Is there a reason for this? And how do I know which one to use if I should only use one?

Hi Matt,

You are “almost” done. :slight_smile:

Why don’t you have a look at the tutorial I mentioned in my previous post? If you do not want to use the same data (the WMT 15 NMT corpus), it is much more fun to use another one. If you already have a corpus, try to use it; if not, you can google and you will find many of them. Have a look at http://opus.nlpl.eu/ for some ideas. But there are many more.

From the point of view of a non-expert user, I think you need:

-Data (a corpus) to feed into OpenNMT.
-Prepare the data (tokenization) in order to feed OpenNMT properly (see the sketch after this list).
-Run OpenNMT with your corpus and satisfy its hardware/software requirements (GPU or not).
-Transform the resulting data back into user data (detokenization).
-Find out if the results are good enough for you (with some metric, or just your own judgment).
-Find ways to improve the results.
-Design a workflow for your translation.
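
For the tokenization, detokenization and metric steps, here is a minimal sketch of what I mean. It assumes the sacremoses and sacrebleu Python packages, which are not part of OpenNMT itself, just one possible choice, and the strings are placeholders:

```python
from sacremoses import MosesTokenizer, MosesDetokenizer
import sacrebleu

# Tokenize a source sentence before feeding it to OpenNMT.
tokenizer = MosesTokenizer(lang="en")
tokens = tokenizer.tokenize("Teams and individual competitors will be pitted against one another.")
print(" ".join(tokens))

# Detokenize the model output (space-separated tokens) back into normal text.
detokenizer = MosesDetokenizer(lang="de")
hypothesis = detokenizer.detokenize("In der Nähe des Europäischen Parlaments .".split())

# Score the output against a reference translation with a metric such as BLEU.
references = ["<your reference translation here>"]  # placeholder, use your real test set
bleu = sacrebleu.corpus_bleu([hypothesis], [references])
print(bleu.score)
```

With the tiny demo corpus the BLEU score will be very low, which matches the kind of output you are seeing.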

Usually it is not a straightforward process; it is more of a learning loop, repeating these steps, because each step you complete will raise questions you would like to answer. Sorry to say I am trapped in this loop right now!

If I am not wrong, OpenNMT iterates over the data in several passes (epochs). You should dive into the documentation. The loop continues until some criterion is reached (by default it runs 13 epochs). Each epoch saves a model that is a fully usable result for translation, which is why you end up with 13 demo-model files.
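
So the 13 demo-model .pt files are just the checkpoints from the 13 epochs. As a small sketch (assuming you run it in the directory that contains the checkpoints), you could pick the most recent one, which is normally the last epoch and the one to pass to translate.py with -model in step 3:

```python
import glob
import os

# The quickstart training leaves one demo-model*.pt checkpoint per epoch.
# The most recently written file is the final epoch's model, which is
# normally the one to use for translation (pass it to translate.py -model).
checkpoints = sorted(glob.glob("demo-model*.pt"), key=os.path.getmtime)
if checkpoints:
    print("Use this checkpoint for translation:", checkpoints[-1])
else:
    print("No demo-model*.pt files found in the current directory.")
```

If I am not wrong, the file names also contain the accuracy and perplexity reached at that epoch, so you can pick by those numbers instead of by modification time.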

Have a nice day!
miguel canals