Detokenization clarification

vince62s · January 10, 2017, 2:36pm

Just for the sake of clarification.
Reading the guide:
If you activate sep_annotate marker, the tokenization is reversible - just use:
th tools/detokenize.lua [-case_feature] < file.tok > file.detok

If sep_annotate has not been used, does detokenize.lua do nothing?

Can’t we have a “standard” behavior for detokenizing tokenized text?
Hope I’m clear.

One example:
input:
Aujourd ’ hui , Airbus appelle directement le public à l ’ avance de , où on attend de l ’ qu ’ elle domine plus 100 ordres .
output:
no change …
apostrophe remains isolated, commas and periods too.

jean.senellart · January 10, 2017, 2:54pm

Very clear! Yes, detokenize.lua does nothing without the annotation marker.

If you want a standard “detokenization”, you actually need some kind of linguistic knowledge, either in the tokenization or in the detokenization: for instance, that question marks need a space before them in French but not in English. This is the motivation for the reversible tokenization: you let the NN learn that for you, and it works quite well… One of the reasons for the improved performance is that you pass more information to the NN than with a classical tokenization: for instance, whether the ambiguous single apostrophe was free, stuck to the word on its left (probable right quote), stuck to the word on its right (probable left quote), or in between two words (probable apostrophe).
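To make the reversible idea concrete, here is a minimal Python sketch of a marker-driven detokenizer. It is not the actual tools/detokenize.lua: the marker character `￭` and the function name `detokenize` are assumptions chosen for illustration. A marker at the start of a token means "attach to the left", a marker at the end means "attach to the right".

```python
JOINER = "￭"  # assumed marker character, for illustration only


def detokenize(tokens):
    """Rebuild text from tokens annotated with a joiner marker."""
    out = []
    glue_next = False  # True when the previous token asked to attach to its right
    for tok in tokens:
        attach_left = tok.startswith(JOINER)
        attach_right = tok.endswith(JOINER)
        core = tok.strip(JOINER)
        # Insert a space unless either neighbour requested attachment.
        if out and not (attach_left or glue_next):
            out.append(" ")
        out.append(core)
        glue_next = attach_right
    return "".join(out)


annotated = "Aujourd ￭’￭ hui ￭, Airbus".split()
plain = "Aujourd ’ hui , Airbus".split()
print(detokenize(annotated))
print(detokenize(plain))
```

With annotated tokens the original spacing is recovered; without markers the function has nothing to act on and the text comes back unchanged, which matches the "no change" behavior reported above.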

However, you can still use the good old Moses tokenization/detokenization scheme, but all the tests we made show that it is not easy to outsmart the NN with a handful of not fully consistent tokenization rules.

vince62s · January 10, 2017, 3:00pm

understood, but then why is the -sep_annotate marker “optional” ?
I would have thought it has to be mandatory.

And can you please confirm that training and testing must use the same mode (the marker either present in both the training and the testing corpus, or absent from both)?

if so, I think the documentation needs to be more specific.

jean.senellart · January 10, 2017, 3:04pm

I left the -sep_annotate marker optional because this reversible tokenization is a small departure from the mainstream, and we wanted to avoid surprising people. But I understand it is confusing, and we should rather make it the default, otherwise detokenize.lua seems non-functional. I will make the change.
Yes, training and testing should be in the same mode, and I will add a note in the communication.

vince62s · January 10, 2017, 3:51pm

Another question then, related to this post: Training English-German WMT15 NMT engine

Since you seem to tokenize without the marker, you do not detokenize; so do you calculate all BLEU scores on tokenized text?

jean.senellart · January 10, 2017, 5:09pm

Yes, we always calculate BLEU on tokenized text. For the above-mentioned run, the score has to be calculated on the tokenized form since that is all we have. More generally, though, you need to go back to the detokenized state (in particular if you are using BPE, otherwise scores cannot even be compared) and then retokenize. I would advise that for our scores (the ones in the benchmark platform) we use the simplest (conservative, no annotation) tokenization, so that we stay as close as possible to standard tokenization and compatible with Moses-like tokenization (file > moses tok > onmt tok = file > onmt tok). Does this make sense?

vince62s · January 10, 2017, 5:17pm

I could be mistaken, but I really think that the BLEU scores reported in all competitions are detokenized scores.

When you look at the content of sgm test sets, you will clearly see that this is “normal text”.

The BLEU score on tokenized text is almost always higher.

Personally, I would prefer to stick to detokenized text, which is more natural.

@srush any insight ?

Edited:
look here: https://github.com/rsennrich/wmt16-scripts/blob/master/README.md
same thing…

jean.senellart · January 10, 2017, 9:10pm

OK, we are not talking about the same thing, so my answer was neither complete nor accurate. In all competitions, the submission is supposed to be detokenized/recased, but the score is calculated on tokenized text: the tokenization is generally integrated into the scorer; for instance, mteval-1.3 has its own tokenization function. Scripts like multi-bleu, however, expect tokenized ref/output.

For our recipes (which is what goes into the benchmark platform), let us use mteval-1.3, so we will indeed use the detokenized form.
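The difference between a scorer that tokenizes internally and one that expects pre-tokenized input can be sketched in a few lines of Python. This is only a toy: `simple_tokenize` is a hypothetical stand-in and does not reproduce mteval's real tokenization rules, and `ngram_precision` computes just the clipped n-gram precision that BLEU is built from, not a full BLEU score.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Toy tokenizer: split words from punctuation (not mteval's actual rules)."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_precision(hyp, ref, n=1):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(hyp_ngrams.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return matched / total


ref = "He said yes, today."
hyp = "He said no, today."

# Whitespace split only: punctuation stays glued to words.
raw = ngram_precision(hyp.split(), ref.split())
# Tokenized first: punctuation becomes separate, easily matched tokens.
tok = ngram_precision(simple_tokenize(hyp), simple_tokenize(ref))
print(raw, tok)
```

On this single made-up pair, the unigram precision rises from 3/4 to 5/6 once punctuation is split off, which illustrates why scores are only comparable when both sides go through the same tokenization.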

vince62s · January 11, 2017, 8:46am

Jean,
one last point.
If we go with mteval-1.3 then we need sgm files.
You uploaded plain text test sets on s3.
Would you prefer to change your upload, or should I script the txt-to-sgm conversion for mteval-1.3 usage?

EDIT: actually it does fly. I need a sgm file to wrap a text file.
My best guess is to use NIST sgm test set to start with, for newstest sets.

(I also use the plain text generic test along with the corpus to show in-domain results)

EDIT2: I think it is also better to keep the original name ie:
newstest2014-fren-src.en.sgm
newstest2014-fren-ref.fr.sgm
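For reference, the txt-to-sgm conversion mentioned above could look roughly like the Python sketch below. The element and attribute names (`refset`, `setid`, `srclang`, `trglang`, `doc`, `sysid`, `docid`, `seg`) are assumptions modeled on the layout of the WMT newstest sgm files and should be checked against a real file; `wrap_as_sgm` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape


def wrap_as_sgm(lines, root, setid, srclang, trglang=None, docid="doc1", sysid="ref"):
    """Wrap plain-text segments (one per line) in a minimal sgm skeleton."""
    attrs = f'setid="{setid}" srclang="{srclang}"'
    if trglang is not None:
        attrs += f' trglang="{trglang}"'
    out = [f"<{root} {attrs}>", f'<doc sysid="{sysid}" docid="{docid}">']
    for i, line in enumerate(lines, start=1):
        # Escape &, < and > so the text cannot break the markup.
        out.append(f'<seg id="{i}">{escape(line.strip())}</seg>')
    out += ["</doc>", f"</{root}>"]
    return "\n".join(out)


ref_sgm = wrap_as_sgm(
    ["Bonjour le monde .", "Tom & Jerry"],
    root="refset", setid="newstest2014", srclang="en", trglang="fr",
)
print(ref_sgm)
```

Putting all segments in a single document is a simplification; the original newstest files split segments across several documents.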

jean.senellart · January 11, 2017, 10:15am

Yes, we will keep the name. For the sgml, I am not sure: mteval supports text files, and I am not sure what the added value of going back to sgm would be.
Ideally, I would prefer to keep a text version so that we can also apply all the other metrics tools that are purely text-based. What do you think?

vince62s · January 11, 2017, 10:25am

OK with me.
My only reason for sgm was that the original file is sgm and it’s a requirement for WMT.
http://www.statmt.org/wmt16/translation-task.html

Since the base name for the corpus is “baseline-1M-xxyy”.tgz (maybe 2M later on),
I would suggest “testsets-xxyy”.tgz as the base name for the test sets.