Improvement of performance by data normalization

My underlying question was: how did you build your vocab in your experiments?

For example, I transform dates and URLs into $date$ and $url$ (a single vocab entry each)

Of course, but how is your vocab built when you don’t replace them?

I replace them with ‘$date$’ and ‘$url$’, so they become vocab entries

You said:

So, my question is: how did you build your vocab in the case where you DON’T replace them with a code? Did you put the numbers inside the vocab? If yes, you certainly got fewer real words in it.

Without normalization, the numbers and dates aren’t in the vocab; they are replaced by UNK

If the numbers are not in the vocab when not replaced, I think it’s quite equivalent to the case where you replace them with a code.

That’s similar to my experimental results, so I wonder whether it’s really important to normalize dates and numbers before building the vocab for NMT

Maybe someone has a better way to do normalization

You may try richer normalisation. For example, rather than replacing 123.45 with a single uninformative $num$, or UNK, replace it with 888.88, which keeps informative precision about the way it’s formatted.
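For illustration, here is a minimal sketch of that kind of digit masking; the regex-based function is just an example of mine, not something shipped with OpenNMT:

```python
import re

def mask_digits(text):
    """Replace every digit with 8 so numbers keep their shape
    (length, decimal point, date separators) while collapsing
    into a small, consistent set of tokens, e.g. 123.45 -> 888.88."""
    return re.sub(r"\d", "8", text)

print(mask_digits("Paid 123.45 on 27.10.2009"))
# -> Paid 888.88 on 88.88.8888
```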

Hi all,

to add (a bit late) to this thread: yes, entity normalisation is important, and even if you cannot expect a jump in your score, it will make a big difference for your users. To deal with that, we have introduced monolingual preprocessing hooks and protected sequences to handle entities seamlessly.

In short - the process is the following:

  • define a preprocessing hook that will locate your favorite entities and annotate them with protected sequence markers. Typically, for a URL:
check out http://myurl.com/1234!

transform that into:

check out ⦅URL:http://myurl.com/1234⦆!

Note that there are two fields in the protected sequence, separated by this strange character (it is not a plain colon):

  • the entity name URL
  • the actual value http://myurl.com/1234

This notation automatically turns the entity into a single ⦅URL⦆ vocab entry, while the second field (the actual value) is used during detokenization at inference time to substitute back the actual value.

Of course, you can also perform preprocessing outside of the OpenNMT code (i.e. without a hook), but defining it as a hook guarantees that inference and training are identical, and you don’t need to add an additional preprocessing layer in the inference code.
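As an illustration of doing that annotation outside OpenNMT, here is a minimal sketch assuming a deliberately simplistic URL regex; the marker characters mirror the examples above, and as noted the real field separator is a special character, so double-check the exact characters against the tokenization docs:

```python
import re

# Marker characters as shown in the examples above; verify the exact
# separator character against the OpenNMT tokenization documentation.
PS_OPEN, PS_SEP, PS_CLOSE = "\u2985", ":", "\u2986"   # (white parentheses)

# Deliberately simplistic URL pattern, for illustration only.
URL_RE = re.compile(r"https?://\S+[\w/]")

def annotate_urls(line):
    """Wrap each URL in a protected sequence of the form ⦅URL:<value>⦆."""
    return URL_RE.sub(
        lambda m: f"{PS_OPEN}URL{PS_SEP}{m.group(0)}{PS_CLOSE}", line)

print(annotate_urls("check out http://myurl.com/1234!"))
# -> check out ⦅URL:http://myurl.com/1234⦆!
```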

Check New `hook` mechanism for more details on hooks, and http://opennmt.net/OpenNMT/tools/tokenization/ for more details on protected sequences!


Hi @jean.senellart

Does the encoding with ⦅ - ⦆ have the same behavior in OpenNMT-py?

Hello @tyahmed, OpenNMT-py is agnostic regarding the tokenization, so you can use the same protection mechanism.
However, AFAIK the lexical constraint mechanism is not implemented there, so you will not benefit from placeholder uniqueness during generation, nor from the actual pairing of source and target entities.


Okay, thanks for the explanation. So the only way, with PyTorch, is to encode the named entities with placeholders, train the model, then post-process the data after translation? (Post-processing using attention weights to find the corresponding source token for each placeholder in the target.)
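For what it’s worth, here is a minimal sketch of that attention-based substitution, assuming you already have the tokenized source/target and the attention matrix for the sentence; the names, shapes, and toy data are assumptions, not an OpenNMT-py API:

```python
import numpy as np

def restore_placeholders(src_tokens, tgt_tokens, attn, values):
    """For each placeholder in the target, follow the attention argmax
    back to a source position and substitute the value recorded for
    that source placeholder during preprocessing.

    attn:   array of shape (len(tgt_tokens), len(src_tokens))
    values: dict mapping source positions to original strings,
            e.g. {2: "http://myurl.com/1234"}
    """
    out = []
    for t, tok in enumerate(tgt_tokens):
        if tok.startswith("\u2985") and tok.endswith("\u2986"):  # e.g. ⦅URL⦆
            s = int(np.argmax(attn[t]))     # most-attended source position
            out.append(values.get(s, tok))  # fall back to the placeholder
        else:
            out.append(tok)
    return " ".join(out)

# Toy example with a hand-made attention matrix:
src = ["check", "out", "\u2985URL\u2986", "!"]
tgt = ["regarde", "\u2985URL\u2986", "!"]
attn = np.array([[0.80, 0.10, 0.05, 0.05],
                 [0.05, 0.05, 0.85, 0.05],
                 [0.10, 0.10, 0.10, 0.70]])
print(restore_placeholders(src, tgt, attn, {2: "http://myurl.com/1234"}))
# -> regarde http://myurl.com/1234 !
```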

Hi @jean.senellart
Does the encoding with ⦅ - ⦆ also work in OpenNMT-tf?

Hi,
I have tried to protect some entities such as named entities and URLs with the method mentioned in this post, e.g. ⦅URL:http://myurl.com/1234⦆. The notation seems to work for the tokenization and BPE processes. However, the model tries to translate ⦅URL⦆ into a number. For instance,

check out ⦅URL:http://myurl.com/1234⦆! --> check out 1956 !

What I’d like is for the model to return ⦅URL⦆ so that I can perform some kind of post-processing and recover the original URL

Hello - this has been learned during training - so just make sure your training corpus is balanced after annotation and it should work fine!

Thanks for your answer @jean.senellart. However, I don’t think I really understand what you mean. What do you mean by a balanced corpus after annotation? Moreover, is the number I get normal, i.e. should I expect this behavior?

Hello, with placeholders, what the training sees are sentences like:

source blabla ⦅URL⦆ source blabla => target blabla
  • note that the actual value (the URL itself) is hidden from the model
  • now, if you have unbalanced sentences, where there is a ⦅URL⦆ in the source but not in the target (or not the same count), the NMT model has to learn to translate a placeholder into something else. For your training, that something happened to be a number, learned from such unbalanced sentences in your training data

So in other words:

  • preprocess your corpus to annotate with placeholders
  • filter out all sentence pairs where the placeholders don’t match one-to-one between source and target (see the sketch after this list)
  • train!
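For the filtering step, a minimal sketch; the Counter-based check is an illustration of mine, not an OpenNMT tool:

```python
import re
from collections import Counter

# Captures the entity name right after the opening marker, e.g. the
# "URL" in ⦅URL:http://myurl.com/1234⦆ (with or without a value field).
PLACEHOLDER_RE = re.compile("\u2985([A-Z]+)")

def balanced(src_line, tgt_line):
    """Keep a pair only if source and target contain the same
    placeholder types with the same counts."""
    return (Counter(PLACEHOLDER_RE.findall(src_line))
            == Counter(PLACEHOLDER_RE.findall(tgt_line)))

pairs = [
    ("go to \u2985URL:http://a.example\u2986 now",
     "va sur \u2985URL:http://a.example\u2986 maintenant"),
    ("see \u2985URL:http://b.example\u2986", "regarde ça"),  # unbalanced, dropped
]
kept = [p for p in pairs if balanced(*p)]
print(len(kept))  # -> 1
```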

Thanks! That’s awesome! That’s definitely the reason for my problem

Hi @jean.senellart, I am unable to use this hook approach. Could you please elaborate on it?
My problem is with sentences like below:
SRC: This Section was not in the Act as originally enacted, but came into force by virtue of an Amendment Act of 2009 with effect from 27.10.2009.

But the translation of the above sentence contains the date (27.10.2009) first, then the number 2009.
If I have multiple dates for instance, how will my placeholder know which date is to be replaced?
Kindly correct me wherever I am wrong