Improving performance with data normalization



For an SMT system, it’s important to normalize numbers, dates, times and URLs into a unique representation.

But for NMT, I trained models with and without data normalization between English and French, and I didn’t see a big improvement either in BLEU score or in real tests.

In live NMT systems (like Systran or Google), is data normalization still applied, or do they use some better way to do the normalization?
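For reference, here is a minimal Python sketch of the kind of normalization being discussed; the regexes and the $url / $date / $num placeholder spellings are illustrative choices on my part, not taken from any particular system:

```python
import re

# Illustrative patterns only; real systems use much more careful rules.
URL_RE = re.compile(r"https?://[\w./-]+")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def normalize(sentence):
    """Replace URLs, dates and numbers with unique placeholder tokens."""
    sentence = URL_RE.sub("$url", sentence)
    sentence = DATE_RE.sub("$date", sentence)
    sentence = NUM_RE.sub("$num", sentence)
    return sentence

print(normalize("Meeting on 12/05/2017, see http://example.com, room 42."))
# -> "Meeting on $date, see $url, room $num."
```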


(Etienne Monneret) #2

The main problem with NMT is the fact that the vocab is limited. If you don’t put numbers and dates in the vocab, it’s similar to having a normalisation, since they will be considered as unknown (a single <unk> token). If you do put numbers and dates in the vocab, and your text is full of them, you will use a large part of your vocab just for them, while a lot of real words will be considered as unknown.
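To make that trade-off concrete, here is a rough sketch of the usual frequency-cutoff way of building a fixed-size vocab (the 50,000 size and the <unk> spelling are just common conventions, not anything prescribed in this thread):

```python
from collections import Counter

def build_vocab(tokenized_sentences, size=50000):
    """Keep the `size` most frequent tokens; everything else will map to <unk>."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {"<unk>": 0}
    for tok, _ in counts.most_common(size - 1):
        vocab[tok] = len(vocab)
    return vocab

def to_ids(tokens, vocab):
    """Map a tokenized sentence to ids, sending out-of-vocab tokens to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
```

Every distinct literal number or date that makes the frequency cut takes a slot away from a real word; replacing them all by a single placeholder before counting frees those slots.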

(johnsmith) #3

Data normalization is very important as it makes data analysis very easy. But unfortunately, I am facing serious problems with it. If anyone knows the correct procedure for data normalisation, please share it with us here.



Thanks for your reply.

I agree with you: in NMT, dates, URLs and numbers should be considered as UNK to limit the size of the vocab.

My question is: could we get a big improvement by transforming these parts (for example, transforming every URL into $url, a single vocab entry)?

Because in my experiments there isn’t a great improvement; maybe the problem is my configuration or the way I process the normalization.

(Etienne Monneret) #5

My underlying question was: how did you build your vocab in your experiments?


For example, I transform dates and URLs into $date$ and $url$ (unique vocab entries).

(Etienne Monneret) #7

Of course, but how is your vocab built when you don’t replace them?


I replace them with ‘$date$’ and ‘$url$’, so they become vocab entries.

(Etienne Monneret) #9

You said:

So, my question is: how did you build your vocab in the case where you DON’T replace them by a code? Did you put the numbers inside the vocab? If yes, you certainly got fewer real words in it.


Without replacement, the numbers and dates aren’t in the vocab; they are replaced by UNK.

(Etienne Monneret) #12

If the numbers are not in the vocab when not replaced, I think it’s quite equivalent to the case where you replace them by a code.


That’s similar to my experimental results, so I suspect it’s not really important to normalize dates and numbers before building the vocab for NMT.

Maybe someone has a better way to do the normalization.

(Etienne Monneret) #14

You may try a richer normalisation. For example, rather than replacing 123.45 by a poor single $num$, or UNK, replace it by 888.88, which keeps informative precision about the way it’s formatted.
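A tiny sketch of that idea (the function name is mine, and the choice of 8 just follows the example above; any fixed digit would do): every digit is mapped to 8, so the length, decimal point and grouping of the original number survive, unlike with a single $num$ or UNK.

```python
import re

def mask_digits(text):
    """Replace every digit with 8, preserving formatting, e.g. 123.45 -> 888.88."""
    return re.sub(r"\d", "8", text)

print(mask_digits("Invoice 123.45 due 2017-06-01"))
# -> "Invoice 888.88 due 8888-88-88"
```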

(jean.senellart) #15

Hi all,

To add (a bit late) to this thread: yes, entity normalisation is important, and even if you cannot expect a jump in your score, it will make a big difference for your users. To deal with it, we have introduced monolingual preprocessing hooks and protected sequences that handle these entities seamlessly.

In short - the process is the following:

  • define a preprocessing hook that will locate your favorite entities and annotate them with protected sequence markers - typically, for a URL:

transform a sentence like “check out <url>!” into:

check out ⦅URL：<url>⦆!

Note that there are 2 fields in the protected sequence, separated by this strange character ： (a fullwidth colon, not the regular :):

  • the entity name URL
  • the actual value

This notation automatically turns the entity into a unique ⦅URL⦆ vocab entry, while the second field (the actual value) is used in detokenization at inference time to substitute the actual value back in.

Of course, you can also perform the preprocessing outside of the OpenNMT code (i.e. without a hook), but defining it as a hook guarantees that inference and training are identical, and you don’t need to add an additional preprocessing layer to the inference code.
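As a hedged illustration of that “outside the OpenNMT code” option, here is a minimal standalone sketch that wraps URLs in the protected-sequence markers described above and substitutes the values back after translation. The regex, the function names and the in-order substitution are my own simplifying assumptions; this is not the hook API itself.

```python
import re

URL_RE = re.compile(r"https?://[\w./-]+")

def protect_urls(sentence):
    """Annotate each URL as ⦅URL：value⦆ so the model only ever sees a ⦅URL⦆ token."""
    return URL_RE.sub(lambda m: "⦅URL：" + m.group(0) + "⦆", sentence)

def restore_urls(translated, source):
    """Substitute the source URLs back for each ⦅URL⦆ placeholder in the output,
    in order of appearance (a simplistic stand-in for proper detokenization)."""
    out = translated
    for value in URL_RE.findall(source):
        out = out.replace("⦅URL⦆", value, 1)
    return out
```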

Check out New `hook` mechanism for more details on hooks and on protected sequences!