-phrase_table option

Hi, As I have a Dutch-English/English-Dutch dictionary of some 300,000 words available, I am finding this feature an extremely useful way of avoiding OOVs. As Dutch has many compound nouns that translate into two or three tokens (e.g. systeemontwikkeling = system development), I have got round the single-token limitation by replacing the space with an underscore (i.e. system_development); the underscore is then removed in my client application.
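
In Python, the trick looks roughly like this (a minimal sketch; the helper names are made up, and the naive replace assumes underscores only occur in phrase-table replacements):

    def to_phrase_table_entry(source, target):
        """Join a multi-token target with underscores so it fits the
        single-token limit, e.g. 'system development' -> 'system_development'."""
        return "%s|||%s" % (source, target.replace(" ", "_"))

    def restore_spaces(translation):
        """Client-side cleanup: put the spaces back after translation."""
        return translation.replace("_", " ")

    print(to_phrase_table_entry("systeemontwikkeling", "system development"))
    # systeemontwikkeling|||system_development
    print(restore_spaces("They discussed the system_development plan ."))
    # They discussed the system development plan .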

I’m now having problems with this feature in release v0.7. In earlier releases it did its job and replaced a <unk> with the appropriate target token from the phrase table (with many language pairs). I noticed this issue today as I had added all the written forms of the numbers (English>Dutch) up to a hundred to my phrase table. For the sentence “This is better than fifty-one experts” I am now getting “Dit is beter dan dan deskundigen” (i.e. attention picks the nearest source word “than” and translates it as “dan”). Has anything changed?

The attention is not always reliable, so unless the translation result is different from what a previous version produced, it’s not really surprising.

I’m still puzzled by the apparently erratic behaviour of the phrase_table option. With a Malay-English model built with default settings, I take the word “woksyop”, which does NOT occur in the training data, and add it to my phrase table. It is then correctly translated as “workshop” at inference. However, with various English-Dutch models, English compound numbers like twenty-three which are NOT included in the Source Vocabulary but ARE included in the backoff dictionary are not being translated and are being replaced by the nearest source word. Any ideas? This is embarrassing when demonstrating this software to potential customers! Thanks.

The phrase table relies on the attention module that is learned during training, so errors are to be expected, just like other translation errors.

More generally, the phrase table is not the best way to handle numeric entities.

Hi Terence!

Have you tried building your models using any subword segmentation (BPE, Morfessor)?
I guess that with subword segmentation you will see the out-of-vocabulary problem reduced.
However, I think you should adapt your phrase table to the subword segmentation too, because the attention module (the one which generates the soft alignments) will then be working at the subword level.

Also, if you use that, don’t forget to include a little post-processing step to reattach the words. Notice that the input will have the form 'this is a sen* ten* ce pro* of' and the output will be something similar to, for instance, 'est* o es un* a fras* e de prueb* a' in Spanish.

Additionally, I would suggest implementing a slightly more sophisticated post-processing step to detect the weird words created when reconstructing the subword translation; maybe you can use an extended dictionary or a phrase table to do so.
Notice that the translation errors will now be made at the subword level. So, if you have as input 'this is a sen* ten* ce pro* of', the system can produce
'est* o es un* alg* un* a prueb* a'
which, reattached, results in 'esto es unalguna prueba',
producing the word 'unalguna', which does not exist in Spanish.
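
A minimal sketch of both steps, assuming the trailing-'*' convention from the examples above (the vocabulary set is a stand-in for a real dictionary lookup):

    def reattach(tokens):
        """['sen*', 'ten*', 'ce'] -> ['sentence']: a trailing '*' means
        the subword attaches to the following token."""
        words, buffer = [], ""
        for tok in tokens:
            if tok.endswith("*"):
                buffer += tok[:-1]
            else:
                words.append(buffer + tok)
                buffer = ""
        if buffer:                  # dangling subword at sentence end
            words.append(buffer)
        return words

    def flag_weird_words(words, known_words):
        """Return reattached words missing from the dictionary."""
        return [w for w in words if w.lower() not in known_words]

    words = reattach("est* o es un* alg* un* a prueb* a".split())
    print(words)                    # ['esto', 'es', 'unalguna', 'prueba']
    print(flag_weird_words(words, {"esto", "es", "una", "prueba"}))
    # ['unalguna']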

Regarding numbers, as @guillaumekln told you here:


you can pre-process the training data to use placeholders for numbers and, after decoding, just substitute the corresponding source numbers back in (this may be the simplest approach). Remember that if you use this technique, the input to your system must have the placeholders for numbers too.
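
A minimal sketch of the placeholder idea, for digit-based numbers only (written-out numbers like “fifty-one” would need their own patterns); the <num> tag and the assumption that the decoder keeps the placeholders in source order are simplifications:

    import re

    NUM = re.compile(r"\d+(?:[.,]\d+)*")

    def mask_numbers(sentence):
        """'Order 51 units' -> ('Order <num> units', ['51'])"""
        numbers = NUM.findall(sentence)
        return NUM.sub("<num>", sentence), numbers

    def unmask_numbers(translation, numbers):
        """Substitute the source numbers back in, first to last."""
        for n in numbers:
            translation = translation.replace("<num>", n, 1)
        return translation

    masked, nums = mask_numbers("Order 51 units for 12.50 euros")
    print(unmask_numbers("Bestel <num> eenheden voor <num> euro", nums))
    # Bestel 51 eenheden voor 12.50 euro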

I hope that helps you :-)

Hi Eva,
Thanks, there’s some useful stuff there :-). On the Dutch2English side I’ve introduced my own Splitter (a kind of subword segmentation) and that solves many of the OOV issues, including numerical entities. On the English2Dutch side I notice that if the number has been included in the Source Vocabulary it is translated, so “I have twenty-two friends” is handled correctly, but “I have thirty-two friends” is not (twenty-two being in the Vocabulary and thirty-two being OOV). I am not sure how subword segmentation would solve this, as Dutch needs to reverse the two numbers, i.e. two-and-thirty?
I have been thinking of tackling this as an in-domain training problem, creating a hundred source and target sentences with numerical entities from one to a hundred; a rough generation sketch follows. I’ll report back.
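Something like the following would generate the pairs (covering only the compound numbers 21-99; the sentence template and the simple Dutch trema rule are simplifications):

    EN_UNITS = ["one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine"]
    EN_TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty",
               60: "sixty", 70: "seventy", 80: "eighty", 90: "ninety"}
    NL_UNITS = ["een", "twee", "drie", "vier", "vijf",
                "zes", "zeven", "acht", "negen"]
    NL_TENS = {20: "twintig", 30: "dertig", 40: "veertig", 50: "vijftig",
               60: "zestig", 70: "zeventig", 80: "tachtig", 90: "negentig"}

    def en_number(n):
        return "%s-%s" % (EN_TENS[n - n % 10], EN_UNITS[n % 10 - 1])

    def nl_number(n):
        unit = NL_UNITS[n % 10 - 1]
        joiner = "ën" if unit.endswith("e") else "en"  # tweeëntwintig
        return unit + joiner + NL_TENS[n - n % 10]     # units first!

    pairs = [("I have %s friends ." % en_number(n),
              "Ik heb %s vrienden ." % nl_number(n))
             for n in range(21, 100) if n % 10 != 0]
    print(pairs[1])
    # ('I have twenty-two friends .', 'Ik heb tweeëntwintig vrienden .')
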
Terence

On the English2Dutch side I notice that if the number has been included in the Source Vocabulary it is translated, so “I have twenty-two friends” is handled correctly, but “I have thirty-two friends” is not (twenty-two being in the Vocabulary and thirty-two being OOV). I am not sure how subword segmentation would solve this, as Dutch needs to reverse the two numbers, i.e. two-and-thirty?

It is expected that the subword model will learn to split numbers like two-* and-* thirty, and that the NMT system will then learn to produce the entire number sequence correctly (the same way it has seen it in the training data), so I guess the NMT model will learn to produce the reversed translation you want for Dutch.

But, of course, it all depends on the training data. If your training data don’t contain these kinds of number sequences, neither the subword model nor the NMT system will learn to segment and translate the numbers in the way you want. In fact, I think this is the phenomenon you are observing with the translation of “twenty-two” and “thirty-two”. In the latter case, the system does not know how to handle the word (or the related subwords) because it has not seen them during training, so it handles it as an OOV using the attention information, which, as @guillaumekln said, can produce errors.

Good luck!

I’ve an interesting observation about the phrase table option for dealing with OOVs.
In my translation “The girl eats an apple” I see the model has learned to use “an” before “apple” instead of “a”, which is nearly always correct usage in English: a -> an before [aeiou].*, with very few exceptions.
“Apple” is “in vocabulary”, and indeed the noun phrase “an apple” occurs in the training data.
But I also have a translation “He suffers from a abscess”. The Dutch “etterbuil” is OOV, so the translation “abscess” was taken from the phrase table. The use of “a” before “abscess” is wrong. Am I therefore right to assume that the rules the model has learned are not being applied to a word taken from the phrase table to replace an OOV token?
I have built a server which sits between the client/plug-in and the rest_translation_server to deal with such minor issues, like compound splitting and numerical entity handling.
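
In outline the server does something like this (the URL, port and JSON shapes below are just an example; use whatever your rest_translation_server setup actually expects):

    import json
    import urllib.request

    SERVER = "http://localhost:7784/translator/translate"  # example URL

    def pre_process(text):
        # e.g. compound splitting, number handling; identity stand-in here
        return text

    def post_process(text):
        # e.g. remove the underscores from phrase-table replacements
        return text.replace("_", " ")

    def translate(text):
        payload = json.dumps([{"src": pre_process(text)}]).encode("utf-8")
        request = urllib.request.Request(
            SERVER, payload, {"Content-Type": "application/json"})
        with urllib.request.urlopen(request) as response:
            result = json.loads(response.read().decode("utf-8"))
        return post_process(result[0][0]["tgt"])  # example response shape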

The network isn’t learning such an “a -> an before [aeiou].*” rule. The first main reason is that, internally, words are vector encoded (embeddings) and not used in their alphabetical forms. You could have any text/code in place of a word; it would be exactly the same in the way the network learns the sentences. For the network, each word is only, at best, the number it gets in the dict files.

This said, unknown words are replaced afterwards, when the translation is completely done; the phrase table isn’t used during the translation process itself (a sketch follows the list). This replacement is only possible if:

  1. the translator really puts an unknown word in the resulting translation. In fact, even with unknown words in the input, it can produce only known words in its translation. Think of a simple fact: the known words can differ between the input and output languages.
  2. the internal attention module properly associates the unknown word in the output with the right word in the input
  3. this input word is in the phrase table
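
Conceptually, the replacement step looks something like this (an illustration only, not the actual OpenNMT code):

    def replace_unks(tgt_tokens, src_tokens, attention, phrase_table):
        """attention[i] holds one weight per source token for target
        position i (condition 2); phrase_table maps source words to
        target words (condition 3)."""
        out = []
        for i, tok in enumerate(tgt_tokens):
            if tok != "<unk>":
                out.append(tok)
                continue
            best = max(range(len(src_tokens)),
                       key=lambda j: attention[i][j])
            src_word = src_tokens[best]
            # fall back to copying the source word verbatim
            out.append(phrase_table.get(src_word, src_word))
        return out

    print(replace_unks(
        ["Dit", "is", "beter", "dan", "<unk>", "deskundigen"],
        ["This", "is", "better", "than", "fifty-one", "experts"],
        {4: [0.02, 0.02, 0.03, 0.08, 0.8, 0.05]},
        {"fifty-one": "eenenvijftig"}))
    # ['Dit', 'is', 'beter', 'dan', 'eenenvijftig', 'deskundigen']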

As “etterbuil” is OOV, the model generates “a <unk>” because it is more likely to be correct than “an <unk>” given the information available to the model.

With subword tokenization like BPE, this issue would certainly not appear.

This is quite fascinating. If I request “De etterbuil van mijn vader is groot”, I get
“My father’s abscess is big” and NOT “The abscess of my father is big.” So the network has cleverly learned how to form the English possessive and applies it to what is then an unknown token.

It didn’t really learn how to build a possessive; it’s more ‘basic’ than that. It rather learned that such a kind of token arrangement (possibly including an unk token) must be transformed into this other kind of token arrangement (possibly including an unk token), because, statistically, that is how it was done in the examples.
:-)

I’m aware the network doesn’t “learn grammar”; I was just expressing what “appears” to be happening :-)

It would have been great if this fact were put in some documentation. From the documentation I got the understanding that OOVs are checked for during the translation process. But as mentioned here, the phrase table is scanned only when the model, after attention, gives an <unk>. I am still figuring out how to get this done during the translation process itself. Any pointers on where I should look in the code would be helpful.

Hi Guillaume,
I am somehow not able to use the -phrase_table functionality and thus need help with it.
Let us say there is a word in my test file, ‘srijan’, which is not in the training/validation dataset.

My phrase table:
srijan ||| सृजन
srijan . ||| सृजन .
srijan .|||सृजन .
Srijan|||सृजन

My src-test file:
srijan
Srijan .
Srijan .
Hello .
home ground

Command used:
th translate.lua -replace_unk true -phrase_table data/PhraseTable.txt -model model_epoch50_22.66.t7 -src data/src-test1.txt -output pred1.txt

Predicted word:
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।

It seems the model is somehow not able to look in the phrase table and is instead giving the output ‘हैदराबाद ।’, which probably got the highest attention.
I want the translated output to be ‘सृजन’ instead.

Any help would be highly appreciated.
Thanks in advance.

Hi Vermasrijan,
I have just tried this again. In my Indonesian-English model I type “Saya makan alpukat di restoran”. I know that “alpukat” is - strangely - not in the training data. The model gives me “I eat avocado in the restaurant”, taking the translation of “avocado” from the phrase table. My only difference with your command is that I put the -phrase_table option before -replace_unk. Did you try your test with slightly longer sentences?
Terence

Hi @vermasrijan!

I would check your phrase table entries.
If you look at the OpenNMT-lua documentation you’ll see that each entry in the phrase table should be:

source|||target

where source and target are case sensitive and single tokens.
Thus, for instance, your entry srijan ||| सृजन should be srijan|||सृजन (with no spaces around the separator).

So, in your example, using your phrase table as it is, only the word form Srijan would be translated properly (as you want).
But keep in mind that the phrase table is applied to the source word retrieved with the highest attention probability, so it could be that the source word you want to handle in a particular sentence doesn’t match the one considered by the model.
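
A quick way to sanity-check a phrase table file (an illustrative sketch only):

    def check_phrase_table(path):
        """Strip spaces around ||| and flag multi-token entries."""
        fixed = []
        for line in open(path, encoding="utf-8"):
            src, _, tgt = line.rstrip("\n").partition("|||")
            src, tgt = src.strip(), tgt.strip()
            if " " in src or " " in tgt:
                print("multi-token entry (unsupported):", line.strip())
                continue
            fixed.append("%s|||%s" % (src, tgt))
        return fixed

    # 'srijan ||| सृजन'    becomes 'srijan|||सृजन'
    # 'srijan . ||| सृजन .' is flagged: each side must be a single token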

Best,
Eva

Thanks for replying @emartinezVic.
@emartinezVic, I was wondering if there is a way in which I would be able to look in the phrase_table directly, bypassing the attention mechanism (for the OOV words in my test data).
For example: let us say I will have some N new words (OOV words) every 3 months (these words will not have been seen with the training set). I want to make a phrase table for these N words so that I can directly look in the table to get their translation.
Basically, I am just looking for a way to deal with OOV words.
I thought of 3 approaches to deal with this (a sketch for the first and third follows this list):
First, to use the phrase table which OpenNMT provides.
Second, to capture all the OOV words from my test data, send them to some other translator (like Google Translate, since GNMT has been trained on billions of examples) and then get their corresponding translations.
Third, if an OOV word is in my test set, somehow find a way to copy the exact source OOV word into my target translation (the position of the OOV word in my translation should be structurally correct with respect to the rest of the words in the sentence).
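
For the first and third approaches, I imagine something like this (the file names, the dict-file layout with the token in the first column, and the known_translations dict are placeholders):

    def find_oovs(test_path, vocab):
        """Collect test-set words that are missing from the vocabulary."""
        oovs = set()
        for line in open(test_path, encoding="utf-8"):
            oovs.update(w for w in line.split() if w not in vocab)
        return oovs

    def write_phrase_table(oovs, known_translations, out_path):
        """Emit source|||target entries for the OOVs we can translate;
        -replace_unk copies the remaining ones verbatim."""
        with open(out_path, "w", encoding="utf-8") as out:
            for word in sorted(oovs):
                if word in known_translations:
                    out.write("%s|||%s\n" % (word, known_translations[word]))

    vocab = {line.split()[0] for line in open("src.dict", encoding="utf-8")
             if line.strip()}
    oovs = find_oovs("src-test1.txt", vocab)
    write_phrase_table(oovs, {"srijan": "सृजन"}, "PhraseTable.txt")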

Having insights on any of these three approaches would be really helpful to me.

Thanks in advance!
Srijan