In my recent experience with OpenNMT, the -phrase_table option did not replace all OOV words. Some OOV words were simply lost (often when the first word of the sentence was OOV), and the option cannot output the OOV word directly, so I gave up using this feature.
I think that until OOV words can be replaced exactly or output directly, this feature cannot really be used.
Hi. As I have a Dutch-English/English-Dutch dictionary of some 300,000 words available, I am finding this feature an extremely useful way of avoiding OOVs. As Dutch has many compound nouns that translate into two or three tokens (e.g. systeemontwikkeling = system development), I have got round the single-token limitation by replacing the space with an underscore (e.g. system_development); the underscore is then removed in my client application.
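The underscore workaround described above could be sketched roughly as follows. This is only an illustration: the helper names are hypothetical, and the `source|||target` line layout is an assumption about the phrase table file, not something confirmed in this thread.

```python
def to_phrase_table_line(source, target):
    """Format one entry, joining a multi-word target with underscores.

    Assumes a one-entry-per-line 'source|||target' layout (an assumption).
    """
    return f"{source}|||{'_'.join(target.split())}"


def restore_spaces(translation, joined_targets):
    """Turn underscores back into spaces, but only in tokens that came
    from the phrase table, so genuine underscores elsewhere survive."""
    return " ".join(
        tok.replace("_", " ") if tok in joined_targets else tok
        for tok in translation.split()
    )


line = to_phrase_table_line("systeemontwikkeling", "system development")
print(line)  # systeemontwikkeling|||system_development
print(restore_spaces("the system_development is slow", {"system_development"}))
# the system development is slow
```

Restricting the replacement to known phrase table targets is a design choice; a client that never sees underscores in normal text could simply replace all of them.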
I’m now having problems with this feature in release v0.7. In earlier releases it did its job and replaced an <unk> with the appropriate target token from the phrase table (with many language pairs). I noticed this issue today, as I had added all the written forms of the numbers (English>Dutch) up to a hundred to my phrase table. For the sentence “This is better than fifty-one experts” I am now getting “Dit is beter dan dan deskundigen” (i.e. attention picks the nearest source word “than” and translates it as “dan”). Has anything changed?
I’m still puzzled by the apparently erratic behaviour of the phrase_table option. With a Malay-English model built with default settings, I take the word “woksyop”, which does NOT occur in the training data, and add it to my phrase table. It is then correctly translated as “workshop” at inference. However, with various English-Dutch models, English compound numbers like twenty-three, which are NOT included in the Source Vocabulary but ARE included in the backoff dictionary, are not being translated and are instead replaced by the translation of the nearest source word. Any ideas? This is embarrassing when demonstrating this software to potential customers! Thanks.
Have you tried building your models with a subword segmentation (BPE, Morfessor)?
I guess that with it you will see the out-of-vocabulary problem reduced.
However, I think you should adapt your phrase table to the subword segmentation too, because the attention module (the one that generates the soft alignments) will now be working at the subword level.
Also, if you use that, don’t forget to include a small post-processing step to reattach the words. Notice that the input will have the form ‘this is a sen* ten* ce pro* of’
and the output will be something similar to, for instance, ‘est* o es un* a fras* e de prueb* a’ in Spanish.
Additionally, I would suggest implementing a slightly more sophisticated post-processing step to detect the odd words that can be created when reconstructing the subword translation; maybe you can use an extended dictionary or a phrase table to do so.
Notice that translation errors will now be made at the subword level. So, if you have as input ‘this is a sen* ten* ce pro* of’, the system can produce ‘est* o es un* alg* un* a prueb* a’
which, once reattached, results in ‘esto es unalguna prueba’,
producing the word ‘unalguna’, which does not exist in Spanish.
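The reattachment and the dictionary check suggested above might look something like this minimal sketch. It assumes the trailing `*` marker used in the examples in this post; real subword tools use their own markers, and both function names and the tiny lexicon are made up for illustration.

```python
def reattach(tokens, marker="*"):
    """Glue subword pieces back together; a piece ending in the marker
    continues into the next token."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith(marker):
            current += tok[: -len(marker)]
        else:
            words.append(current + tok)
            current = ""
    if current:  # dangling piece at the end of the sentence
        words.append(current)
    return words


def flag_unknown(words, lexicon):
    """Return reconstructed words that are missing from a target-side lexicon."""
    return [w for w in words if w.lower() not in lexicon]


words = reattach("est* o es un* alg* un* a prueb* a".split())
print(words)  # ['esto', 'es', 'unalguna', 'prueba']
print(flag_unknown(words, {"esto", "es", "una", "alguna", "prueba"}))
# ['unalguna']
```

What to do with a flagged word (drop it, back off to the source word, look it up in a phrase table) is left open here, as it depends on the application.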
Alternatively, you can pre-process the training data to use placeholders for numbers and, after decoding, just replace them with the corresponding source numbers (this may be the simplest approach). Remember that if you use this technique, the input to your system must contain the placeholders for numbers too.
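As a rough sketch of the placeholder idea, assuming numbers written as digits and a made-up placeholder format `__NUM0__`, `__NUM1__`, ... (neither is prescribed anywhere in this thread, and the tokenizer would have to keep each placeholder as a single token):

```python
import re

# Digits with optional internal separators, e.g. 32, 1,000 or 3.14
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
PLACEHOLDER = re.compile(r"__NUM(\d+)__")


def mask_numbers(sentence):
    """Replace each number with an indexed placeholder; return the masked
    sentence and the list of original numbers."""
    numbers = []

    def repl(match):
        numbers.append(match.group(0))
        return f"__NUM{len(numbers) - 1}__"

    return NUMBER.sub(repl, sentence), numbers


def unmask_numbers(translation, numbers):
    """Put the source numbers back; unknown indices are left untouched."""

    def repl(match):
        index = int(match.group(1))
        return numbers[index] if index < len(numbers) else match.group(0)

    return PLACEHOLDER.sub(repl, translation)


masked, nums = mask_numbers("I have 32 friends and 2 cats")
print(masked)  # I have __NUM0__ friends and __NUM1__ cats
print(unmask_numbers("Ik heb __NUM0__ vrienden en __NUM1__ katten", nums))
# Ik heb 32 vrienden en 2 katten
```

The same masking has to be applied to both sides of the training data. Spelled-out numbers such as “thirty-two” are not covered by this sketch and would need their own recognizer.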
Thanks, there’s some useful stuff there :-). On the Dutch2English side I’ve introduced my own Splitter (a kind of subword segmentation) and that solves many of the OOV issues including numerical entities. On the English2Dutch side I notice that if the number has been included in the Source Vocabulary it is translated, so “I have twenty-two friends” is handled correctly, but “I have thirty-two friends” is not (twenty-two being in the Vocabulary and thirty-two being OOV). I am not sure how subword segmentation would solve this as Dutch needs to reverse the two numbers, i.e. two-and-thirty?
I have been thinking of tackling this problem as an in-domain training problem and creating a hundred source & target sentences with numerical entities from one to a hundred. I’ll report back.
On the English2Dutch side I notice that if the number has been included in the Source Vocabulary it is translated, so “I have twenty-two friends” is handled correctly, but “I have thirty-two friends” is not (twenty-two being in the Vocabulary and thirty-two being OOV). I am not sure how subword segmentation would solve this as Dutch needs to reverse the two numbers, i.e. two-and-thirty?
The expectation is that the subword model will learn to split numbers into pieces like two-* and-* thirty, and that the NMT system will then learn to produce the entire number sequence correctly (the same way it has seen it in the training data), so I guess the NMT model will learn to produce the reversed order you want for Dutch.
But, of course, it all depends on the training data. If your training data does not contain this kind of number sequence, neither the subword model nor the NMT system will learn to segment and translate the numbers the way you want. In fact, I think this is the phenomenon you are observing with the translation of “twenty-two” and “thirty-two”. In the latter case, the system does not know how to handle the word (or the related subwords) because it has not seen them during training, so it is handled as an OOV using the attention information, which, as @guillaumekln said, can produce errors.
I’ve an interesting observation about the phrase table option for dealing with OOVs.
In my translation “The girl eats an apple” I see the model has learned to use “an” before apple instead of “a”, which is nearly always correct usage in English: a -> an before [aeiou].* with a very few exceptions.
“Apple” is “in vocabulary” and indeed the noun phrase “an apple” occurs in the training data.
But I also have a translation “He suffers from a abscess”. The Dutch “etterbuil” is OOV so the translation “abscess” was taken from the phrase table. The use of “a” before “abscess” is wrong. Am I therefore right to assume that the rules the model has learned are not being applied to a word taken from the phrase table to replace an OOV token?
I have built a server which sits between the client/plug-in and the rest_translation_server to deal with such minor issues, like compound splitting and numerical entity handling.
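For what it’s worth, a purely heuristic fix for the “a abscess” case could be applied in such an intermediate layer. The sketch below is only a spelling-based rule with a hypothetical function name, not something OpenNMT provides, and it gets sound-based exceptions wrong (it would turn “a university” into “an university” and leave “a hour” alone).

```python
import re

# "a" or "A" as a whole word, followed by a space and a word starting with a vowel letter
ARTICLE = re.compile(r"\b([Aa]) (?=[aeiouAEIOU])")


def fix_indefinite_article(text):
    """Rewrite 'a' as 'an' before a vowel letter (spelling-based heuristic)."""
    return ARTICLE.sub(lambda m: m.group(1) + "n ", text)


print(fix_indefinite_article("He suffers from a abscess"))
# He suffers from an abscess
```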
The network isn’t learning a rule like “a -> an before [aeiou].*”. The main reason is that, internally, words are encoded as vectors (embeddings) and are not used in their written form. You could put any text or code in place of a word and it would make no difference to the way the network learns the sentences. For the network, each word is, at best, only the number it gets in the dict files.
That said, unknown words are replaced afterwards, once the translation is completely done. The phrase table isn’t used during the translation process itself. This replacement is only possible if:
the translator really put an unknown word in the resulting translation. In fact, even with unknown words in the input, it can produce a translation made only of known words. Keep in mind that the sets of known words differ between the input and output languages.
the internal attention module properly associates the unknown word in the output with the right word in the input.
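To make the two conditions concrete, here is a minimal sketch of this kind of post-hoc replacement. It is not OpenNMT’s actual code: the function name, the `<unk>` token string, the attention matrix layout (one row of source weights per output token) and the fallback of copying the source word when it is missing from the phrase table are all assumptions made for illustration.

```python
def replace_unknowns(source, output, attention, phrase_table, unk="<unk>"):
    """Replace each unknown output token after decoding.

    attention[i][j] is the weight of source token j when output token i
    was produced (assumed layout).
    """
    result = []
    for i, token in enumerate(output):
        if token != unk:
            result.append(token)  # known words are never touched
            continue
        weights = attention[i]
        # Source position with the highest attention weight
        j = max(range(len(source)), key=weights.__getitem__)
        src_word = source[j]
        # Phrase table lookup, falling back to copying the source word
        result.append(phrase_table.get(src_word, src_word))
    return result


source = ["Hij", "lijdt", "aan", "een", "etterbuil"]
output = ["He", "suffers", "from", "a", "<unk>"]
attention = [
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.05, 0.80, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.80, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.80, 0.05],
    [0.05, 0.05, 0.10, 0.10, 0.70],
]
print(replace_unknowns(source, output, attention, {"etterbuil": "abscess"}))
# ['He', 'suffers', 'from', 'a', 'abscess']
```

In this sketch the lookup happens only for positions where the decoder emitted the unknown token, and the result is only as good as the attention row for that position: if the highest weight falls on a neighbouring source word, that word is looked up or copied instead.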
This is quite fascinating. If I request “De etterbuil van mijn vader is groot”, I get
"My father’s abscess is big" and NOT “The abscess of my father is big.” So the network has cleverly learned how to form the English possessive and applies it to what is then an unknown token.
It didn’t really learn how to build a possessive. It’s more ‘basic’ than that. Rather, it learned that this kind of token arrangement (possibly including an unk token) must be transformed into that other kind of token arrangement (possibly including an unk token), because statistically that is how it was done in the examples.
It would have been great if this fact had been put in the documentation. From the documentation I got the understanding that OOVs are checked for during the translation process. But as mentioned here, the phrase table is consulted only after the model has produced an UNK, using the attention. I am still figuring out how to get this done during the translation process itself. Any pointers on where I should look in the code would be helpful.