It is useful for translating OOV words (for example, domain terminology). Feel free to describe and/or implement a better unknown-word replacement mechanism.
In my recent practice with OpenNMT, the -phrase_table option did not replace all OOV words: an OOV word was often lost (for example when the first word of the sentence was OOV), and the OOV word cannot be output directly, so I gave up using this feature.
I think that until OOV words can be output exactly or directly, this feature cannot really be used.
Hi, as I have a Dutch-English/English-Dutch dictionary of some 300,000 words available, I am finding this feature an extremely useful way of avoiding OOVs. As Dutch has many compound nouns that translate into two or three tokens (e.g. systeemontwikkeling = system development), I have got round the single-token limitation by replacing the space with an underscore (i.e. system_development); the underscore is then removed in my client application.
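A minimal sketch of that underscore trick, assuming the phrase table simply stores multi-word targets joined by "_" and the client strips it again afterwards:

```python
# Minimal sketch of the underscore trick described above: multi-word phrase-table
# targets are stored as a single token, and the client restores the space afterwards.

def join_target_phrase(phrase: str) -> str:
    """Turn a multi-word dictionary entry into a single token for the phrase table."""
    return "_".join(phrase.split())

def restore_compounds(translation: str) -> str:
    """Undo the underscore trick in the translated output."""
    return translation.replace("_", " ")

print(join_target_phrase("system development"))          # system_development
print(restore_compounds("the system_development team"))  # the system development team
```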
I'm now having problems with this feature in release v0.7. In earlier releases it did its job and replaced an unknown token with the appropriate target token from the phrase table (with many language pairs). I noticed this issue today as I had added all the written forms of the numbers (English>Dutch) up to a hundred to my phrase table. For the sentence "This is better than fifty-one experts" I am now getting "Dit is beter dan dan deskundigen" (i.e. attention picks the nearest source word "than" and translates it as "dan"). Has anything changed?
I'm still puzzled by the behaviour of the apparently erratic phrase_table option. With a Malay-English model built with default settings, I take the word "woksyop", which does NOT occur in the training data, and add it to my phrase table. It is then correctly translated as "workshop" at inference. However, with various English-Dutch models, English compound numbers like twenty-three which are NOT included in the source vocabulary but ARE included in the backoff dictionary are not being translated and are being replaced by the nearest source word. Any ideas? This is embarrassing when demonstrating this software to potential customers! Thanks.
Have you tried building your models with some subword segmentation (BPE, Morfessor)?
I guess that by using it you will see the out-of-vocabulary problem reduced.
However, I think you should adapt your phrase table to the subword segmentation too, because the attention module (the one which is generating the soft alignments) will now be working at the subword level.
Also, if you use that, don't forget to include a little post-process step to reattach the words. Notice that the input will have the form "this is a sen* ten* ce pro* of"
and the output will be something similar to, for instance, "est* o es un* a fras* e de prueb* a" in Spanish.
Additionally, I would suggest implementing a slightly more sophisticated post-process to detect the weird words created when reconstructing the subword translation; maybe you can use an extended dictionary or a phrase table to do so.
Notice that now the translation errors will be made at the subword level. So, if you have as input "this is a sen* ten* ce pro* of", the system can produce "est* o es un* alg* un* a prueb* a",
which reattached results in "esto es unalguna prueba",
producing the word "unalguna", which does not exist in Spanish.
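A minimal sketch of that reattachment post-process, assuming the "*" joiner convention used in the example above; the dictionary check for odd reconstructed words is purely illustrative:

```python
# Minimal sketch of the subword reattachment post-process described above, assuming
# a trailing "*" joiner ("sen* ten* ce" -> "sentence"). The dictionary check is only
# an illustration of how "weird" reconstructed words could be flagged.

def reattach(tokens, joiner="*"):
    """Glue subword pieces back into full words."""
    words, current = [], ""
    for tok in tokens:
        if tok.endswith(joiner):
            current += tok[:-len(joiner)]
        else:
            words.append(current + tok)
            current = ""
    if current:                      # dangling joiner at the end of the sentence
        words.append(current)
    return words

def flag_unknown(words, dictionary):
    """Return reconstructed words that are not in an (assumed) target dictionary."""
    return [w for w in words if w not in dictionary]

tokens = "est* o es un* alg* un* a prueb* a".split()
words = reattach(tokens)
print(" ".join(words))                                       # esto es unalguna prueba
print(flag_unknown(words, {"esto", "es", "una", "prueba"}))  # ['unalguna']
```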
Regarding numbers, as @guillaumekln told you here:
you can pre-process the training data to use placeholders for numbers and, after decoding, just substitute them with the corresponding source number (this may be the simplest approach). Remember that if you use this technique, the input to your system must contain the placeholders for numbers too.
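A minimal sketch of that placeholder technique; the "<num>" token and the digit-matching regex are assumptions (written-out numbers like "fifty-one" would need their own mapping):

```python
# Minimal sketch of the number-placeholder technique described above. The "<num>"
# placeholder and the regex are assumptions, not an OpenNMT convention.
import re

NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def encode_numbers(sentence):
    """Replace numbers with placeholders; keep the originals to restore after decoding."""
    numbers = NUM_RE.findall(sentence)
    return NUM_RE.sub("<num>", sentence), numbers

def decode_numbers(translation, numbers):
    """Substitute each placeholder in the output with the corresponding source number."""
    for n in numbers:
        translation = translation.replace("<num>", n, 1)
    return translation

src, nums = encode_numbers("The invoice totals 1,250 euros for 3 items.")
print(src)   # The invoice totals <num> euros for <num> items.
print(decode_numbers("De factuur bedraagt <num> euro voor <num> artikelen.", nums))
```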
Hi Eva,
Thanks, there's some useful stuff there :-). On the Dutch2English side I've introduced my own Splitter (a kind of subword segmentation) and that solves many of the OOV issues, including numerical entities. On the English2Dutch side I notice that if the number has been included in the source vocabulary it is translated, so "I have twenty-two friends" is handled correctly, but "I have thirty-two friends" is not (twenty-two being in the vocabulary and thirty-two being OOV). I am not sure how subword segmentation would solve this, as Dutch needs to reverse the two numbers, i.e. two-and-thirty?
I have been thinking of tackling this as an in-domain training problem and creating a hundred source & target sentences with numerical entities from one to a hundred. I'll report back.
Terence
On the English2Dutch side I notice that if the number has been included in the source vocabulary it is translated, so "I have twenty-two friends" is handled correctly, but "I have thirty-two friends" is not (twenty-two being in the vocabulary and thirty-two being OOV). I am not sure how subword segmentation would solve this, as Dutch needs to reverse the two numbers, i.e. two-and-thirty?
It is expected that the subword model will learn to split numbers like two-* and-* thirty, and the NMT system will then learn to produce the entire number sequence correctly (the same way it has seen it in the training data), so I guess the NMT model will learn to produce the reversed translation you want for Dutch.
But, of course, it all depends on the training data. If your training data does not contain this kind of number sequence, neither the subword model nor the NMT system will learn to segment and translate the numbers in the way you want. In fact, I think this is the phenomenon you are observing with the translation of "twenty-two" and "thirty-two". In the latter case, the system does not know how to handle the word (or the related subwords) because it has not seen them during training, so it is left to be handled as an OOV using the attention information, which, as @guillaumekln said, can produce errors.
I've an interesting observation about the phrase table option for dealing with OOVs.
In my translation "The girl eats an apple" I see the model has learned to use "an" before "apple" instead of "a", which is nearly always correct usage in English: a -> an before [aeiou].*, with very few exceptions.
"Apple" is in vocabulary, and indeed the noun phrase "an apple" occurs in the training data.
But I also have a translation "He suffers from a abscess". The Dutch "etterbuil" is OOV, so the translation "abscess" was taken from the phrase table. The use of "a" before "abscess" is wrong. Am I therefore right to assume that the rules the model has learned are not being applied to a word taken from the phrase table to replace an OOV token?
I have built a server which sits between the client/plug-in and the rest_translation_server to deal with such minor issues, like compound splitting and numerical entity handling.
The network isn't learning such an "a -> an before [aeiou].*" rule. The first main reason is that, internally, words are vector encoded (embeddings) and not used in their alphabetical form. You could have any text/code in place of a word; it would be exactly the same in the way the network learns the sentences. For the network, each word is only, at best, the number it gets in the dict files.
This said, unknown words are replaced afterwards, when the translation is completely done. The phrase table isn't used during the translation process itself (see the sketch after this list). This replacement is only possible if:
- the translator really puts an unknown word in the resulting translation. In fact, even with unknown words in the input, it can produce only known words in its translation. Think of a simple fact: the known words can be different between the input and output languages.
- the internal attention module is properly associating the unknown word in the output with the right word in the input.
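A minimal sketch of what that replacement step amounts to, assuming access to the soft-alignment (attention) matrix and a source-to-target phrase table; the data structures are illustrative, not the actual OpenNMT internals:

```python
# Minimal sketch of post-translation unknown-word replacement as described above.
# For each <unk> in the hypothesis, look at the most-attended source token and
# replace it with its phrase-table translation (or copy it verbatim).

def replace_unknowns(src_tokens, hyp_tokens, attention, phrase_table):
    out = []
    for t, tok in enumerate(hyp_tokens):
        if tok == "<unk>":
            # index of the source token with the highest attention weight at step t
            s = max(range(len(src_tokens)), key=lambda i: attention[t][i])
            src_word = src_tokens[s]
            out.append(phrase_table.get(src_word, src_word))
        else:
            out.append(tok)
    return out

src = ["De", "etterbuil", "is", "groot"]
hyp = ["The", "<unk>", "is", "big"]
att = [[0.90, 0.05, 0.03, 0.02],
       [0.10, 0.80, 0.05, 0.05],
       [0.05, 0.05, 0.85, 0.05],
       [0.02, 0.03, 0.05, 0.90]]
print(replace_unknowns(src, hyp, att, {"etterbuil": "abscess"}))
# ['The', 'abscess', 'is', 'big']
```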
As "etterbuil" is OOV, the model generates "a <unk>" because it is more likely to be correct than "an <unk>" given the information available to the model.
With subword tokenization like BPE, this issue would certainly not appear.
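If you stay with whole-word replacement, one possible workaround (not an OpenNMT feature, just a sketch) is a small post-edit rule that corrects the article once the unknown token has been substituted:

```python
# Illustrative post-edit rule, not part of OpenNMT: after the phrase-table replacement
# has been applied, fix "a" -> "an" before a vowel-initial word (and vice versa).
# It ignores the usual exceptions ("an hour", "a university").
import re

def fix_articles(sentence: str) -> str:
    sentence = re.sub(r"\ba (?=[aeiouAEIOU])", "an ", sentence)
    sentence = re.sub(r"\ban (?=[^aeiouAEIOU\W])", "a ", sentence)
    return sentence

print(fix_articles("He suffers from a abscess"))   # He suffers from an abscess
```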
This is quite fascinating. If I request "De etterbuil van mijn vader is groot", I get
"My father's abscess is big" and NOT "The abscess of my father is big." So the network has cleverly learned how to form the English possessive and applies it to what is then an unknown token.
It didn't really learn how to build a possessive. It's more "basic" than that. It rather learned that this kind of token arrangement (possibly including an unk token) must be transformed into that other kind of token arrangement (possibly including an unk token), because, statistically, that is what was done in the examples.
It would have been great if this fact were put in the documentation. From the documentation I got the understanding that OOVs are checked for during the translation process. But as mentioned here, they are only handled afterwards, when the model's attention produces an UNK. I am still figuring out how to get this done during the translation process itself. Any pointers on where I should look in the code would be helpful.