Hi Guillaume,
I can't get the -phrase_table option to work and need some help with it.
Let us say there is a word in my test file, ‘srijan’, which does not appear in the training/validation dataset.
Predicted word:
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।
As far as I can tell, the model is not looking in the phrase table and is instead outputting ‘हैदराबाद ।’, which probably got the highest attention.
I want the translated output to be ‘सृजन’ instead.
Any help would be highly appreciated.
Thanks in advance.
Hi Vermasrijan,
I have just tried this again. In my Indonesian-English model I type “Saya makan alpukat di restoran”. I know that “alpukat” is - strangely - not in the training data. The model gives me “I eat avocado in the restaurant” taking the translation of avocado from the phrase table. My only difference with your command is that I put the phrase_table option before -replace_unk. Did you try your test with slightly longer sentences?
Terence
I would check your phrase table entries.
If you look at the OpenNMT-lua documentation you’ll see that each entry in the phrase table should be:
source|||target
where source and target are case sensitive and single tokens.
Thus, for instance, your entry srijan ||| सृजन should be srijan|||सृजन .
So, in your example, using your phrase table as it is, only the word form Srijan would be translated properly (as you want).
But, keep in mind that the phrase table is applied on the source word retrieved with the highest attention probability, so it could be that the source word you want to tackle in a particular sentence doesn’t match the one considered by the model.
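To make the two points above concrete, here is a minimal Python sketch of the idea. It is not OpenNMT's code. It assumes the lookup is an exact string match on the source token, that the unknown token is written `<unk>`, and that a word with no table entry is copied from the source. The function names and the attention values are made up for illustration.

```python
def load_phrase_table(lines):
    """Parse 'source|||target' lines into a dict, keeping fields exactly as written."""
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if "|||" not in line:
            continue  # skip malformed or empty lines
        source, target = line.split("|||", 1)
        table[source] = target
    return table


def replace_unknowns(source_tokens, predicted_tokens, attention, table, unk="<unk>"):
    """For each unknown target token, look up the most-attended source token."""
    output = []
    for token, weights in zip(predicted_tokens, attention):
        if token != unk:
            output.append(token)
            continue
        # Index of the source token with the highest attention weight.
        best = max(range(len(source_tokens)), key=lambda i: weights[i])
        source_word = source_tokens[best]
        # Fall back to copying the source word when there is no entry.
        output.append(table.get(source_word, source_word))
    return output


source = ["my", "name", "is", "srijan"]
predicted = ["<unk>"]

table = load_phrase_table(["srijan|||सृजन"])
good = replace_unknowns(source, predicted, [[0.1, 0.1, 0.1, 0.7]], table)

# Attention lands on "name" instead of "srijan": the table entry is never consulted.
missed = replace_unknowns(source, predicted, [[0.1, 0.6, 0.1, 0.2]], table)

# With spaces around the separator the key becomes "srijan " and no longer matches.
spaced = load_phrase_table(["srijan ||| सृजन"])
```

In this sketch `good` contains the phrase table translation, while `missed` contains the copied source word `name`, even though the table itself is correct.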
Thanks for replying @emartinezVic. I was wondering if there is a way to look in the phrase_table directly, bypassing the attention mechanism (for the OOV words in my test data).
For example: let us say I get some ‘N’ new OOV words every 3 months (words that were not in the training set). I want to make a phrase table for these ‘N’ words so that I can look up their translations directly in the table.
Basically, I am just looking for a way to deal with OOV words.
I thought of 3 approaches to deal with this:
First, to use the phrase table, which OpenNMT provides.
Second, to capture all the OOV words from my test data, send them to some other translator (like Google Translate, since GNMT has been trained on some billions of examples) and get their translations back.
Third, if an OOV word appears in my test set, find a way to copy the exact source OOV word into my target translation (at a position that is structurally correct relative to the rest of the words in the sentence).
Insights on any of these three approaches would be really helpful to me.
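As a rough illustration of the first two approaches combined, the Python sketch below collects the OOV words from the test sentences and writes phrase table lines for those that have a translation. It assumes whitespace tokenization and that the vocabulary is available as a set; `find_oov`, `build_phrase_table` and the sample data are hypothetical, and the translations would come from whatever external source is used.

```python
def find_oov(sentences, vocabulary):
    """Return the tokens that are not in the vocabulary, in order of first appearance."""
    seen = set()
    oov = []
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocabulary and token not in seen:
                seen.add(token)
                oov.append(token)
    return oov


def build_phrase_table(oov_words, translations):
    """Format 'source|||target' lines for every OOV word that has a translation."""
    return [
        f"{word}|||{translations[word]}"
        for word in oov_words
        if word in translations
    ]


# Illustrative data only.
vocabulary = {"my", "name", "is", "from", "hyderabad"}
test_sentences = ["my name is srijan", "srijan is from hyderabad"]
translations = {"srijan": "सृजन"}  # e.g. collected from an external translator

oov_words = find_oov(test_sentences, vocabulary)
phrase_table_lines = build_phrase_table(oov_words, translations)
```

The resulting lines could be written to a file and regenerated whenever a new batch of OOV words arrives.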
Hi @vermasrijan, In my production set-up I accomplish something along the lines you are describing in a post-processing stage, i.e. outside OpenNMT. The routine checks the OpenNMT output for untranslated words and retrieves the translations from an external dictionary. Unlike the OpenNMT phrase table, this dictionary can include multiword expressions.
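A minimal Python sketch of such a post-processing step might look like the following. It is not the actual production routine: it assumes the untranslated words appear verbatim in the output, that tokens are separated by spaces, and that the dictionary is a plain mapping from source phrases to target phrases. The function name, the `max_len` parameter and the entries are illustrative.

```python
def apply_dictionary(tokens, dictionary, max_len=3):
    """Replace dictionary phrases in the output, preferring the longest match."""
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in dictionary:
                output.extend(dictionary[phrase].split())
                i += n
                break
        else:
            # No entry starts at this position: keep the token unchanged.
            output.append(tokens[i])
            i += 1
    return output


# Illustrative entries, including one multiword expression.
dictionary = {"alpukat": "avocado", "rumah makan": "restaurant"}
raw_output = "I eat alpukat at rumah makan".split()
fixed = apply_dictionary(raw_output, dictionary)
```

Trying the longest phrase first is what lets a multiword entry take priority over any single-word entries it contains.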
I see that the -phrase_table option is absolutely useless when BPE (or a similar technique) is used.
I tried manually joining a term's chunks (split by BPE + joiner annotate + case feature) into a single word, hoping that joined_term|N would be treated as an OOV and thus handled by the phrase table, and passed that string to translate.lua. However, the term's translation did not come from the phrase table; I only got some gibberish.
Hi Srijan,
I am encountering the same problem with phrase tables. I want the words to be replaced with the corresponding entries in my phrase table, but that is not working as I expected. How did you solve the problem?