-phrase_table option

vermasrijan · July 7, 2018, 11:44am

Hi Guillaume,
I can't get the -phrase_table option to work and would appreciate some help.
Say there is a word in my test file, ‘srijan’, which does not occur in the training/validation data.

My phrase table:
srijan ||| सृजन
srijan . ||| सृजन .
srijan .|||सृजन .
Srijan|||सृजन

My src-test file:
srijan
Srijan .
Srijan .
Hello .
home ground

Command used:
th translate.lua -replace_unk true -phrase_table data/PhraseTable.txt -model model_epoch50_22.66.t7 -src data/src-test1.txt -output pred1.txt

Predicted word:
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।

As far as I can tell, the model is not consulting the phrase table and instead outputs ‘हैदराबाद ।’, which probably got the highest attention.
I want the translated output to be ‘सृजन’ instead.

Any help would be highly appreciated.
Thanks in advance.

tel34 · July 7, 2018, 7:25pm

Hi Vermasrijan,
I have just tried this again. In my Indonesian-English model I type “Saya makan alpukat di restoran”. I know that “alpukat” is, strangely, not in the training data. The model gives me “I eat avocado in the restaurant”, taking the translation of avocado from the phrase table. My command differs from yours only in that I put the -phrase_table option before -replace_unk. Did you try your test with slightly longer sentences?
Terence

emartinezVic · July 9, 2018, 7:30am

Hi @vermasrijan!

I would check your phrase table entries.
If you look at the OpenNMT-lua documentation you’ll see that each entry in the phrase table should be:

source|||target

where source and target are case sensitive and single tokens.
Thus, for instance, your entry srijan ||| सृजन should be written srijan|||सृजन, with no spaces around the separator.
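
As a rough illustration of the rule above, here is a small Python sketch that accepts only entries of the form source|||target with single tokens and no padding. It is not OpenNMT code, and the names check_entry and load_strict are made up for this example. It applies the rule exactly as stated in this reply; the real loader may well treat whitespace more leniently.

```python
SEPARATOR = "|||"


def check_entry(line):
    """Return (source, target) for a strict source|||target line, else None."""
    parts = line.rstrip("\n").split(SEPARATOR)
    if len(parts) != 2:
        return None
    source, target = parts
    for field in (source, target):
        # Reject empty fields and any whitespace, which covers both padding
        # around the separator and multi-token entries such as "srijan ."
        if not field or any(ch.isspace() for ch in field):
            return None
    return source, target


def load_strict(lines):
    """Split lines into a lookup table and a list of rejected lines."""
    table, rejected = {}, []
    for line in lines:
        entry = check_entry(line)
        if entry is None:
            rejected.append(line)
        else:
            table[entry[0]] = entry[1]
    return table, rejected


# The four entries from the question, copied as posted.
entries = [
    "srijan ||| सृजन",
    "srijan . ||| सृजन .",
    "srijan .|||सृजन .",
    "Srijan|||सृजन",
]
table, rejected = load_strict(entries)
```

Here table ends up with a single key and rejected holds the other three lines.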

So, in your example, using your phrase table as it is, only the word form Srijan would be translated properly (as you want).
But, keep in mind that the phrase table is applied on the source word retrieved with the highest attention probability, so it could be that the source word you want to tackle in a particular sentence doesn’t match the one considered by the model.
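
To make the attention point concrete, the following Python sketch mimics that behaviour under simplifying assumptions: unknown words in the decoder output are marked with a literal <unk> token, and attention is given as one row of weights per target token. None of this is OpenNMT's actual code; replace_unknowns and the toy weights are hypothetical.

```python
def replace_unknowns(source_tokens, target_tokens, attention, phrase_table,
                     unk="<unk>"):
    """Replace each unk in the target using the most-attended source token.

    attention[i][j] is the weight given to source token j while target
    token i was generated (an assumed layout for this sketch).
    """
    output = []
    for i, token in enumerate(target_tokens):
        if token != unk:
            # Regular vocabulary words are left untouched.
            output.append(token)
            continue
        weights = attention[i]
        best = max(range(len(source_tokens)), key=lambda j: weights[j])
        aligned = source_tokens[best]
        # Use the table entry if there is one, otherwise copy the source token.
        output.append(phrase_table.get(aligned, aligned))
    return output


phrase_table = {"Srijan": "सृजन"}
source = ["Srijan", "."]
target = ["<unk>", "।"]

# Attention peaks on "Srijan": the table entry is used.
good = replace_unknowns(source, target, [[0.9, 0.1], [0.2, 0.8]], phrase_table)

# Attention peaks on ".": the lookup misses and "." is copied instead.
bad = replace_unknowns(source, target, [[0.3, 0.7], [0.2, 0.8]], phrase_table)
```

As I understand it, the lookup is only attempted at positions where the model actually produced an unknown token, so a confident but wrong vocabulary word in the output would never reach the phrase table.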

Best,
Eva

vermasrijan · July 9, 2018, 8:11am

Thanks for replying @emartinezVic.
@emartinezVic, I was wondering if there is a way to look up the phrase table directly, bypassing the attention mechanism, for the OOV words in my test data.
For example: let us say I will get some ‘N’ new OOV words every 3 months (words that were not in the training set). I want to build a phrase table for these ‘N’ words so that I can look up their translations directly.
Basically, I am just looking for a way to deal with OOV words.
I thought of 3 approaches to deal with this:
First, to use the phrase table, which OpenNMT provides.
Second, to capture all the OOV words from my test data, send them to some other translator (like Google Translate, since GNMT has been trained on billions of data points) and then get their corresponding translations.
Third, if an OOV word is there in my test set, somehow find a way to copy the exact source OOV word into my target translation (at a position that is structurally correct with respect to the rest of the words in the sentence).

Insights on any of the three approaches above would be really helpful to me.

Thanks in advance!
Srijan

vermasrijan · July 9, 2018, 8:12am

Thanks for replying Terence.
Sure, I will try this out with a longer sentence and check the results.

Thanks!

tel34 · July 10, 2018, 8:35am

Hi @vermasrijan, in my production set-up I accomplish something along the lines you are describing in a post-processing stage, i.e. outside OpenNMT. The routine checks the OpenNMT output for untranslated words and retrieves the translations from an external dictionary. Unlike the OpenNMT phrase table, this dictionary can include multiword expressions.
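
A post-processing step of this kind could look roughly like the Python sketch below. It is not the routine used in that production set-up: apply_dictionary, the max_len limit, and the two dictionary entries are assumptions for illustration, and "untranslated" is taken to mean any output span that appears as a dictionary key. Longer matches are tried first so that multiword expressions win over their parts.

```python
def apply_dictionary(output_tokens, dictionary, max_len=4):
    """Replace dictionary keys found in the output, longest match first.

    Keys are space-joined token sequences of up to max_len tokens;
    values are space-separated replacement tokens.
    """
    result = []
    i = 0
    while i < len(output_tokens):
        longest = min(max_len, len(output_tokens) - i)
        for n in range(longest, 0, -1):
            key = " ".join(output_tokens[i:i + n])
            if key in dictionary:
                result.extend(dictionary[key].split())
                i += n
                break
        else:
            # No key starts at this position: keep the token as it is.
            result.append(output_tokens[i])
            i += 1
    return result


# Illustrative Indonesian-English entries, one of them a multiword expression.
dictionary = {"alpukat": "avocado", "rumah sakit": "hospital"}
mt_output = "I eat alpukat near the rumah sakit".split()
fixed = " ".join(apply_dictionary(mt_output, dictionary))
```

Because this runs on the finished translation, it does not depend on attention weights at all, which is the main difference from the built-in phrase table.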

wiktor.stribizew · September 13, 2018, 7:29am

As far as I can see, the -phrase_table option is absolutely useless when BPE (or a similar technique) has been used.

I tried manually joining a term's chunks (split by BPE + joiner annotate + case feature) into a single word, hoping that joined_term|N would be treated as an OOV and thus handled by the phrase table, and passed that string to translate.lua. The term translation did not come from the phrase table, though; I only got gibberish.
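
One likely reason, sketched here in Python under toy assumptions: with BPE an unseen word is broken into pieces that are all in the vocabulary, so no unknown token appears for -replace_unk to act on. The merge list below is invented for illustration and apply_bpe is a minimal stand-in, not the OpenNMT tokenizer (joiner and case-feature annotations are left out).

```python
INF = float("inf")


def apply_bpe(word, merges):
    """Segment a word by applying learned merges in priority order."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        candidates = [
            (ranks.get((left, right), INF), index)
            for index, (left, right) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, index = min(candidates)
        if best_rank == INF:
            break  # no learned merge applies any more
        symbols[index:index + 2] = [symbols[index] + symbols[index + 1]]
    return symbols


# Hypothetical merges, as if learned from other words in the training data.
merges = [("s", "r"), ("j", "a"), ("ja", "n")]
pieces = apply_bpe("srijan", merges)

# A subword vocabulary holds single characters plus every merged symbol,
# so every piece of the unseen word is a known token.
vocab = set("srijan") | {left + right for left, right in merges}
all_known = all(piece in vocab for piece in pieces)
```

With every piece known, the model translates subwords rather than emitting an unknown token, and a table keyed on whole words has nothing to match against.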

tel34 · September 13, 2018, 8:44am

I agree. This is one of the reasons why I have not used it in production set-ups.

kreeteeekah · June 10, 2019, 8:00am

Hi Srijan,
I am encountering the same problem with phrase tables. I want words to be replaced with the corresponding entries in my phrase table, but that is not working as I expected. How did you solve the problem?