-phrase_table option


(Vermasrijan) #21

Hi Guillaume,
I am somehow not able to use the -phrase_table functionality, and need help with it.
Let us say there is a word in my test file, ‘srijan’, which is not in the training/validation dataset.

My phrase table:
srijan ||| सृजन
srijan . ||| सृजन .
srijan .|||सृजन .
Srijan|||सृजन

My src-test file:
srijan
Srijan .
Srijan .
Hello .
home ground

Command used:
th translate.lua -replace_unk true -phrase_table data/PhraseTable.txt -model model_epoch50_22.66.t7 -src data/src-test1.txt -output pred1.txt

Predicted word:
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।
हैदराबाद ।

As far as I can tell, the model is not looking in the phrase table and is instead giving the output ‘हैदराबाद ।’, which probably got the highest attention.
I want the translated output to be ‘सृजन’ instead.

Any help would be highly appreciated.
Thanks in advance.


(Terence Lewis) #23

Hi Vermasrijan,
I have just tried this again. In my Indonesian-English model I typed “Saya makan alpukat di restoran”. I know that “alpukat” is, strangely, not in the training data. The model gives me “I eat avocado in the restaurant”, taking the translation of avocado from the phrase table. The only difference from your command is that I put the -phrase_table option before -replace_unk. Did you try your test with slightly longer sentences?
Terence


(Eva) #24

Hi @vermasrijan!

I would check your phrase table entries.
If you look at the OpenNMT-lua documentation you’ll see that each entry in the phrase table should be:

source|||target

where source and target are case-sensitive single tokens.
Thus, for instance, your entry srijan ||| सृजन should be srijan|||सृजन.
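
A quick way to check a phrase table against that format is sketched below in Python. It assumes the strict source|||target layout described above, one entry per line; whether OpenNMT itself tolerates surrounding spaces is not something this sketch verifies, and the name `check_phrase_table` is made up for illustration.

```python
def check_phrase_table(lines):
    """Split phrase-table lines into valid entries and reported problems.

    Returns (entries, problems):
      entries  - dict mapping source token -> target token
      problems - list of (line_number, reason) for rejected lines
    """
    entries, problems = {}, []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        parts = line.split("|||")
        if len(parts) != 2:
            problems.append((number, "expected exactly one '|||' separator"))
            continue
        source, target = parts
        if source != source.strip() or target != target.strip():
            problems.append((number, "spaces around '|||' or at line ends"))
            continue
        if not source or not target:
            problems.append((number, "empty source or target"))
            continue
        if len(source.split()) != 1 or len(target.split()) != 1:
            problems.append((number, "source and target must be single tokens"))
            continue
        entries[source] = target  # keys stay case sensitive
    return entries, problems


# The four entries from the original question.
table = [
    "srijan ||| सृजन",
    "srijan . ||| सृजन .",
    "srijan .|||सृजन .",
    "Srijan|||सृजन",
]
entries, problems = check_phrase_table(table)
print(entries)
for number, reason in problems:
    print(number, reason)
```

Under this strict reading, only the last entry survives, and the first three lines are flagged.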

So, in your example, with your phrase table as it is, only the word form Srijan would be translated as you want.
But keep in mind that the phrase table is applied to the source word with the highest attention probability, so the source word you want to handle in a particular sentence may not be the one the model picks.
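
To make that concrete, here is a toy Python sketch of the replacement step as described above: for each unknown token in the output, take the source word with the highest attention weight, look it up in the phrase table, and copy the source word if there is no entry. This is only an illustration, not OpenNMT's actual code; `replace_unknowns`, the `<unk>` marker and the attention numbers are all made up.

```python
def replace_unknowns(source_tokens, target_tokens, attention, phrase_table, unk="<unk>"):
    """Replace each unk in target_tokens using attention and a phrase table.

    attention[i][j] is the weight that target position i puts on source
    position j (toy values here, not real model output).
    """
    output = []
    for i, token in enumerate(target_tokens):
        if token != unk:
            output.append(token)  # only unknown tokens are ever touched
            continue
        weights = attention[i]
        # Index of the source word with the highest attention weight.
        best = max(range(len(source_tokens)), key=lambda j: weights[j])
        source_word = source_tokens[best]
        # Phrase-table hit if there is one, otherwise copy the source word.
        output.append(phrase_table.get(source_word, source_word))
    return output


phrase_table = {"Srijan": "सृजन"}
source = ["Srijan", "."]
target = ["<unk>", "।"]

# Attention points at "Srijan": the phrase-table entry is used.
hit = replace_unknowns(source, target, [[0.9, 0.1], [0.2, 0.8]], phrase_table)
# Attention points at "." instead: the lookup misses and "." is copied.
miss = replace_unknowns(source, target, [[0.3, 0.7], [0.2, 0.8]], phrase_table)
print(hit)
print(miss)
```

In this sketch nothing is replaced unless the output actually contains the unknown marker, so a sentence that the model translates into ordinary in-vocabulary words is left as it is.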

Best,
Eva


(Vermasrijan) #25

Thanks for replying @emartinezVic.
@emartinezVic, I was wondering if there is a way to look in the phrase_table directly, bypassing the attention mechanism (for the OOV words in my test data).
For example: let us say I will have some ‘N’ new words (OOV words) every 3 months. (These words would not be in the training set.) I want to make a phrase table for these ‘N’ words so that I can look up their translations directly in the table.
Basically, I am just looking for a way to deal with OOV words.
I thought of 3 approaches to deal with this:
First, use the phrase table, which OpenNMT provides.
Second, capture all the OOV words from my test data, send them to some other translator (like Google Translate, since GNMT has been trained on a much larger amount of data) and then get their corresponding translations.
Third, if an OOV word is in my test set, somehow find a way to copy the exact source OOV word into my target translation (the position of the OOV word in my translation should be structurally correct with respect to the rest of the words in the sentence).
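
All three approaches start from knowing which test words are OOV. A rough Python sketch of that step follows, assuming whitespace-tokenized text and a training vocabulary available as a set; the name `find_oov` and the sample vocabulary are made up for illustration.

```python
def find_oov(test_sentences, vocabulary):
    """Return OOV tokens in first-seen order, without repeats (case sensitive)."""
    seen, oov = set(), []
    for sentence in test_sentences:
        for token in sentence.split():
            if token not in vocabulary and token not in seen:
                seen.add(token)
                oov.append(token)
    return oov


# Hypothetical training vocabulary; a real one would come from the
# dictionary file produced during preprocessing.
vocabulary = {"Hello", ".", "home", "ground"}
test_sentences = ["srijan", "Srijan .", "Srijan .", "Hello .", "home ground"]
oov_words = find_oov(test_sentences, vocabulary)
print(oov_words)
```

The resulting list is what would go into a phrase table, or be sent to an external translator.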

Any insights on any of the three approaches would be really helpful to me.

Thanks in advance!
Srijan


(Vermasrijan) #26

Thanks for replying Terence.
Sure, I will try this out with a longer sentence and check the results.

Thanks!


(Terence Lewis) #28

Hi @vermasrijan, in my production set-up I accomplish something along the lines you describe in a post-processing stage, i.e. outside OpenNMT. The routine checks the OpenNMT output for untranslated words and retrieves their translations from an external dictionary. Unlike the OpenNMT phrase table, this dictionary can include multiword expressions.
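
A minimal Python sketch of such a post-processing pass is shown below. It assumes untranslated words appear in the output as copied source words and that the external dictionary is a plain mapping from token tuples to translations. The function `apply_dictionary` and the sample entries are hypothetical and are not the production routine described above.

```python
def apply_dictionary(tokens, dictionary, max_len=3):
    """Replace dictionary phrases in tokens, preferring the longest match.

    dictionary maps tuples of source tokens to a translation string, so
    multiword expressions are ordinary entries with longer keys.
    """
    output, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in dictionary:
                output.extend(dictionary[key].split())
                i += n
                break
        else:
            # No entry starts at this position: keep the token unchanged.
            output.append(tokens[i])
            i += 1
    return output


# Illustrative entries only.
dictionary = {
    ("alpukat",): "avocado",
    ("rumah", "makan"): "restaurant",
}
tokens = "I eat alpukat in the rumah makan".split()
result = " ".join(apply_dictionary(tokens, dictionary))
print(result)
```

Matching the longest phrase first keeps a multiword entry from being split up by a shorter entry that starts with the same word.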