Lexical constraints results


(bsbor) #1

Hello,
I’m trying to use the lexical constraint decoding feature.

Here is my translate command: th translate.lua -model mymodel -src filetotranslate -output outputpath -replace_unk -lexical_constraints -phrase_table mytable -gpuid 1

I’ve run several tests with only one line in my phrase table file.
I didn’t expect those results.

I’ve found that if the target word of a source/target pair in the phrase table is OOV for our model, lexical constraint decoding does nothing and doesn’t replace the source word. Is this expected?

I also get a strange output in the following test.
The file to translate contains single-word lines, for example: Bonjour.
If I use a phrase table with “Bonjour|||test”, lexical constraint decoding does not replace “Bonjour” with “test”. Moreover, the translation repeats the word Hello (the same translation without lexical constraint decoding gives the correct output, “Hello”).

If it helps, in some tests lexical constraint decoding does correctly replace the word I put in my phrase table.

Do you know what is happening? Have you obtained good results with lexical constraint decoding?

Bsbor


(Guillaume Klein) #2

cc @natsegal


(Natalia Segal) #3

Hello @bsbor,

The -lexical_constraints option is intended to enforce the presence of in-vocabulary words.
If you want to use your phrase table for OOV replacements, in addition to lexical constraints, you should explicitly ask for it by using the -replace_unk option.

As for your second question, I cannot reproduce your result with my FREN model. I get “Hello test” or “Good test .” as translations with lexical constraints (with or without a period).
Note, however, that lexical constraints are intended to enforce some plausible translation, from the model’s point of view.
The typical use would be to enforce a consistent terminology over the whole translation process.
“Test” is a very unlikely translation for “Bonjour” in FREN, so the model might easily produce some unexpected results if forced to maintain “test” in the output.

Note as well that this is still work in progress; so far we have only used this option to maintain some placeholders in the translation output.
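
To see in advance which phrase-table entries fall into which case discussed above (target in the model vocabulary vs. OOV target), a small script along these lines may help. This is only a minimal sketch, not part of OpenNMT: the function names are hypothetical, the vocabulary is a hand-written set used for illustration, and it assumes the phrase table uses one “source|||target” pair per line, as in the example from this thread. Tokens would also have to be compared in the same tokenization the model uses.

```python
def parse_phrase_table(lines):
    """Yield (source, target) pairs from 'source|||target' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source, sep, target = line.partition("|||")
        if not sep:
            raise ValueError(f"missing '|||' separator: {line!r}")
        yield source.strip(), target.strip()


def split_by_vocab(pairs, target_vocab):
    """Separate pairs whose target tokens are all in target_vocab from the rest."""
    in_vocab, oov = [], []
    for source, target in pairs:
        tokens = target.split()
        if tokens and all(tok in target_vocab for tok in tokens):
            in_vocab.append((source, target))
        else:
            oov.append((source, target))
    return in_vocab, oov


# Hypothetical target vocabulary and table entries, for illustration only.
vocab = {"Hello", "test", "."}
table = ["Bonjour|||test", "Bonjour|||xyzzy"]

in_vocab, oov = split_by_vocab(parse_phrase_table(table), vocab)
print("in-vocabulary targets:", in_vocab)
print("OOV targets:", oov)
```

Entries in the second list are the ones that, per the explanation above, the constraint decoding is not meant to handle and that would fall under -replace_unk instead.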


(bsbor) #4

@natsegal Thanks a lot for your detailed answer.


(Wiktor Stribiżew) #5

Could you please share some more detailed steps on how to “maintain some placeholders in the translation output”? Say there are some entities I want to remain untranslated, as is. Do I have to worry about them already at the corpus tokenization step?

I tried to introduce specifically tokenized words in the text to translate, but I see those words disappear from the translation.

I tested that using a REST translation server:

  1. Ran th tools/rest_translation_server.lua -model ${model} -gpuid 1 -port 5678 -mode aggressive -segment_numbers -segment_alphabet_change -segment_alphabet {Han,Thai,Katakana,Hiragana} -joiner_annotate -case_feature -bpe_model ${bpe} -log_level DEBUG -phrase_table ${pt} -lexical_constraints 1 -replace_unk 1
  2. My PT file contains “bonjour.com|||another.com”
  3. Ran curl -v -H "Content-Type: application/json" -X POST -d '[{ "src" : "Hello ⦅bonjour.com⦆" }]' http://127.0.0.1:5678/translator/translate, expecting the translation “Hello(in the target lang.) another.com”.
  4. The translation did not consider the PT entry.
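
For repeated tests, the curl call in step 3 can also be scripted. The following is a rough Python equivalent, offered as a sketch only: the URL, the header, and the JSON body are taken from the command above, while the helper names (build_request, translate) are made up here and the shape of the server's JSON reply is not assumed beyond it being valid JSON.

```python
import json
import urllib.request

# Same endpoint as in the curl command above.
URL = "http://127.0.0.1:5678/translator/translate"


def build_request(sentences, url=URL):
    """Build the POST request: a JSON list of {"src": ...} objects."""
    payload = json.dumps(
        [{"src": s} for s in sentences], ensure_ascii=False
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(sentences, url=URL, timeout=10):
    """Send the request and return the decoded JSON reply (needs a running server)."""
    with urllib.request.urlopen(build_request(sentences, url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Only builds and prints the request; nothing is sent here.
req = build_request(["Hello ⦅bonjour.com⦆"])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Calling translate(["Hello ⦅bonjour.com⦆"]) would then send the same request as the curl command, provided the server from step 1 is running.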

Could you please help me understand this behavior? Is it perhaps related to http://opennmt.net/OpenNMT/tools/tokenization/#normalization and -placeholder_constraints? Adding -placeholder_constraints 1 did not help either.