Protected sequences translation

eltorre · May 27, 2019, 1:50pm

Hi all,

we are trying to use the protected sequences feature (http://opennmt.net/OpenNMT/tools/tokenization/#special-characters) in order to translate some words in a different way, e.g. given the sentence

the vessel is christened by breaking a vessel

we have an external model that tells us that the first vessel is a ship and the second one is a bottle. To help the network generalize, we use the part of speech as as placeholder.

Also, we are targeting Spanish, so the first vessel would be barco and the second one botella. If I understand it correctly, if I leave it as ｟N：vessel｠ I would get vessel on the target side.

Hence we replace the sentence with

the｟N：barco｠ is christened by breaking a ｟N：botella｠

Target side looks like

el barco es bautizado rompiendo una botella

I cannot find any documentation or feature-complete tutorial about how to do this, so I assume preprocessing and training commands are unchanged compared to a baseline OpenNMT-lua model.

But, when translating the sentence, I get <unk> in place of the placeholders. When using -replace-unknown I get a random nearby word copied instead the placeholder.

I do not know what is exactly happening (or even if I am doing it correctly). The only thing that occurs to me is that ｟N｠is somehow not getting into the vocabulary, and OpenNMT is not capable of properly translating it.

Can you throw some light on this?
Thank you

edit: fixed the link

guillaumekln · May 27, 2019, 6:27pm

Hi,

The usage looks correct.

After preprocessing, are the protected sequences present in the generated vocabulary files (.dict files)?

eltorre · May 28, 2019, 12:57pm

Yes, protected sequences (in the ｟N｠shape) are present in the vocabulary files.

guillaumekln · May 29, 2019, 7:23am

Does it change anything if you use -placeholder_constraints during translation?

eltorre · May 29, 2019, 10:25am

When we use -placeholder_constraints true we get an <unk> tag instead the placeholder.
Using both -placeholder_constraints and -replace_unk, produces a nearby word in the output. I assume it is the one with the highest attention.