Hi all,
We are trying to use the protected sequences feature (http://opennmt.net/OpenNMT/tools/tokenization/#special-characters) to control how certain words are translated depending on their sense. For example, given the sentence
the vessel is christened by breaking a vessel
we have an external model that tells us that the first vessel is a ship and the second one is a bottle. To help the network generalize, we use the part of speech as a placeholder.
Also, we are targeting Spanish, so the first vessel would be barco and the second one botella. If I understand it correctly, if I leave it as ⦅N:vessel⦆ I would get vessel on the target side.
Hence we replace the sentence with
the ⦅N:barco⦆ is christened by breaking a ⦅N:botella⦆
Target side looks like
el barco es bautizado rompiendo una botella
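To make the replacement step concrete, here is a minimal sketch of how the source side could be built. It is only an illustration: `protect` and the `annotations` mapping are hypothetical names standing in for the output of the external model, and the marker characters are copied as they appear in this post (the tokenizer documentation defines the exact code points).

```python
# Markers copied as they appear in this post; check the tokenizer docs
# for the exact code points expected by OpenNMT.
OPEN, CLOSE = "⦅", "⦆"


def protect(tokens, annotations):
    """Wrap annotated tokens in protected-sequence markers.

    tokens: list of source words.
    annotations: hypothetical output of the external model, mapping a
        token position to a (tag, target_word) pair.
    """
    out = []
    for i, tok in enumerate(tokens):
        if i in annotations:
            tag, target_word = annotations[i]
            out.append(f"{OPEN}{tag}:{target_word}{CLOSE}")
        else:
            out.append(tok)
    return " ".join(out)


source = "the vessel is christened by breaking a vessel".split()
# Positions 1 and 7 are the two occurrences of "vessel".
annotations = {1: ("N", "barco"), 7: ("N", "botella")}
print(protect(source, annotations))
```

With the example sentence, this prints the protected source line shown above, with a space on both sides of each placeholder.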
I cannot find any documentation or feature-complete tutorial about how to do this, so I assume preprocessing and training commands are unchanged compared to a baseline OpenNMT-lua model.
But when translating the sentence, I get <unk> in place of the placeholders. When using -replace-unknown, I get a random nearby word copied instead of the placeholder.
I do not know exactly what is happening (or even whether I am doing it correctly). The only thing that occurs to me is that ⦅N⦆ is somehow not getting into the vocabulary, so OpenNMT cannot translate it properly.
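One way to test that guess would be to check whether the placeholder tokens from a source sentence actually appear in the source dictionary produced by preprocessing. The sketch below is an assumption-laden illustration: `missing_placeholders` is a hypothetical helper, and it assumes the dictionary has one entry per line with the token as the first whitespace-separated field, which should be verified against the real file.

```python
# Opening marker copied as it appears in this post.
OPEN = "⦅"


def missing_placeholders(sentence_tokens, vocab_lines):
    """Return placeholder tokens that are absent from the vocabulary.

    vocab_lines: iterable of dictionary lines. ASSUMPTION: one entry per
        line, token in the first whitespace-separated field.
    """
    vocab = {line.split()[0] for line in vocab_lines if line.strip()}
    return [
        tok for tok in sentence_tokens
        if tok.startswith(OPEN) and tok not in vocab
    ]


# Toy dictionary lines for illustration only (not real OpenNMT output).
toy_vocab = ["the 5", "is 6", "⦅N:barco⦆ 7"]
tokens = "the ⦅N:barco⦆ is christened by breaking a ⦅N:botella⦆".split()
print(missing_placeholders(tokens, toy_vocab))
```

Any placeholder reported as missing would be mapped to <unk> at translation time, which would match the behavior described above.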
Can you shed some light on this?
Thank you
edit: fixed the link