We are using the Okapi Tikal tool to extract text from HTML files, and we restore the HTML with Tikal after translation. Because of this, the source segments for translation contain tags, and we get strange translations for these lines. A few examples:
SENT 1069: <g id="1"> Winaray </g>
PRED 1069: would the Commission be forced to insist on the situation in Western Germany ?
PRED SCORE: -43.64
SENT 1071: <g id="1"> ייִדיש </g>
PRED 1071: would the Commission be forced to insist on the situation in Western Germany ?
PRED SCORE: -43.64
SENT 1073: <g id="1"> Yorùbá </g>
PRED 1073: would the Commission be forced to insist on the situation in Western Germany ?
PRED SCORE: -43.64
Any thoughts on how we can fix this? We get the same results with and without -replace_unk.
I don't know all the placeholders in advance, so I can't train a model specifically for them. Ideally the model would return the same untranslated word at the same position, but from what I understand about how the network works, that's probably not possible directly.
So my main questions are: how can we use a glossary with higher priority than the network's translation, and how can we keep untranslated words in their original positions unchanged?
It all depends on the training data and what the model has learned.
Generally, if a source word is out of vocabulary, the network will likely translate it as an unknown word, and the -replace_unk feature might then be able to copy the source word into the target. But remember that it is not doing word-by-word translation, so it might simply drop the word, and the copy can also fail because the source word is unknown.
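For illustration, here is a rough sketch of how an attention-based unknown-word replacement works. The function name and the attention format are my own assumptions, not the actual -replace_unk implementation:

```python
def replace_unk(src_tokens, hyp_tokens, attention):
    """For each <unk> in the hypothesis, copy the source token that received
    the highest attention weight at that decoding step.
    `attention` is assumed to be one row of per-source-token weights
    for each target token."""
    out = []
    for t, tok in enumerate(hyp_tokens):
        if tok == "<unk>":
            best_src = max(range(len(src_tokens)), key=lambda s: attention[t][s])
            out.append(src_tokens[best_src])
        else:
            out.append(tok)
    return out
```

This only helps when the attention for the `<unk>` step actually points at the right source word; if the alignment is diffuse, the copy fails, which is exactly the caveat above.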
A more consistent approach is to apply the same logic as for named entities to the training data.
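As a sketch of that idea (the placeholder token, regex, and function names are assumptions, not an existing tool): replace the inline tags with a single reserved placeholder token in both the training and inference data, let the model learn to copy that token, and restore the original tags afterwards.

```python
import re

TAG_RE = re.compile(r'<g id="\d+">|</g>')
PLACEHOLDER = "｟tag｠"  # assumed reserved token; it must also appear in the training data

def protect(line):
    """Replace inline tags with a placeholder and remember the originals."""
    tags = TAG_RE.findall(line)
    return TAG_RE.sub(PLACEHOLDER, line), tags

def restore(translated, tags):
    """Put the original tags back in place of the placeholders, in order."""
    for tag in tags:
        translated = translated.replace(PLACEHOLDER, tag, 1)
    return translated

src, saved = protect('<g id="1"> Winaray </g>')
hyp = src  # stand-in for the model output; in practice run the NMT system on `src`
print(restore(hyp, saved))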
Personal thoughts: you could just build a radix tree with all the words you can translate and query it before each neural translation.
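A minimal sketch of that lookup, using a plain character trie rather than a compressed radix tree (class and function names are hypothetical):

```python
class Trie:
    """Small trie: characters as edges, optional translation stored at a node."""
    def __init__(self):
        self.children = {}
        self.translation = None

    def insert(self, word, translation):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.translation = translation

    def lookup(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.translation

glossary = Trie()
glossary.insert("Winaray", "Winaray")  # keep language names untranslated

def translate(segment, nmt):
    # query the glossary first, fall back to the neural model otherwise
    hit = glossary.lookup(segment.strip())
    return hit if hit is not None else nmt(segment)

print(translate("Winaray", nmt=lambda s: "<model translation>"))
```

The same lookup can also be run per token or per phrase before decoding, so glossary entries always take priority over the network's output.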