There is -replace_unk, but it doesn't work perfectly — sometimes it copies the wrong source word.
Some people recommended using subword tokenization.
I tried OpenNMT Tokenizer with SentencePiece unigram and BPE models, but the output doesn't contain any
<unk> tokens.
So, what am I doing wrong?
Maybe there is a way to make translate.py leave certain words untranslated?
For example, SentencePiece has placeholders (or protected sequences): sequences of characters delimited by ｟ and ｠ that should not be segmented.
Maybe we can use it somehow?
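To make the idea concrete, here is a rough pure-Python sketch of what I mean (all names here are hypothetical, not part of OpenNMT or SentencePiece): mask each ｟...｠ span with a single placeholder token before tokenization, then restore the original text after translation. The placeholder itself would still need to survive the subword model as one token (e.g. registered as a user-defined symbol):

```python
import re

# Illustrative pre/post-processing sketch, not an actual OpenNMT feature.
# Spans wrapped in ｟ ｠ are swapped for numbered placeholder tokens so the
# subword tokenizer never sees (and never segments) the protected text.
PROTECTED = re.compile(r"｟[^｟｠]*｠")

def mask_protected(text):
    """Replace each ｟...｠ span with a numbered placeholder token."""
    spans = PROTECTED.findall(text)
    for i, span in enumerate(spans):
        text = text.replace(span, f"PH{i}", 1)
    return text, spans

def unmask_protected(text, spans):
    """Restore the original spans (without delimiters) after translation."""
    for i, span in enumerate(spans):
        text = text.replace(f"PH{i}", span[1:-1], 1)
    return text

masked, spans = mask_protected("Open the file ｟config.yaml｠ now")
# masked == "Open the file PH0 now"
restored = unmask_protected(masked, spans)
# restored == "Open the file config.yaml now"
```

I'm not sure whether something like this already exists in the tokenization pipeline, or whether the placeholder would be kept intact all the way through translate.py — that's basically my question.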