There is -replace_unk, but it doesn't work perfectly - sometimes it substitutes the wrong source word.
Some people recommended using subword tokenization.
I tried OpenNMT Tokenizer with SentencePiece unigram and BPE models.
But then there are no <unk> tokens in the output at all.
So, what am I doing wrong?
Maybe there is a way to make translate.py not translate certain words?
For example, SentencePiece has placeholders (or protected sequences): a sequence of characters delimited by ⦅ and ⦆ that should not be segmented.
Maybe we can use that somehow?
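As a rough illustration of the placeholder idea, here is a minimal pure-Python sketch (not Tokenizer or SentencePiece code) of wrapping words that must stay untranslated in the protected-sequence markers before tokenization and stripping them from the decoded output. The helper names are hypothetical, and whether ⦅ ⦆ are the exact markers your tokenizer build recognizes should be checked against its documentation.

```python
import re

# Hypothetical helpers illustrating the placeholder idea: wrap words that
# must not be translated in protected-sequence markers (here U+2985/U+2986,
# i.e. the brackets mentioned above) so the tokenizer leaves them intact,
# then strip the markers from the model output.
PROTECT_OPEN, PROTECT_CLOSE = "\u2985", "\u2986"  # ⦅ ⦆

def protect(sentence, keep_words):
    """Wrap every word from keep_words so it is not segmented."""
    for w in keep_words:
        sentence = re.sub(
            r"\b{}\b".format(re.escape(w)),
            PROTECT_OPEN + w + PROTECT_CLOSE,
            sentence,
        )
    return sentence

def unprotect(sentence):
    """Strip the markers from the decoded output."""
    return sentence.replace(PROTECT_OPEN, "").replace(PROTECT_CLOSE, "")

src = protect("Mobile data services are suspended in Delhi.", {"Delhi"})
# → "Mobile data services are suspended in ⦅Delhi⦆."
```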
> There is -replace_unk, but it doesn't work perfectly - sometimes it substitutes the wrong source word.
It seems that -replace_unk only works with the LSTM and RNN architectures.
> But then there are no <unk> tokens in the output at all.
The big benefit of byte pair encoding is that you don’t have any unknown words.
Which words shouldn't be translated? You could try to mark those words with an additional input feature; the training data would need to be appropriately labeled. If your model translates in both directions and has a shared vocabulary between both languages, the training data could also be mixed with fully marked copied monolingual data, which improves "accuracy on named entities and other words that should remain identical between the source and target languages" (cited from Copied Monolingual Data Improves Low-Resource Neural Machine Translation by Anna Currey, Antonio Valerio Miceli Barone and Kenneth Heafield). This approach doesn't force retention, but a proper copy-attention architecture could yield good results.
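The copied monolingual data idea can be sketched in a few lines of plain Python (this is corpus preparation only, not OpenNMT code; the function name is hypothetical): each target-language monolingual sentence is appended to the training corpus as a pair whose source and target sides are identical, which teaches the model to copy names and other stable tokens through unchanged.

```python
# Minimal sketch of the copied-monolingual-data augmentation from
# Currey et al.: add each target monolingual sentence as a (t, t) pair.
def add_copied_data(parallel_pairs, monolingual_target):
    """Return the parallel corpus extended with identical-copy pairs."""
    return list(parallel_pairs) + [(t, t) for t in monolingual_target]

pairs = [("hello world", "привет мир")]
mono = ["New Delhi", "OpenNMT"]
augmented = add_copied_data(pairs, mono)
# augmented[1] == ("New Delhi", "New Delhi")
```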
I tried this with my RNN model - sometimes it replaces <unk> with the wrong word.
And in the context of OpenNMT-py, RNN and LSTM are the same thing, right? -encoder_type and -decoder_type only have an rnn option.
I was thinking about names, numbers, and some rare domain-specific words. Right now I don't have numbers in my training set, because I don't know how to process them.
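One common way to handle numbers (an assumption on my part, not an OpenNMT feature) is to mask them with a placeholder token before training and translation, then substitute the originals back into the output. A minimal sketch with hypothetical helper names:

```python
import re

# Replace each number with a <num> placeholder before translation and
# restore the originals afterwards (assumes the model learns to copy
# the placeholder through to the output in order).
NUM_RE = re.compile(r"\d+(?:[.,]\d+)*")

def mask_numbers(sentence):
    """Return the masked sentence plus the extracted numbers, in order."""
    numbers = NUM_RE.findall(sentence)
    masked = NUM_RE.sub("<num>", sentence)
    return masked, numbers

def unmask_numbers(sentence, numbers):
    """Put the original numbers back, one placeholder at a time."""
    for n in numbers:
        sentence = sentence.replace("<num>", n, 1)
    return sentence

masked, nums = mask_numbers("Flight 370 left at 10.45")
# masked == "Flight <num> left at <num>", nums == ["370", "10.45"]
```

This only works reliably if the number of placeholders is preserved in the output, which is not guaranteed, so the restored sentence should be validated.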
Source: Mobile data services are suspended in some parts of Delhi close to protest sites.
Output: Услуги мобильных служб <unk> в некоторых частях дели близки к <unk> сайтам.
With -replace_unk: Услуги мобильных служб suspended в некоторых частях дели близки к sites сайтам.
Here we can see that -replace_unk replaced:
“suspended” - correctly
“protest” - with “sites”, so we end up with “sites сайтам” (the same word twice) at the end of the sentence.
And I am not sure why it doesn't know the word “protest” - if I input just the single word “protest”, it translates correctly.
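For context, attention-based <unk> replacement can be illustrated with a toy sketch (this is not OpenNMT internals, just the general idea): each <unk> in the output is replaced by the source token that received the highest attention weight at that decoding step, so a misaligned attention row copies the wrong word.

```python
# Toy illustration of attention-based <unk> replacement: for each <unk>
# output token, copy the source token with the highest attention weight
# at that step. If the attention row peaks on the wrong source position
# ("sites" instead of "protest" below), the wrong word is copied.
def replace_unk(output_tokens, src_tokens, attention):
    result = []
    for i, tok in enumerate(output_tokens):
        if tok == "<unk>":
            best = max(range(len(src_tokens)), key=lambda j: attention[i][j])
            result.append(src_tokens[best])
        else:
            result.append(tok)
    return result

src = ["services", "suspended", "near", "protest", "sites"]
out = ["услуги", "<unk>", "рядом", "<unk>"]
attn = [
    [0.9, 0.0, 0.0, 0.0, 0.1],
    [0.1, 0.8, 0.0, 0.0, 0.1],  # peaks on "suspended": correct copy
    [0.0, 0.1, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.3, 0.6],  # peaks on "sites", not "protest": wrong copy
]
print(replace_unk(out, src, attn))
# → ['услуги', 'suspended', 'рядом', 'sites']
```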