Translation: unknown words replaced with one word instead of <unk>

My colleague and I ran into the same problem.
I use an RNN, he uses Transformer models.
English->Russian.
So, when I test the model with custom sentences through translate.py without any options, I expect to see <unk> in place of OOV words. But I always see the same word (just a random Russian word, “переплетение”) instead of <unk>.
Even if I send “qwertyasdfgh”, I get “переплетение”.
What’s wrong?
I use the default options for preprocess/train/translate.

How did you generate the vocabulary? Possibly the training never saw any OOV tokens, so the model fails to encode/decode them during inference.

This is how I run preprocess and train:

onmt_preprocess -src_vocab_size 1000000 -tgt_vocab_size 1000000 -train_src data/raw/train.clean.en -train_tgt data/raw/train.clean.ru -valid_src data/raw/tune.clean.en -valid_tgt data/raw/tune.clean.ru -save_data data/demo

onmt_train -world_size 1 -gpu_ranks 0 -data data/demo -save_model data/demo-model

And yes, I cleaned my dataset of most of the rare words (misprints and so on).

So, you mean I have to generate the vocabulary with options like these:
-src_vocab_size 50000 -tgt_vocab_size 100000

But then I will lose a lot of words. Sure, I could use BPE or something similar, but if I want to use whole words as tokens, what should I do?

This vocabulary size is too large. The issue mentioned in the first post is a direct consequence of it.

Sure, I could use BPE or something similar, but if I want to use whole words as tokens, what should I do?

You could replace some aligned source/target words in the data with <unk> before starting the training. You should give the model enough examples of the <unk> token so it learns how and when to produce it.

I didn’t quite get it.
You mean I have to prepare the parallel data so that the target file sometimes contains an <unk> token? Randomly?
Or do I have to reduce the vocabulary size so that it creates <unk>?
These are my current vocabulary sizes:
src: 77k
tgt: 252k
And I am not sure how to choose -src_vocab_size and -tgt_vocab_size correctly.
If I reduce the tgt vocabulary size, it could remove even some forms of common words.
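
One way to sanity-check candidate sizes is to measure how much of the running text a given vocabulary size covers. A minimal Python sketch of such a check (this is not an OpenNMT feature; the file path is the target training file from the preprocess command above, and the candidate sizes are just examples):

from collections import Counter

# Count word frequencies in the target training file.
counts = Counter()
with open("data/raw/train.clean.ru", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

total = sum(counts.values())
freqs = sorted(counts.values(), reverse=True)

# For each candidate vocabulary size, report how many running tokens
# would still be in-vocabulary.
for size in (50000, 100000, 150000, 252000):
    covered = sum(freqs[:size])
    print(f"vocab {size}: {100.0 * covered / total:.2f}% of target tokens covered")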

As you prefer not to reduce the vocabulary size, this is indeed the approach I’m proposing. It’s not only on the target side, though: it should be an aligned <unk> token, so that the model learns to translate <unk> to <unk>.

But then the system will only learn to translate <unk> to <unk>.
When it finds an unknown word in the input sentence, how would it know that it should output an <unk> mark?

<unk> is the default token for words not found in the vocabulary.

So, finally: what do you propose?
How can I keep my vocabulary and teach the system to output <unk> marks for unknown words?

This:

E.g., you have this tokenized training example:

Hello world !  -> Bonjour le monde !

You can then write a custom preprocessing script that randomly replaces words with <unk>, e.g.:

Hello <unk> !  -> Bonjour le <unk> !
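
A minimal sketch of such a script, assuming you already have word alignments for the parallel data in Pharaoh format (“srcIdx-tgtIdx” pairs per line, e.g. produced by fast_align); the alignment file, output paths, and drop probability below are placeholders:

import random

P_DROP = 0.05  # chance of replacing an aligned word pair with <unk>

with open("data/raw/train.clean.en", encoding="utf-8") as src_in, \
     open("data/raw/train.clean.ru", encoding="utf-8") as tgt_in, \
     open("data/raw/train.align") as align_in, \
     open("data/raw/train.unk.en", "w", encoding="utf-8") as src_out, \
     open("data/raw/train.unk.ru", "w", encoding="utf-8") as tgt_out:
    for src_line, tgt_line, align_line in zip(src_in, tgt_in, align_in):
        src = src_line.split()
        tgt = tgt_line.split()
        # Replace some aligned source/target pairs with <unk> on both sides.
        for pair in align_line.split():
            i, j = map(int, pair.split("-"))
            if random.random() < P_DROP:
                src[i] = "<unk>"
                tgt[j] = "<unk>"
        src_out.write(" ".join(src) + "\n")
        tgt_out.write(" ".join(tgt) + "\n")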

If that sounds complex, just use BPE or SentencePiece. :wink:

Yes, got it. Thanks.
It was just a small misunderstanding.

One more question.
What about SentencePiece? Does it support <unk>?
I heard that BPE is used to avoid <unk>: it helps the model predict some word in any case (even if the source word is unknown).
Does that mean that if I use SentencePiece, the model will never output <unk>?
Also, I didn’t find any tutorials on using SentencePiece with OpenNMT.
Does that mean I just have to:

  1. train a SentencePiece model (on my training set? or do I have to use a separate set?)
  2. use it to tokenize my whole dataset (train, validation, test)
  3. train the RNN/Transformer model on the tokenized data
  4. apply tokenization/detokenization every time I want to translate something, by adding these options to the server.py config file:
    "tokenizer": {
        "type": "sentencepiece",
        "model": "my.model"
    }
    So OpenNMT doesn’t have to know that it works with parts of words as tokens, not whole words?
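
A minimal sketch of steps 1 and 2 above, assuming the Python sentencepiece package (the spm_train/spm_encode command-line tools work as well); file names, output paths, and vocab_size are placeholders, not values from this thread:

import sentencepiece as spm

# 1. Train a SentencePiece model on the raw (untokenized) source training data.
#    model_prefix and vocab_size are example values.
spm.SentencePieceTrainer.train(
    input="data/raw/train.clean.en",
    model_prefix="spm_en",
    vocab_size=32000,
)

# 2. Apply it to every file that goes into onmt_preprocess (train/valid/test).
#    The Russian side would be handled the same way with its own model
#    (or with a joint model trained on both languages).
sp = spm.SentencePieceProcessor(model_file="spm_en.model")
with open("data/raw/train.clean.en", encoding="utf-8") as fin, \
     open("data/sp/train.sp.en", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")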