Translation: unknown words replaced with one word instead of <unk>

My colleague and I ran into the same problem.
I use an RNN, he uses Transformer models.
English->Russian.
So, when I test the model with custom sentences through translate.py without any options, I expect to see <unk> in place of OOV words. But I always see the same word (just a random Russian word, “переплетение”) instead of <unk>.
Even if I send “qwertyasdfgh”, I get “переплетение”.
What’s wrong?
I use the default options for preprocess/train/translate.

How did you generate the vocabulary? Possibly the training never saw any OOV tokens, so the model fails to encode/decode them during inference.

This is how I run preprocess and train:

onmt_preprocess -src_vocab_size 1000000 -tgt_vocab_size 1000000 -train_src data/raw/train.clean.en -train_tgt data/raw/train.clean.ru -valid_src data/raw/tune.clean.en -valid_tgt data/raw/tune.clean.ru -save_data data/demo

onmt_train -world_size 1 -gpu_ranks 0 -data data/demo -save_model data/demo-model

And yes, I cleaned my dataset of most of the rare words (misprints and so on).

So, you mean I have to generate the vocabulary with options like these:
-src_vocab_size 50000 -tgt_vocab_size 100000

But then I will lose a lot of words. Sure, I could use BPE or something similar, but if I want to use whole words as tokens, what should I do?

This vocabulary size is too large. The issue mentioned in the first post is a direct consequence of it.

Sure, I could use BPE or something similar, but if I want to use whole words as tokens, what should I do?

You could replace some aligned source/target words in the data with <unk> before starting the training. You should give the model enough examples of the <unk> token so it learns how and when to produce it.

I didn’t quite get it.
You mean I have to prepare the parallel data so that the target file sometimes contains an <unk> token? Randomly?
Or do I have to reduce the vocabulary size so that it creates <unk>?
These are my current vocabulary sizes:
src: 77k
tgt: 252k
And I am not sure how to choose -src_vocab_size and -tgt_vocab_size correctly.
If I reduce the tgt vocabulary size, it could remove even some forms of common words.
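
One way to sanity-check candidate sizes is to measure how much of the running text a given vocabulary size covers. A minimal Python sketch of such a check (this is not an OpenNMT feature; the file path is the target training file from the preprocess command above, and the candidate sizes are just examples):

from collections import Counter

# Count word frequencies in the target training file.
counts = Counter()
with open("data/raw/train.clean.ru", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

total = sum(counts.values())
freqs = sorted(counts.values(), reverse=True)

# For each candidate vocabulary size, report how many running tokens
# would still be in-vocabulary.
for size in (50000, 100000, 150000, 252000):
    covered = sum(freqs[:size])
    print(f"vocab {size}: {100.0 * covered / total:.2f}% of target tokens covered")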

As you prefer not to reduce the vocabulary size, this is indeed the approach I’m proposing. It’s not only on the target side, though: it should be an aligned <unk> token, so that the model learns to translate <unk> to <unk>.

But then the system will only learn to translate <unk> to <unk>.
When it finds an unknown word in the input sentence, how would it know that it should output an <unk> mark?

<unk> is the default token for words not found in the vocabulary.

So, finally: what do you propose?
How can I keep my vocabulary and teach the system to output <unk> marks for unknown words?

This:

E.g., you have this tokenized training example:

Hello world !  -> Bonjour le monde !

You can then write a custom preprocessing script that randomly replaces words with <unk>, e.g.:

Hello <unk> !  -> Bonjour le <unk> !
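
A minimal sketch of such a script, assuming you already have word alignments for the parallel data in Pharaoh format (“srcIdx-tgtIdx” pairs per line, e.g. produced by fast_align); the alignment file, output paths, and drop probability below are placeholders:

import random

P_DROP = 0.05  # chance of replacing an aligned word pair with <unk>

with open("data/raw/train.clean.en", encoding="utf-8") as src_in, \
     open("data/raw/train.clean.ru", encoding="utf-8") as tgt_in, \
     open("data/raw/train.align") as align_in, \
     open("data/raw/train.unk.en", "w", encoding="utf-8") as src_out, \
     open("data/raw/train.unk.ru", "w", encoding="utf-8") as tgt_out:
    for src_line, tgt_line, align_line in zip(src_in, tgt_in, align_in):
        src = src_line.split()
        tgt = tgt_line.split()
        # Replace some aligned source/target pairs with <unk> on both sides.
        for pair in align_line.split():
            i, j = map(int, pair.split("-"))
            if random.random() < P_DROP:
                src[i] = "<unk>"
                tgt[j] = "<unk>"
        src_out.write(" ".join(src) + "\n")
        tgt_out.write(" ".join(tgt) + "\n")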

If that sounds complex, just use BPE or SentencePiece. :wink:

Yes, got it. Thanks.
It was just a small misunderstanding.

One more question.
What about SentencePiece? Does it support <unk>?
I heard that BPE is used to avoid <unk>: it helps the model predict some word in any case (even if the source word is unknown).
Does that mean that if I use SentencePiece, the model will never output <unk>?
Also, I didn’t find any tutorials on using SentencePiece with OpenNMT.
Does that mean I just have to:

  1. train a SentencePiece model (on my training set? or do I have to use a separate set?)
  2. use it to tokenize my whole dataset (train, validation, test)
  3. train the RNN/Transformer model on the tokenized data
  4. apply tokenization/detokenization every time I want to translate something, by adding these options to the server.py config file:
    "tokenizer": {
        "type": "sentencepiece",
        "model": "my.model"
    }
    So OpenNMT doesn’t have to know that it works with parts of words as tokens, not whole words?
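
A minimal sketch of steps 1 and 2 above, assuming the Python sentencepiece package (the spm_train/spm_encode command-line tools work as well); file names, output paths, and vocab_size are placeholders, not values from this thread:

import sentencepiece as spm

# 1. Train a SentencePiece model on the raw (untokenized) source training data.
#    model_prefix and vocab_size are example values.
spm.SentencePieceTrainer.train(
    input="data/raw/train.clean.en",
    model_prefix="spm_en",
    vocab_size=32000,
)

# 2. Apply it to every file that goes into onmt_preprocess (train/valid/test).
#    The Russian side would be handled the same way with its own model
#    (or with a joint model trained on both languages).
sp = spm.SentencePieceProcessor(model_file="spm_en.model")
with open("data/raw/train.clean.en", encoding="utf-8") as fin, \
     open("data/sp/train.sp.en", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")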