<unk> after sentencepiece tokenization

praveen.dakwale · December 30, 2020, 9:58am

Hi,

I trained a model on data tokenized with sentencepiece (via onmt_build_vocab), using BPE and a large enough vocab size. I don't see any <unk> tokens in the tokenized training data or in the generated vocabs. During sentencepiece training it reports that 99.99% of characters are covered. However, I see <unk> in some cases when running this model on the test set. My assumption was that we shouldn't see <unk> in translation output after sentencepiece tokenization. Is a <unk> token being added during training? The test inputs are also tokenized with the same model.
Thanks

Nart · January 2, 2021, 12:35pm

Hello Praveen,
Getting <unk> tokens means there are characters/symbols in the test inputs that are not part of the vocab.

To solve this issue, you could add the test inputs to the training data used to generate the vocab and the sentencepiece tokenizer. This only affects the preprocessing step.
Another approach is to identify those alien symbols and remove them from the test inputs.
For example, using grep in a terminal:
grep -o '[[:print:]]' train_data.txt | sort | uniq
grep -o '[[:print:]]' test_inputs.txt | sort | uniq
Any symbol that appears in the output of the 2nd grep but not in the 1st should be removed from the test inputs.
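
The same comparison can also be sketched in Python, which avoids diffing the two grep outputs by eye. This is only a rough sketch: the function names are made up for illustration, it assumes the files are UTF-8 encoded, and unlike the grep commands it ignores whitespace characters.

```python
def char_set(lines):
    """Collect the distinct non-whitespace characters found in an iterable of strings."""
    return {ch for line in lines for ch in line if not ch.isspace()}


def unseen_chars(train_lines, test_lines):
    """Return, in code point order, the characters that occur in the test lines
    but never in the training lines."""
    return sorted(char_set(test_lines) - char_set(train_lines))


def unseen_chars_in_files(train_path, test_path):
    """Same comparison, reading both inputs from text files (UTF-8 is assumed)."""
    with open(train_path, encoding="utf-8") as train, \
            open(test_path, encoding="utf-8") as test:
        return unseen_chars(train, test)
```

Calling unseen_chars_in_files("train_data.txt", "test_inputs.txt") would return the list of candidate symbols to inspect. Whether they should be removed or added to the training data is still the judgment call described below.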

Which approach to take depends on the symbols: if they are just noise, use the second approach; if they matter for the translation, use the first.