Vocabulary size with src_vocab and tgt_vocab

ArbinTimilsina · May 20, 2019, 1:11pm

I have source and target vocabulary files with 8967 and 11220 tokens, respectively. However, when I run preprocessing with the -src_vocab and -tgt_vocab flags, I get

Loaded src vocab has 8967 tokens.
Loaded tgt vocab has 11220 tokens.

* src vocab size: 8969.
* tgt vocab size: 11224.

The output vocab sizes are larger than the input ones. Do you know why this is the case? What are the extra tokens that preprocessing adds?

guillaumekln · May 20, 2019, 2:35pm

Some special tokens are added to each vocabulary: 2 on the source side and 4 on the target side.

Source:

<blank> padding
<unk> unknown

Target:

<blank> padding
<unk> unknown
<s> start of sentence
</s> end of sentence
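
As a quick sanity check, the arithmetic can be sketched in Python. This is only an illustration, not the preprocessing code itself: expected_vocab_size is a hypothetical helper, and it assumes none of the special tokens already appear in the vocabulary files.

```python
# Special tokens listed above for each side.
SRC_SPECIALS = ["<blank>", "<unk>"]
TGT_SPECIALS = ["<blank>", "<unk>", "<s>", "</s>"]


def expected_vocab_size(loaded_size, specials):
    """Hypothetical helper: loaded token count plus the added special tokens.

    Assumes no special token is already present in the vocabulary file.
    """
    return loaded_size + len(specials)


# Loaded sizes taken from the log in the question.
src_size = expected_vocab_size(8967, SRC_SPECIALS)
tgt_size = expected_vocab_size(11220, TGT_SPECIALS)

print("src vocab size:", src_size)  # 8969
print("tgt vocab size:", tgt_size)  # 11224
```

Both results match the sizes reported after preprocessing.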

ArbinTimilsina · May 20, 2019, 2:38pm

Ah, ok. That makes sense. Thanks!