OpenNMT Forum

Vocabulary size with src_vocab and tgt_vocab

I have source and target vocabulary files with 8967 and 11220 entries, respectively. However, when I run preprocessing with the -src_vocab and -tgt_vocab flags, I get

Loaded src vocab has 8967 tokens.
Loaded tgt vocab has 11220 tokens.
* src vocab size: 8969.
* tgt vocab size: 11224.

The output vocab sizes are larger than the input ones (by 2 for src and by 4 for tgt). Do you know why this is the case? What are the extra tokens that preprocessing is adding?

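Preprocessing adds a few special tokens on top of the ones in your files: two on the source side and four on the target side, which matches the difference you see.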

Source:

  • <blank> padding
  • <unk> unknown

Target:

  • <blank> padding
  • <unk> unknown
  • <s> start of sentence
  • </s> end of sentence
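
If you want to sanity-check the numbers, here is a minimal Python sketch of the arithmetic. It is not OpenNMT code: the helper name `expected_vocab_size` and the placeholder token names are made up for illustration, and the special-token lists are simply the ones listed above.

```python
# Special tokens as listed in this thread (not read from the OpenNMT source).
SRC_SPECIALS = ["<blank>", "<unk>"]
TGT_SPECIALS = ["<blank>", "<unk>", "<s>", "</s>"]


def expected_vocab_size(loaded_tokens, specials):
    """Hypothetical helper: vocab size after adding the special tokens.

    Uses a set union so a special token that already appears in the
    vocabulary file is not counted twice.
    """
    return len(set(loaded_tokens) | set(specials))


# Placeholder vocabularies with the same sizes as in the question.
src_tokens = [f"src_{i}" for i in range(8967)]
tgt_tokens = [f"tgt_{i}" for i in range(11220)]

print(expected_vocab_size(src_tokens, SRC_SPECIALS))  # 8969
print(expected_vocab_size(tgt_tokens, TGT_SPECIALS))  # 11224
```

Under this assumption, if your vocab file already contained one of the special tokens, the reported size would grow by less than 2 or 4.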

Ah, ok. That makes sense. Thanks!