I have source and target vocabulary files with 8967 and 11220 tokens, respectively. However, when I run preprocessing with the -src_vocab and -tgt_vocab flags, I get:
Loaded src vocab has 8967 tokens.
Loaded tgt vocab has 11220 tokens.
* src vocab size: 8969.
* tgt vocab size: 11224.
The output vocabulary sizes are larger than the input sizes: 2 extra tokens on the source side and 4 on the target side. Do you know why this is the case? What are the extra tokens that preprocessing adds?