Small vocabs when running build_vocab.lua on BPE-tokenized data

I’ve noticed that the past few datasets I’ve processed have ended up with surprisingly small vocabularies, and I’m not sure whether I’m missing something obvious…

Here’s what I’ve been doing; a rough command sketch follows the list:

  1. Concatenate all training data
  2. Run tools/learn_bpe.lua (default 30k, space-tokenized) on the data from step 1
  3. Run tools/tokenize.lua -bpe_model using the model from step 2 and the data from step 1
  4. Run tools/build_vocab.lua (default 50k, space-tokenized) on the output from step 3
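
In commands, roughly (file names are placeholders, and the exact flag names are from memory, so they may differ slightly from your OpenNMT checkout):

```bash
# 1. Concatenate all training data (file names are placeholders)
cat train.src train.tgt > all-train.txt

# 2. Learn BPE codes on the concatenated data (default -size is 30000)
th tools/learn_bpe.lua -size 30000 -save_bpe bpe-codes.30k < all-train.txt

# 3. Tokenize the same data with the learned BPE model
th tools/tokenize.lua -bpe_model bpe-codes.30k < all-train.txt > all-train.bpe

# 4. Build the vocabulary (default 50k cap) from the BPE output
th tools/build_vocab.lua -data all-train.bpe -save_vocab all-train
```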

The resulting vocab size is always significantly smaller than 30k. Given that we learned 30k BPE tokens on some data, shouldn’t we have a vocab of at least 30k when tokenizing that same data? :thinking: What am I missing?
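
As a rough sanity check independent of build_vocab.lua, counting the distinct space-separated tokens in the output of step 3 shows the same thing (again, the file name is just a placeholder):

```bash
# count distinct space-separated tokens in the BPE-tokenized training data
tr ' ' '\n' < all-train.bpe | grep -v '^$' | sort -u | wc -l
```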