Small vocabs when running build_vocab.lua on BPE-tokenized data


(David Landan) #1

I’ve noticed that the past few data sets I’ve processed have surprisingly small vocabularies. Not sure whether I’m missing something obvious…

Here’s what I’ve been doing (commands sketched below the list):

  1. Concatenate all training data
  2. Run tools/learn_bpe.lua (default 30k, space tokenized) on the data from step 1
  3. Run tools/tokenize.lua with -bpe_model pointing at the model from step 2, on the data from step 1
  4. Run tools/build_vocab.lua (default 50k, space tokenized) on the output from step 3
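
In command form it’s roughly the following. The file names are made up, and apart from -bpe_model the exact options and the stdin/stdout redirection are from memory, so treat this as a sketch of the pipeline rather than the literal invocations:

```bash
# 1. Concatenate all training data (illustrative file names)
cat train.* > train.all

# 2. Learn the BPE codes on the concatenated data (defaults: 30k, space tokenized)
th tools/learn_bpe.lua < train.all > train.codes

# 3. Re-tokenize the same data with the learned BPE model
th tools/tokenize.lua -bpe_model train.codes < train.all > train.all.bpe

# 4. Build the vocab from the BPE output (defaults: 50k, space tokenized)
th tools/build_vocab.lua < train.all.bpe > train.vocab
```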

The resulting vocab size is always significantly less than 30k. Given that we learned 30k BPE tokens on some data, shouldn’t we end up with a vocab of at least 30k when tokenizing that same data? :thinking: What am I missing?
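
If I understand build_vocab.lua right, with the 50k cap it is essentially just counting the unique space-separated tokens in the output of step 3, so a rough equivalent check (using the illustrative file name from the sketch above) would be:

```bash
# Count unique space-separated tokens in the BPE-segmented training data
tr -s ' ' '\n' < train.all.bpe | grep -v '^$' | sort -u | wc -l
```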