I’ve noticed that the past few data sets I’ve processed have surprisingly small vocabularies. Not sure whether I’m missing something obvious…
Here’s what I’ve been doing:
1. Concatenate all training data
2. Run tools/learn_bpe.lua (default 30k, space tokenized) on the data from step 1
3. Run tools/tokenize.lua -bpe_model with the model from step 2 on the data from step 1
4. Run tools/build_vocab.lua (default 50k, space tokenized) on the output from step 3
The resulting vocab size is always significantly less than 30k. Given that we learned 30k BPE tokens on this data, shouldn't tokenizing that same data give a vocab of at least 30k? What am I missing?
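
To make the question concrete, here's a tiny pure-Python toy of the same pipeline (this is just my own sketch of textbook BPE, not the actual Lua tools; the corpus, counts, and function names like `learn_bpe`/`tokenize` are made up for illustration). Even on this miniature corpus, the number of merge operations learned comes out larger than the number of distinct tokens left after applying those merges back to the same data:

```python
from collections import Counter

# Toy illustration only -- a minimal textbook-style BPE, not the OpenNMT Lua tools.

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` in a symbol tuple with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: count} dict (toy version of step 2)."""
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge, even if we asked for more
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the merge everywhere before learning the next one.
        corpus = {merge_pair(symbols, best): freq for symbols, freq in corpus.items()}
    return merges

def tokenize(word, merges):
    """Apply the learned merges, in order, to a single word (toy version of step 3)."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

# Made-up corpus standing in for the concatenated training data (step 1).
word_freqs = {"the": 10, "then": 3, "they": 2}

merges = learn_bpe(word_freqs, num_merges=30)  # ask for far more merges than the data supports

# Toy version of step 4: collect the distinct tokens that actually appear
# after re-tokenizing the same corpus with the learned merges.
vocab = Counter()
for word, freq in word_freqs.items():
    for token in tokenize(word, merges):
        vocab[token] += freq

print("merge operations learned:", len(merges))
print("distinct tokens in tokenized corpus:", len(vocab), sorted(vocab))
```

On this toy data it learns 4 merges but ends up with only 3 distinct tokens, because the intermediate piece created by the first merge is absorbed by a later merge and never appears in the tokenized output. Is that the same effect I'm seeing at full scale, or is something else going on?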