Small vocabs when running build_vocab.lua on BPE-tokenized data


(David Landan) #1

I’ve noticed that the past few data sets I’ve processed have surprisingly small vocabularies. Not sure whether I’m missing something obvious…

Here’s what I’ve been doing (commands sketched below the list):

  1. Concatenate all training data
  2. Run tools/learn_bpe.lua (default 30k, space tokenized) on the data from step 1
  3. Run tools/tokenize.lua with -bpe_model pointing at the model from step 2, on the data from step 1
  4. Run tools/build_vocab.lua (default 50k, space tokenized) on the output from step 3
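
In command form it’s roughly the following. The file names are made up, and apart from -bpe_model the exact options and the stdin/stdout redirection are from memory, so treat this as a sketch of the pipeline rather than the literal invocations:

```bash
# 1. Concatenate all training data (illustrative file names)
cat train.* > train.all

# 2. Learn the BPE codes on the concatenated data (defaults: 30k, space tokenized)
th tools/learn_bpe.lua < train.all > train.codes

# 3. Re-tokenize the same data with the learned BPE model
th tools/tokenize.lua -bpe_model train.codes < train.all > train.all.bpe

# 4. Build the vocab from the BPE output (defaults: 50k, space tokenized)
th tools/build_vocab.lua < train.all.bpe > train.vocab
```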

The resulting vocab size is always significantly less than 30k. Given that we learned 30k BPE tokens on some data, shouldn’t we end up with a vocab of at least 30k when tokenizing that same data? :thinking: What am I missing?
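
If I understand build_vocab.lua right, with the 50k cap it is essentially just counting the unique space-separated tokens in the output of step 3, so a rough equivalent check (using the illustrative file name from the sketch above) would be:

```bash
# Count unique space-separated tokens in the BPE-segmented training data
tr -s ' ' '\n' < train.all.bpe | grep -v '^$' | sort -u | wc -l
```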