Size of subword vocabulary?

I am using SentencePiece to build subwords for my NMT model. Currently I have around 1 million parallel sentences, and for that I am using a BPE vocab size of 10k. Now I want to experiment with 2 million sentences. What should my BPE vocab size be then? Should it be 20k, or 15k? And how do we decide that?


The best approach is to run experiments on your dataset with different vocabulary sizes.

In general, I have found that multiples of 8k are pretty common: 8k, 16k, 32k, etc.
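
As a rough sketch of such a sweep, the snippet below just builds one `spm_train` command per candidate size so each model can be trained and compared. The corpus path `train.txt` and the `bpe_<size>` model prefixes are placeholder names I made up; adjust them to your setup.

```python
import shlex


def spm_train_commands(corpus_path, vocab_sizes, model_type="bpe"):
    """Build one spm_train argument list per candidate vocabulary size."""
    commands = []
    for size in vocab_sizes:
        commands.append([
            "spm_train",
            f"--input={corpus_path}",
            f"--model_prefix=bpe_{size}",  # hypothetical naming scheme
            f"--vocab_size={size}",
            f"--model_type={model_type}",
        ])
    return commands


# Candidate sizes taken from the common multiples of 8k mentioned above.
candidate_sizes = [8000, 16000, 32000]
commands = spm_train_commands("train.txt", candidate_sizes)

for cmd in commands:
    # Print the commands; run them with subprocess.run(cmd, check=True) if desired.
    print(shlex.join(cmd))
```

You would then train one NMT model per SentencePiece model and keep the size that gives the best score on your validation set.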

@guillaumekln OK. With a 10k BPE vocab size, the preprocessing step of my NMT model reports a vocabulary of around 31k for both src and tgt. With a 15k BPE vocab size, the vocabulary reported by NMT preprocessing goes up to 42k… I am not able to decide at what point I should stop increasing the BPE vocab size. And how should I take the resulting NMT vocab size into account?
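
One way to look into the gap might be to count the distinct tokens in the already tokenized training text and compare that number with the BPE vocab size. This is only a minimal sketch: it assumes the tokenized text is whitespace-separated, and the two sample lines are made-up stand-ins for the real file.

```python
from collections import Counter


def token_counts(lines):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Made-up sample; in practice, iterate over the lines of the tokenized training file.
sample = ["▁the ▁cat ▁s at", "▁the ▁dog ▁s at"]
counts = token_counts(sample)

print("distinct tokens:", len(counts))
print("most frequent:", counts.most_common(3))
```

If the distinct token count on the real data is already far above the BPE vocab size, the extra entries come from the tokenized text itself rather than from the NMT preprocessing step.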