Size of subwords vocabulary?

I am using SentencePiece to apply subword segmentation in my NMT model. I currently have around 1 million parallel sentences, and for that I use a BPE vocab size of 10k. Now I want to experiment with 2 million sentences. What should my BPE vocab size be then? 20k? 15k? And how do we decide that?


The best approach is to run experiments on your dataset with different vocabulary sizes.

In general, I have found that multiples of 8k are pretty common: 8k, 16k, 32k, etc.
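
If it helps, here is a rough sketch of how such a sweep could be set up with the SentencePiece Python package. The helper names (`sweep_configs`, `train_all`), the `spm_bpe_<size>` model prefix, and the `train.txt` path are just placeholders for illustration, not anything prescribed by SentencePiece or by your NMT toolkit.

```python
def sweep_configs(corpus_path, sizes=(8000, 16000, 32000), model_type="bpe"):
    """Build one set of SentencePiece trainer arguments per candidate vocab size."""
    return [
        {
            "input": corpus_path,                         # raw text, one sentence per line
            "model_prefix": f"spm_{model_type}_{size}",   # hypothetical naming scheme
            "vocab_size": size,
            "model_type": model_type,
        }
        for size in sizes
    ]


def train_all(configs):
    """Train one SentencePiece model per config."""
    # Imported here so the configs can be inspected without sentencepiece installed.
    import sentencepiece as spm

    for cfg in configs:
        # Writes <model_prefix>.model and <model_prefix>.vocab to the working directory.
        spm.SentencePieceTrainer.train(**cfg)
```

Calling `train_all(sweep_configs("train.txt"))` would then produce one model per size. You would still need to train an NMT model on each segmentation and compare scores on the same validation set to pick a size.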

@guillaumekln Ok. With a BPE vocab size of 10k, my NMT model shows a vocab of around 31k for both src and tgt after preprocessing. With a BPE vocab size of 15k, the vocab size from NMT preprocessing goes up to 42k… I can't tell at what point I should stop increasing the BPE vocab size. And what about the corresponding NMT vocab size that results?

I was looking for an answer to this kind of question myself and found this paper from 2020. The authors propose a heuristic for choosing the optimal BPE size for an NMT system and explain why certain vocabulary sizes work better than others.