Size of subwords vocabulary?

ajitesh3 · October 9, 2019, 7:18am

Hi,
I am using SentencePiece to implement subword segmentation in my NMT model. Currently I have around 1 million parallel sentences, and for that I am using a BPE vocab size of 10k. Now I want to experiment with 2 million sentences. What should my BPE vocab size be then? 15k? 20k? And how do we decide that?

guillaumekln · October 9, 2019, 1:40pm

Hi,

The best approach is to run experiments on your dataset with different vocabulary sizes.

In general, I have found that multiples of 8k are pretty common: 8k, 16k, 32k, etc.
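
For illustration, a minimal sketch of such a sweep. The helper names, the corpus path, and the model prefix are hypothetical, and "8k" is assumed here to mean 8000 (some people use 8192). The training call uses the SentencePiece Python package's `SentencePieceTrainer.train` with `model_type="bpe"`; each resulting model would still need to be evaluated by training an NMT system on data segmented with it.

```python
def candidate_vocab_sizes(start=8000, count=3):
    """Return vocabulary sizes that double each step, e.g. 8k, 16k, 32k."""
    return [start * 2 ** i for i in range(count)]


def train_bpe_models(corpus_path, sizes, prefix="bpe"):
    """Train one SentencePiece BPE model per candidate vocabulary size.

    Requires the sentencepiece package (pip install sentencepiece).
    corpus_path and prefix are placeholders for your own files.
    """
    import sentencepiece as spm

    for size in sizes:
        spm.SentencePieceTrainer.train(
            input=corpus_path,
            model_prefix=f"{prefix}_{size}",  # writes bpe_8000.model, bpe_8000.vocab, ...
            vocab_size=size,
            model_type="bpe",
        )


sizes = candidate_vocab_sizes()
print(sizes)
# To actually train (assumes a plain-text file named train.txt exists):
# train_bpe_models("train.txt", sizes)
```

Comparing validation BLEU (or whichever metric you use) across the resulting systems is what tells you which size works best for your data.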

ajitesh3 · October 9, 2019, 2:14pm

@guillaumekln OK. With a BPE vocab size of 10k, my NMT preprocessing reports a vocab of around 31k for both src and tgt. However, with a BPE vocab size of 15k, the vocab size from NMT preprocessing goes up to 42k… I can't decide at what point I should stop increasing the BPE vocab size. Should I go by the corresponding NMT vocab size that is obtained?
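
One thing that may be worth checking here: text encoded with a single SentencePiece model should contain roughly no more distinct pieces than that model's vocab size, so an NMT vocab of 31k from a 10k BPE model could mean the preprocessing step is counting something other than the segmented pieces. A minimal sketch for counting distinct tokens in already segmented text follows; the function name and the sample lines are made up for illustration, and it assumes the pieces are separated by whitespace.

```python
from collections import Counter


def count_token_types(lines):
    """Count distinct whitespace-separated tokens in already segmented text."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


# Made-up sample of segmented text; in practice, pass the lines of the
# segmented training file instead (for example, an open file object).
sample = ["▁the ▁new est ▁model", "▁the ▁old est ▁model"]
counts = count_token_types(sample)
print(len(counts))  # number of distinct tokens in the sample
```

Running this on the segmented src and tgt files separately, and comparing the counts with the BPE vocab size and with what the NMT preprocessing reports, should show where the extra entries come from.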

ccll · July 7, 2021, 3:36pm

I was searching for an answer to this kind of question myself and found this paper from 2020. The authors propose a heuristic for choosing the optimal BPE size for an NMT system and explain why certain vocabulary sizes work better than others.