Vocab size | sentencepiece VS preprocessor

Q1. So if I’m using SentencePiece and telling it to use a vocab size of 32K, should I also set the vocab size in the onmt pre-processor to be the same value?
Q2. I don’t see this in the saved demos/examples.
Q3. What vocab size do you guys use with SentencePiece? e.g. en & de

@guillaumekln ?
@francoishernandez ?

To Q1 and Q2: You shouldn’t need to set the vocab size again in the preprocessing step once it is already defined by the initial, calculated vocab file.
To Q3: A vocab size of 32K is a good choice for one language pair and is used in recent research on English-German translation systems. If you increase your model size, you should also increase your vocab size.

But are you referring to SentencePiece followed by onmt preprocessing in particular?
SentencePiece asks you to set a vocab size, and then so does the onmt preprocessor (default = 50,000), which generates the vocab.pt files ready for training.
Should these vocab settings match?

@guillaumekln @francoishernandez @Bachstelze

I’m still not sure what to do regarding the vocab size in SentencePiece and in onmt-py preprocessing.
I need to run both, of course.
Do I set them to the same value, or not?
What does the onmt-py preprocessor do vocab-wise?
@guillaumekln
@francoishernandez
@Bachstelze

Preprocessing builds a frequency counter of all the tokens encountered in the data.
The vocab_size options are ‘max’ vocab sizes rather than exact vocab sizes. If there are more unique tokens in the data than the max vocab_size allows, only the vocab_size most frequent ones are retained, and the others are mapped to <unk>.
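To make that concrete, here is a rough sketch of that “max vocab” behaviour (the real preprocessing code is different; the `build_vocab` helper and the toy tokens below are made up purely for illustration):

```python
from collections import Counter

def build_vocab(tokenized_sentences, max_vocab_size):
    """Keep only the max_vocab_size most frequent tokens; the rest map to <unk>."""
    counter = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = [tok for tok, _ in counter.most_common(max_vocab_size)]
    return {tok: i for i, tok in enumerate(["<unk>"] + kept)}

# Toy usage: with max_vocab_size=2, only the two most frequent tokens survive.
sentences = [["▁the", "▁cat", "▁sat"], ["▁the", "▁cat"], ["▁the"]]
vocab = build_vocab(sentences, max_vocab_size=2)
lookup = lambda tok: vocab.get(tok, vocab["<unk>"])
print(vocab)           # {'<unk>': 0, '▁the': 1, '▁cat': 2}
print(lookup("▁sat"))  # 0 -> mapped to <unk>
```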

I am planning to train a Transformer-based NMT model on a corpus of roughly 5 million sentence pairs.
Based on the English-German systems, I am thinking of keeping the SentencePiece BPE vocab size at 32k.
The vocab size in OpenNMT is 50k by default. Should I leave it at 50k, or do I need to increase it, since 5 million is quite a large corpus? I am also planning to train for 300k steps.

Previously, I trained a Transformer model on a 2 million sentence corpus with a 24k SentencePiece BPE vocab size, and trained for 200k steps. It seemed to work well for me.

This is actually the “max” vocab size. If your true vocab is 32k, it will be 32k.
Though, by “BPE vocab size = 32k” you may mean “BPE merge operations = 32k”, which is not exactly the same thing. Your vocab might end up being bigger.

@francoishernandez
Thank you for replying.
I am using a BPE-based model in SentencePiece, where we can define the model vocab size with the vocab_size argument, e.g. vocab_size=32k.
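Something like this minimal sketch, assuming a recent sentencepiece version that accepts keyword arguments (the corpus path and model prefix are placeholders):

```python
import sentencepiece as spm

# Train a BPE model with an explicit vocab size; "corpus.en-de" and
# "spm_ende" are placeholder names for the training text and model prefix.
spm.SentencePieceTrainer.train(
    input="corpus.en-de",
    model_prefix="spm_ende",
    model_type="bpe",
    vocab_size=32000,
)
```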

I wanted to know one thing:
Previously I trained an NMT model on a corpus of 2 million sentence pairs with a SentencePiece vocab size of 24k, trained up to 200k steps. The results were good.
Now I am planning to train on a 5 million sentence corpus for 300k steps with a 32k vocab size. Should that work well? I understand these are experimental choices with no fixed value; I just wanted to get an idea before I start (as it will be a costly experiment, training up to 300k steps).

Indeed, sentencepiece does output an exact vocab size.

It’s indeed very likely that an experiment with 5M data will be better than with 2M, provided that your additional data is relatively clean.
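If you want to double-check, one sanity check is to load the trained model with the SentencePiece Python API and inspect its piece count (the model path below is a placeholder, and this assumes a recent sentencepiece version):

```python
import sentencepiece as spm

# Load the trained model (placeholder path) and confirm the vocab is exactly
# the size requested at training time.
sp = spm.SentencePieceProcessor(model_file="spm_ende.model")
print(sp.get_piece_size())  # expected: 32000
```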

Hi @francoishernandez
I could not understand one thing. Like you said, 32k BPE merge operations means the final vocab may be larger. OK, but when I build the vocab and start NMT training, I am getting the following vocab sizes:

Counters src:61583
[2021-02-08 09:48:27,023 INFO] Counters tgt:38519

Any idea why there is so much difference between the src and tgt vocab numbers when I used the same vocab size during SentencePiece BPE model training?

IIRC sentencepiece does not properly split unknown tokens, and instead keeps them as standalone tokens (which is quite strange). You can probably check this in your output vocab: you would see a lot of rare words with low frequency which are not properly split.
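As a rough sketch of that check (the file name and thresholds are placeholders; it assumes one SentencePiece-tokenized sentence per line):

```python
from collections import Counter

# Count pieces in the SentencePiece-tokenized source training file
# ("train.sp.src" is a placeholder name).
counter = Counter()
with open("train.sp.src", encoding="utf-8") as f:
    for line in f:
        counter.update(line.split())

# Long pieces that occur only once or twice are likely rare words that were
# kept as standalone tokens instead of being split into subwords.
suspects = [(tok, n) for tok, n in counter.items() if n <= 2 and len(tok) > 10]
print(len(suspects), "suspicious rare tokens")
print(sorted(suspects, key=lambda x: x[1])[:20])
```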