OpenNMT Forum

Vocab size | SentencePiece vs. preprocessor

Q1. If I’m using SentencePiece and telling it to use a vocab size of 32K, should I also set the vocab size in the onmt pre-processor to the same value?
Q2. I don’t see this covered in the saved demos/examples.
Q3. What vocab size do you use with SentencePiece, e.g. for en-de?

@guillaumekln ?
@francoishernandez ?

To Q1 and Q2: You shouldn’t need to set the vocab size again in the preprocessing step once it is already defined by the initially calculated vocab file.
To Q3: A vocab size of 32K works well for a single language pair and is what recent research on English-German translation systems uses. If you increase your model size, you should also increase your vocab size.
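For reference, here is a minimal sketch of training a 32K SentencePiece model on a joint English-German corpus with the Python API; the file names (train.en-de.txt, spm_ende) and the model type are placeholders, not something prescribed in this thread:

```python
import sentencepiece as spm

# Train a joint 32K subword model on the concatenated en-de training data.
# Input file, model prefix and model type are illustrative only.
spm.SentencePieceTrainer.Train(
    "--input=train.en-de.txt "
    "--model_prefix=spm_ende "
    "--vocab_size=32000 "
    "--model_type=unigram"
)

# Apply the trained model to a sentence.
sp = spm.SentencePieceProcessor()
sp.Load("spm_ende.model")
print(sp.EncodeAsPieces("The vocab size is fixed at training time."))
```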

But are you referring to SentencePiece followed by onmt preprocessing in particular?
SentencePiece asks you to set a vocab size, and then so does the onmt preprocessor (default = 50,000), which generates the vocab.pt files ready for training.
Should these vocab settings MATCH?

@guillaumekln @francoishernandez @Bachstelze

I’m still not sure what to do regarding vocab size in SentencePiece AND onmt-py preprocessing.
I need to run both of course.
Do I set them to the same value OR WHAT?
What does the onmt-py preprocessor do vocab-wise?
@guillaumekln
@francoishernandez
@Bachstelze

Preprocessing builds a frequency counter of all the tokens encountered in the data.
The vocab_size options are ‘max’ vocab sizes rather than exact vocab sizes. If there are more distinct tokens in the data than the max vocab_size allows, only the vocab_size most frequent ones are retained, and the others are mapped to <unk>.
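In other words, the behaviour is roughly the following (a simplified sketch of the principle, not OpenNMT-py’s actual code):

```python
from collections import Counter

def build_vocab(tokenized_lines, vocab_size=50000):
    """Keep the vocab_size most frequent tokens; everything else maps to <unk>."""
    counter = Counter(tok for line in tokenized_lines for tok in line.split())
    kept = [tok for tok, _ in counter.most_common(vocab_size)]
    return ["<unk>"] + kept

# Example: with vocab_size=2, only the two most frequent tokens survive.
print(build_vocab(["a a a b b c"], vocab_size=2))  # ['<unk>', 'a', 'b']
```

So in practice, if SentencePiece was trained with 32K pieces and the preprocessor’s default of 50K is left untouched, the preprocessor simply keeps all the subwords it encounters.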
