Vocab size questions

paulkp · March 11, 2020, 12:46am

Hi all, including @guillaumekln,

Q1. How do I determine the vocab size after OpenNMT-py preprocessing?
I can see a vocab.pt file.
But how can I list its size or the actual items?

Q2. Using the Perl tokenizer tool, I still get lots of numeric vocab entries. How can I get it to tokenize at the digit level? Are there help files for these tools beyond the brief README?

I know about SentencePiece. I just want a baseline to compare against SentencePiece, but one that at least avoids thousands of numeric vocab entries.

Q3. Or should I use this?

francoishernandez · March 11, 2020, 11:42am

Hi @paulkp

Q1. You can have a look at my reply here. Once you have the vocab, you can access its attributes itos, stoi, and freqs (see the torchtext.vocab.Vocab docs for more details).
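To make the attribute access concrete, here is a minimal sketch. The helper names (load_vocab_file, find_vocab, summarize_vocab) are made up for illustration, and the layout of vocab.pt is an assumption: it differs between OpenNMT-py versions, so find_vocab only duck-types on the itos attribute. The demo uses a small stand-in object rather than a real file.

```python
from collections import Counter
from types import SimpleNamespace


def load_vocab_file(path):
    """Load a vocab.pt file (assumption: it is a pickle readable by torch.load)."""
    import torch  # imported lazily so the rest of the sketch runs without torch
    return torch.load(path)


def find_vocab(obj):
    """Return the first vocab-like object (one with .itos) reachable from obj.

    Tries obj itself, obj.vocab, then obj.base_field.vocab. The last form is
    an assumption about how some OpenNMT-py versions wrap their fields.
    """
    base_field = getattr(obj, "base_field", None)
    candidates = (obj, getattr(obj, "vocab", None), getattr(base_field, "vocab", None))
    for candidate in candidates:
        if candidate is not None and hasattr(candidate, "itos"):
            return candidate
    raise ValueError("no vocab-like object found")


def summarize_vocab(vocab, top_n=5):
    """Report the size, the most frequent tokens, and the all-digit entries."""
    return {
        "size": len(vocab.itos),  # itos: index -> token, includes specials
        "most_frequent": vocab.freqs.most_common(top_n),  # freqs: a Counter
        "numeric_entries": sum(tok.isdigit() for tok in vocab.itos),
    }


# Stand-in for a real vocab, exposing the same three attributes.
itos = ["<unk>", "<pad>", "the", "1000", "2020"]
demo_vocab = SimpleNamespace(
    itos=itos,
    stoi={tok: i for i, tok in enumerate(itos)},
    freqs=Counter({"the": 3, "1000": 1, "2020": 1}),
)
field = SimpleNamespace(vocab=demo_vocab)  # a field-like wrapper

summary = summarize_vocab(find_vocab(field), top_n=1)
print(summary)
```

With a real file, the call would look something like summarize_vocab(find_vocab(load_vocab_file("data.vocab.pt")["src"])), assuming the loaded object is a dict keyed by "src" and "tgt"; the file name and keys are placeholders, so check what your version actually saves.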

Q2/3. I think there is a flag to segment numbers by digit in OpenNMT/Tokenizer (segment_numbers, if I recall correctly), which should probably solve your issue.