Hi all, including @guillaumekln,
Q1. How do I determine the vocab size after OpenNMT-py preprocessing?
I can see a vocab.pt file.
But how can I list its size or the actual entries?
Q2. Using the Perl tokenizer tool, I still get lots of numeric vocab entries. How can I get it to tokenize at the digit level? Are there help files for these tools beyond the brief README?
I know about SentencePiece. I just want to compare against SentencePiece, but at least stop getting thousands of numeric vocab entries.
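To be concrete about what I mean by digit-level tokenization, here is a rough sketch in plain Python. It is not part of the OpenNMT tools; `split_digits` is just a hypothetical helper name, and I am assuming it would run over the text as a separate pass before preprocessing.

```python
import re

# Matches a single digit character; each match gets isolated as its own token.
_DIGIT = re.compile(r"(\d)")

def split_digits(line: str) -> str:
    """Put spaces around every digit so numbers become sequences of single digits."""
    spaced = _DIGIT.sub(r" \1 ", line)
    # Collapse the extra whitespace introduced by the substitution.
    return " ".join(spaced.split())

if __name__ == "__main__":
    print(split_digits("Invoice 2019 total 1500"))
```

With something like this, the numeric part of the vocab should be limited to the individual digit characters rather than one entry per distinct number.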
Q3. Or should I use this?