As a result, spm_vocab.[de|en] does not contain a vocabulary but the encoded content of de-en.[de|en], i.e. the parameter --generate_vocabulary does not work as expected.
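For reference, the invocation in question looks roughly like this (a sketch; the model file name spm.model is an assumption, the data and vocab file names are the ones above):

```sh
# Sketch of the procedure described above. The model name "spm.model" is an
# assumption; the data/vocab file names are the ones mentioned in this post.
# With --generate_vocabulary, spm_encode should write a subword vocabulary
# (one piece and its frequency per line) instead of the encoded text.
spm_encode --model=spm.model --generate_vocabulary < de-en.de > spm_vocab.de
spm_encode --model=spm.model --generate_vocabulary < de-en.en > spm_vocab.en
```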
Apparently, the SentencePiece v0.1.92 release is not tagged “Verified” yet. I checked the same procedure with v0.1.91 and it produced the vocab files as expected.
I don’t think this should be a problem for the Tokenizer’s subword handling, but I found it worth noting.
It’s probably a small bug in spm_encode and should not be an issue for the SentencePiece integration in the Tokenizer. Maybe you could open an issue on the SentencePiece repository?
Note that “Verified” means the following for a GitHub release:
> This commit was created on GitHub.com and signed with a verified signature using GitHub’s key.
It does not mean the release has been tested or checked.
Hi Guillaume,
thanks for clarifying the GitHub “Verified” status - a classic misunderstanding on my part.
Yes, I can try to open an issue on the SentencePiece repository. I just need to check which prerequisites are necessary (membership, etc.).
Best, Martin