Tokenizer v1.20.0 with SentencePiece v0.1.92 potentially problematic?

baume · October 3, 2020, 10:28am

Hi,
While playing around with SentencePiece v0.1.92, I found some rather strange behavior with vocabulary restriction. As an example:

wget --trust-server-names http://data.statmt.org/wmt16/translation-task/training-parallel-nc-v11.tgz
tar xfz training-parallel-nc-v11.tgz
cd training-parallel-nc-v11/
mv news-commentary-v11.de-en.de de-en.de
mv news-commentary-v11.de-en.en de-en.en
cat de-en.en de-en.de | shuf > de-en
spm_train --input=de-en --model_prefix=spm --vocab_size=32000 --model_type=unigram --character_coverage=1.0
spm_encode --model=spm.model --generate_vocabulary < de-en.de > spm_vocab.de
spm_encode --model=spm.model --generate_vocabulary < de-en.en > spm_vocab.en

As a result, spm_vocab.[de|en] does not contain a vocabulary but the encoded content of de-en.[de|en], i.e. the parameter --generate_vocabulary does not work as expected.
Admittedly, SentencePiece v0.1.92 is not tagged “Verified” yet. I ran the same procedure with v0.1.91 and it produced the vocab files.
I don’t think this should be a problem for the Tokenizer’s subword handling, but I thought it was worth noting.
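
A quick way to tell the two outcomes apart is sketched below. It assumes a proper vocabulary file has one piece per line followed by a tab and an integer frequency; that layout is an assumption here, not something stated in this thread. The function name looks_like_vocab is a hypothetical helper, not part of SentencePiece.

```shell
# Hypothetical helper: succeeds only if the file is non-empty and every line
# has the form "<piece><TAB><integer>". The two-column layout is an assumption
# about what --generate_vocabulary writes when it works.
looks_like_vocab() {
  awk -F'\t' '
    NF != 2 || $2 !~ /^[0-9]+$/ { bad = 1; exit }   # stop at the first odd line
    END { exit (bad || NR == 0) }                   # empty files also fail
  ' "$1"
}

# Check the files produced by the commands above, if they exist.
for f in spm_vocab.de spm_vocab.en; do
  [ -f "$f" ] || continue
  if looks_like_vocab "$f"; then
    echo "$f: looks like a vocabulary"
  else
    echo "$f: looks like encoded text"
  fi
done
```

With the behavior described above, v0.1.92 output should be reported as encoded text, while v0.1.91 output should pass the check if the assumed layout holds.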

Kind regards,
Martin

guillaumekln · October 3, 2020, 11:21am

Hi,

Yes, I found that as well when looking at the Tokenizer (sp_model, vocabulary_threshold), which gave unexpected results.

It’s probably a small bug in spm_encode and should not be an issue for the SentencePiece integration in the Tokenizer. Maybe you could open an issue on the SentencePiece repository?

Note that “Verified” means the following for a GitHub release:

This commit was created on GitHub.com and signed with a verified signature using GitHub’s key.

It does not mean the release has been tested or checked.

baume · October 3, 2020, 11:40am

Hi Guillaume,
Thanks for clarifying the GitHub “Verified” status. That was a classic misunderstanding on my part.
Yes, I could try to open an issue on the SentencePiece repository. I just need to check what the prerequisites are (membership, etc.).
Best, Martin

baume · October 3, 2020, 1:15pm

A GitHub issue already exists for this subject:
–> Vocabulary generation no longer works? #531

guillaumekln · October 3, 2020, 2:11pm

Nice. I’m proposing a fix here:

baume · October 3, 2020, 5:52pm

A tiny fix with a striking impact. Perfect.