SentencePiece: how is the subword vocabulary trained on raw Unicode text without pre-tokenization?

Hi. Can anyone here explain how SentencePiece is able to use Unigram LM/BPE to generate a vocabulary of subwords? As far as I understand, both require access to an initial vocabulary of words, usually obtained by pre-tokenizing on “whitespace”. But from the papers I’ve read, SentencePiece takes raw input strings as input and avoids the word-segmentation issues that arise because many languages do not use “whitespace” as a word separator.
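For context, this is roughly how I’m invoking it, training directly on raw sentences with no pre-tokenization step of my own (the file name and parameters here are just my own test setup, not anything the library requires):

```python
import sentencepiece as spm

# Train directly on raw sentences, one per line; no word segmentation is
# done by me beforehand. "corpus.txt" and the parameter values are just
# placeholders for this example.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_model",   # writes spm_model.model / spm_model.vocab
    vocab_size=8000,
    model_type="unigram",       # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="spm_model.model")
print(sp.encode("Hello world, no pre-tokenization needed.", out_type=str))
```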

I don’t understand how you can use Unigram LM/BPE without this pre-tokenization. As far as I understand, BPE requires word frequencies to perform its character merges, and Unigram needs an initial vocabulary of candidate subwords that it can iteratively trim down. Has SentencePiece implemented a custom version of Unigram/BPE that operates directly on a sequence of Unicode characters? I find very little information about how this is actually implemented, beyond “SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences.”
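To make the question concrete, here is the toy version of what I have in mind (which may well be wrong, and is certainly not SentencePiece’s actual implementation): if you drop word-level pre-tokenization and just treat each raw sentence as one character sequence, with whitespace replaced by a meta symbol like ▁, the BPE pair counting still seems to go through, since it only needs counts of adjacent symbols rather than word frequencies:

```python
from collections import Counter

# Toy corpus of raw sentences; spaces are mapped to the meta symbol "▁"
# so that whitespace is just another character that can be merged.
corpus = ["the quick fox", "the lazy dog"]
sentences = [list(s.replace(" ", "▁")) for s in corpus]

def count_pairs(sentences):
    """Count adjacent symbol pairs across all raw sentences."""
    pairs = Counter()
    for symbols in sentences:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs

def merge(sentences, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for symbols in sentences:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

for step in range(5):
    pairs = count_pairs(sentences)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    sentences = merge(sentences, best)
    print(step, best, pairs[best])
```

Is something along these lines what “direct training from raw sentences” means, or does the library do something more involved?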

The sentence you quoted includes a link to the SentencePiece paper. Isn’t that helpful?

Thank you for pointing it out; I found some information there that I’d missed:
“There are several ways to prepare the seed vocabulary. The natural choice is to use the union of all characters and the most frequent substrings in the corpus”

Not sure if this is what SentencePiece makes use of though.
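If it is, I imagine a brute-force toy version of that seed-vocabulary step would look something like the sketch below (the `max_len` and `seed_size` values and the tiny corpus are just illustrative; a real implementation would presumably enumerate frequent substrings efficiently, e.g. with a suffix array, rather than counting them all like this):

```python
from collections import Counter

corpus = ["the quick fox", "the lazy dog"]
sentences = [s.replace(" ", "▁") for s in corpus]

max_len = 6       # hypothetical cap on substring length
seed_size = 50    # hypothetical seed vocabulary size

# Union of all single characters, always kept so any input stays encodable.
chars = {c for s in sentences for c in s}

# Count all substrings of length 2..max_len (brute force; fine for a toy corpus).
substrings = Counter()
for s in sentences:
    for i in range(len(s)):
        for j in range(i + 2, min(i + max_len, len(s)) + 1):
            substrings[s[i:j]] += 1

# Seed vocabulary = all characters + the most frequent substrings.
seed = set(chars)
for sub, _ in substrings.most_common():
    if len(seed) >= seed_size:
        break
    seed.add(sub)

print(sorted(seed, key=len))
```

As I understand the paper, the Unigram model then iteratively prunes this over-complete seed vocabulary down to the target size, rather than building it up the way BPE does.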