Hi. Can anyone here please explain to me how SentencePiece is capable of utilizing Unigram LM/BPE to generate a vocabulary of subwords? As far as I understand, both require access to an initial vocabulary consisting of words, usually acquired through pre-tokenization at the whitespace level. But from the papers I've read, SentencePiece takes raw input strings and thereby avoids the word-segmentation issues that arise because many languages do not use whitespace as a word separator.
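For reference, this is the kind of training call I mean (a minimal sketch using the Python bindings; `corpus.txt` stands for a hypothetical file of raw, untokenized sentences, one per line):

```python
import sentencepiece as spm

# Train directly on raw sentences -- no whitespace pre-tokenization
# step on my side. "corpus.txt" is a made-up filename.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="m",
    vocab_size=8000,
    model_type="unigram",  # or "bpe"
)

sp = spm.SentencePieceProcessor(model_file="m.model")
print(sp.encode("Hello world.", out_type=str))
# e.g. ['▁Hello', '▁world', '.'] -- whitespace is kept as the ▁ meta-symbol
```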
I don't understand how you can use Unigram LM/BPE without this pre-tokenization. As far as I understand, BPE requires word frequencies to perform character merges (see the toy sketch below), and Unigram needs an initial seed vocabulary that it can iteratively trim down into shorter units. Has SentencePiece implemented a custom version of Unigram/BPE that deals with these directly on a sequence of Unicode characters? I find very little information about how this is actually implemented, beyond "SentencePiece implements subword units (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]) with the extension of direct training from raw sentences."
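To make concrete what I mean by "BPE requires word frequencies": this is roughly the word-level algorithm from the Sennrich et al. paper, as I understand it, where the `vocab` frequency table is exactly the product of whitespace pre-tokenization (the toy word counts are the paper's example):

```python
import collections
import re

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Apply one merge: join the pair wherever it occurs as adjacent symbols."""
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Word frequencies from whitespace pre-tokenization -- the step I don't
# see how to perform on raw text in languages without whitespace.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):
    pairs = get_stats(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(best)  # the merge learned at this step, e.g. ('e', 's')
```

Without that frequency table, I don't see what the merge counts would be computed over, which is the core of my question.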