input_sentence_size parameter in spm.SentencePieceTrainer.Train

witchinghour · May 31, 2023, 1:52pm

Please help me understand the “input_sentence_size” parameter: what exactly is it used for, how should it be set, and why? How does it affect tokenization?

guillaumekln · May 31, 2023, 2:01pm

github.com/google/sentencepiece

[Q] Would you elaborate the differences of 4 corpus size parameters?

opened 08:00AM - 02 Nov 17 UTC

closed 03:45PM - 17 Dec 17 UTC

deasuke

I have 50M sentences corpus to train. I'd like to know the difference of following parameters. I set 50M, 10M, 5M, 50M respectively (x5 of default) and got crashed like issue#4 -- CHECK(!pieces.empty()) failed on serialize. The vocab size I set was 32768.

```
--input_sentence_size (maximum size of sentences the trainer loads) type: int32 default: 10000000
--mining_sentence_size (maximum size of sentences to make seed sentence piece) type: int32 default: 2000000
--seed_sentencepiece_size (the size of seed sentencepieces) type: int32 default: 1000000
--training_sentence_size (maximum size of sentences to train sentence pieces) type: int32 default: 10000000
```
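
To make the flag description above ("maximum size of sentences the trainer loads") concrete, here is a rough sketch in plain Python of what capping the loaded corpus could look like. It is not SentencePiece's actual code: `load_sentences` and its `shuffle` and `seed` arguments are hypothetical names for illustration, and treating `0` as "no cap" is an assumption of this sketch. The shuffled branch uses reservoir sampling, which is one common way to draw a uniform sample from a stream without holding all of it in memory.

```python
import random
from typing import Iterable, List, Optional


def load_sentences(
    lines: Iterable[str],
    input_sentence_size: int = 0,
    shuffle: bool = True,
    seed: Optional[int] = None,
) -> List[str]:
    """Hypothetical stand-in for the trainer's loading step (illustration only)."""
    # Assumption of this sketch: a non-positive cap means "load everything".
    if input_sentence_size <= 0:
        return list(lines)

    if not shuffle:
        # Without shuffling, keep only the first N sentences of the corpus.
        kept: List[str] = []
        for line in lines:
            if len(kept) >= input_sentence_size:
                break
            kept.append(line)
        return kept

    # With shuffling, draw a uniform random sample of N sentences
    # (reservoir sampling, Algorithm R) in a single pass over the corpus.
    rng = random.Random(seed)
    sample: List[str] = []
    for i, line in enumerate(lines):
        if i < input_sentence_size:
            sample.append(line)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < input_sentence_size:
                sample[j] = line
    return sample
```

Under this reading, the parameter does not change how a trained model tokenizes text directly. It only limits how many sentences the vocabulary is learned from, so a cap that is too small (or an unshuffled cap on a sorted corpus) may give a vocabulary that represents the full corpus poorly, while a larger cap costs more memory and training time.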