Using unigram model with sentencepiece - vocabulary size

kargintima · December 17, 2019, 3:55pm

I have prepared a corpus with 29k source words and 71k target words.
So I am trying to use SentencePiece to build a more compact vocabulary.
I have heard that the vocabulary size should be about 24-32k elements.
But even when I try to use 3k elements, almost none of the words get split.
This is how I started my experiments:

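import pyonmttok
import sentencepiece as spm

# Attempt 1: train with SentencePiece directly (unigram is its default model type).
spm.SentencePieceTrainer.Train('--input=data/raw/data.en --model_prefix=data/en --vocab_size=2000')
spm.SentencePieceTrainer.Train('--input=data/raw/data.ru --model_prefix=data/ru --vocab_size=2000')

# Attempt 2: train through the OpenNMT Tokenizer.
# Note: this tokenizer is never passed to the learner and is overwritten by learner.learn() below.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True, segment_numbers=True)

# English: learn the model, then tokenize the training set with it.
learner = pyonmttok.SentencePieceLearner(vocab_size=2000, character_coverage=0.98)
learner.ingest_file("data/raw/data.en")
tokenizer = learner.learn("data/en.model")
tokenizer.tokenize_file("data/raw/train.en", "data/raw/train.tok.en", num_threads=4)

# Russian: a fresh learner per language keeps the two sets of training data separate.
learner = pyonmttok.SentencePieceLearner(vocab_size=2000, character_coverage=0.98)
learner.ingest_file("data/raw/data.ru")
tokenizer = learner.learn("data/ru.model")
tokenizer.tokenize_file("data/raw/train.ru", "data/raw/train.tok.ru", num_threads=4)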

Sample of tokenized train set:

▁at ▁the ▁same ▁time ▁we ▁must ▁ensure ▁that ▁low income ▁countries ▁use ▁debt ▁relief ▁and ▁aid ▁efficient ly
▁it ▁is ▁always ▁ris k y ▁to ▁write ▁about ▁e x change ▁rates
▁but ▁if ▁they ▁lea v e ▁it ▁will ▁also ▁ es c al ate ▁only ▁fa st er
▁his ▁cha ri s ma ▁did ▁not ▁predict ▁defeat ▁the ▁change ▁in ▁follow ers ▁needs ▁did
▁what ▁does ▁the ▁approach ▁of ▁a ▁single ▁regulator ▁ im p ly ▁for ▁inno v ation ▁and ▁new ▁idea s
▁me l bo ur n e ▁did ▁you ▁ma k e ▁any ▁new ▁year ▁s ▁resolution s
▁is ol ation ism ▁is ▁a ▁fa m ili ar ▁re f ra in ▁in ▁us ▁foreign ▁policy ▁among ▁those ▁element s ▁of ▁the ▁right ▁that ▁consider ▁the ▁us ▁too ▁good ▁for ▁the ▁world ▁as ▁well ▁as ▁among ▁those ▁on ▁the ▁left ▁who ▁consider ▁america ▁a ▁de st ru c ti v e ▁global ▁force
▁so ▁to ▁build ▁a ▁credible ▁nuclear ▁ ar s en al ▁iran ▁would ▁need ▁a ▁decade ▁or ▁longer
▁we ▁must ▁face ▁the ▁fact s ▁our ▁emissions ▁of ▁gree n h ous e ▁ga s es ▁probab ly ▁are ▁at ▁least ▁part ly ▁to ▁blame ▁for ▁this ▁s um m er ▁of ▁e x treme s
▁the ▁c um ul ati v e ▁result ▁of ▁all ▁these ▁national ▁ob j ection s ▁is ▁that ▁the ▁ n ice ▁summit ▁is ▁li k ely ▁to ▁see ▁only ▁a ▁modest ▁increase ▁in ▁the ▁potential ▁for ▁ma j ority ▁ v ot ing ▁much ▁small er ▁than ▁enlargement ▁re q uires

Looks fine. But how could it work with a 32k vocabulary?

kargintima · December 19, 2019, 8:54am

Or maybe I should use some pretrained model for Unigram/BPE?
Here for example.

Bachstelze · December 19, 2019, 11:10am

Yes, you can use word embeddings in TensorFlow and PyTorch, especially in your low-resource setting with a 100k-word corpus.

Looks fine. But how could it work with a 32k vocabulary?

You can increase the vocabulary size in the preprocessing step; you then get longer subwords or complete words as vocabulary entries.
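
To illustrate how the size of the subword inventory changes the segmentation, here is a minimal pure-Python sketch. It is a toy BPE-style merge learner, not the unigram algorithm that SentencePiece uses by default, and the helper names (merge_pair, learn_merges, segment) and the tiny corpus are made up for the example. The number of merges stands in for the vocabulary size: more merges means more and longer vocabulary entries.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` BPE-style merges from a list of words (toy version)."""
    # Every word starts as single characters; "▁" marks the start of a word.
    vocab = Counter(tuple("▁" + w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple("▁" + word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical mini-corpus, only for illustration.
corpus = "the change in exchange rates will change the rates of exchange".split()

for n in (0, 5, 50):
    merges = learn_merges(corpus, n)
    print(n, "merges:", " ".join(segment("exchange", merges)))
```

With no merges the word is split into single characters, and once enough merges are allowed it comes out as one piece. The same trend applies to the vocab_size setting in the real tools: a small value gives many short pieces, a large value gives longer subwords or whole words.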