I have prepared a corpus with 29k unique source words and 71k unique target words.
So I am trying to use SentencePiece to build a more compact vocabulary.
I have heard that the vocabulary size should be around 24-32k elements.
But even when I try as few as ~3k elements, almost none of the words get split.
This is how I started my experiments:
spm.SentencePieceTrainer.Train('--input=data/raw/data.en --model_prefix=data/en --vocab_size=2000')
spm.SentencePieceTrainer.Train('--input=data/raw/data.ru --model_prefix=data/ru --vocab_size=2000')
# Note: a standalone pyonmttok.Tokenizer("aggressive", ...) instance was unused here,
# since learner.learn() returns the tokenizer that is actually applied.
learner_en = pyonmttok.SentencePieceLearner(vocab_size=2000, character_coverage=0.98)
learner_en.ingest_file("data/raw/data.en")
tokenizer_en = learner_en.learn("data/en.model")
tokenizer_en.tokenize_file("data/raw/train.en", "data/raw/train.tok.en", num_threads=4)

# A fresh learner for Russian, so the English data ingested above does not leak into the Russian model.
learner_ru = pyonmttok.SentencePieceLearner(vocab_size=2000, character_coverage=0.98)
learner_ru.ingest_file("data/raw/data.ru")
tokenizer_ru = learner_ru.learn("data/ru.model")
tokenizer_ru.tokenize_file("data/raw/train.ru", "data/raw/train.tok.ru", num_threads=4)
Sample of tokenized train set:
▁at ▁the ▁same ▁time ▁we ▁must ▁ensure ▁that ▁low income ▁countries ▁use ▁debt ▁relief ▁and ▁aid ▁efficient ly
▁it ▁is ▁always ▁ris k y ▁to ▁write ▁about ▁e x change ▁rates
▁but ▁if ▁they ▁lea v e ▁it ▁will ▁also ▁ es c al ate ▁only ▁fa st er
▁his ▁cha ri s ma ▁did ▁not ▁predict ▁defeat ▁the ▁change ▁in ▁follow ers ▁needs ▁did
▁what ▁does ▁the ▁approach ▁of ▁a ▁single ▁regulator ▁ im p ly ▁for ▁inno v ation ▁and ▁new ▁idea s
▁me l bo ur n e ▁did ▁you ▁ma k e ▁any ▁new ▁year ▁s ▁resolution s
▁is ol ation ism ▁is ▁a ▁fa m ili ar ▁re f ra in ▁in ▁us ▁foreign ▁policy ▁among ▁those ▁element s ▁of ▁the ▁right ▁that ▁consider ▁the ▁us ▁too ▁good ▁for ▁the ▁world ▁as ▁well ▁as ▁among ▁those ▁on ▁the ▁left ▁who ▁consider ▁america ▁a ▁de st ru c ti v e ▁global ▁force
▁so ▁to ▁build ▁a ▁credible ▁nuclear ▁ ar s en al ▁iran ▁would ▁need ▁a ▁decade ▁or ▁longer
▁we ▁must ▁face ▁the ▁fact s ▁our ▁emissions ▁of ▁gree n h ous e ▁ga s es ▁probab ly ▁are ▁at ▁least ▁part ly ▁to ▁blame ▁for ▁this ▁s um m er ▁of ▁e x treme s
▁the ▁c um ul ati v e ▁result ▁of ▁all ▁these ▁national ▁ob j ection s ▁is ▁that ▁the ▁ n ice ▁summit ▁is ▁li k ely ▁to ▁see ▁only ▁a ▁modest ▁increase ▁in ▁the ▁potential ▁for ▁ma j ority ▁ v ot ing ▁much ▁small er ▁than ▁enlargement ▁re q uires
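To put a number on how much splitting actually happens in output like the above, the "▁" markers can be counted to get an average pieces-per-word ratio (the helper name is mine, not from any library):

```python
def pieces_per_word(tokenized_line: str) -> float:
    """Average number of subword pieces per word in a SentencePiece-style
    tokenized line, where "▁" marks the start of each original word."""
    pieces = tokenized_line.split()
    words = sum(1 for p in pieces if p.startswith("▁"))
    return len(pieces) / max(words, 1)

# One of the sample lines above: 13 pieces over 9 words.
line = "▁it ▁is ▁always ▁ris k y ▁to ▁write ▁about ▁e x change ▁rates"
print(round(pieces_per_word(line), 2))  # 1.44
```

A ratio close to 1.0 means most words are kept whole, which matches what the frequent words in the samples show.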
This looks fine, but how could it work with a 32k vocabulary?