BPE, vocab size

For an English-Chinese system, after applying BPE preprocessing to the English source text, what’s the usual size of the source vocab? My source text is 10M sentences, and the source vocab size is 2544. Is that correct?


What version of the code are you using?

Hello, I first used learn_bpe.lua to learn the BPE model and tokenize.lua for the preprocessing, and then used the PyTorch version to generate the vocab.

Could you post the exact commit you are using on the Lua version?

cd OpenNMT
git rev-parse HEAD

1f64569c9b0e5b7b262066228db2984392e37125 ?

Thanks. You should update to the latest commit and re-learn your BPE model.

It’s not the newest? And what’s the difference? Thank you.

There was an issue when loading a BPE model.

Actually, you can just update the code and not re-learn the BPE model if you trained it with this commit 1f64569c9b0e5b7b262066228db2984392e37125 precisely.

Sorry, I still don’t quite understand. Which code do I need to update? tokenize.lua or learn_bpe.lua?

Just update OpenNMT.

cd OpenNMT
git pull

After running ‘git pull’, I see changes in BPE.lua. Is it used by both learn_bpe.lua and tokenize.lua?
What I am still curious about is the usual vocab size for an English corpus after BPE.

The vocab size is generally close to the number of merge rules in BPE: if you set the BPE parameter to 30000, your vocabulary will usually be 30000-32000.
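To make the relation between merge rules and vocab size concrete, here is a minimal sketch of a BPE learner on a toy corpus. It is not the code in learn_bpe.lua; the function names, the `</w>` end-of-word marker and the toy word counts are all assumptions made for illustration. The point it shows is that each merge rule adds at most one new symbol, so the vocab is bounded by the base symbols (single characters) plus the number of merges, which is why the vocab ends up close to, and possibly a bit above, the BPE parameter.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} dict."""
    # Each word starts as its characters plus a hypothetical end-of-word marker.
    segmented = {tuple(w) + ("</w>",): c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in segmented.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties broken by the pair itself to stay deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        segmented = {merge_pair(s, best): c for s, c in segmented.items()}
    return merges, segmented


def vocab_of(segmented):
    """Set of distinct symbols used by the segmented words."""
    return {sym for symbols in segmented for sym in symbols}


# Hypothetical toy corpus: word -> frequency.
toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
base_symbols = vocab_of({tuple(w) + ("</w>",): c for w, c in toy.items()})
merges, segmented = learn_bpe(toy, num_merges=10)
vocab = vocab_of(segmented)
print("base symbols:", len(base_symbols))
print("merges learned:", len(merges))
print("vocab size:", len(vocab))
```

On a real corpus the base symbol set (letters, digits, punctuation, rare characters) is what pushes the vocab somewhat above the number of merges.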

Thank you, Jean.
I tried 100000 and 300000 as the BPE parameter and the vocabulary is the same. I’m confused.

100000 and 300000 are far too big; the usual values are between 16K and 30K. With such a parameter, you effectively disable the effect of BPE.
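One possible way to see why two very large values can give the same result: a BPE learner that only merges inside words runs out of pairs once every word type has collapsed into a single symbol. The sketch below computes a rough upper bound on the number of merges that can ever be performed. It assumes merges never cross word boundaries and that a word of n characters needs at most n merges (counting an end-of-word marker); the function name and the two-sentence corpus are hypothetical. Real implementations may stop even earlier, for example if a frequency threshold applies.

```python
from collections import Counter


def max_useful_merges(sentences):
    """Return (upper bound on merges, number of distinct word types)."""
    word_types = Counter(w for s in sentences for w in s.split())
    # A word of n characters plus an end-of-word marker has n + 1 symbols,
    # so at most n merges are needed to turn it into a single symbol.
    bound = sum(len(w) for w in word_types)
    return bound, len(word_types)


# Hypothetical tiny corpus, only for illustration.
corpus = ["the lower house", "the newest house"]
bound, n_types = max_useful_merges(corpus)

for requested in (100_000, 300_000):
    effective = min(requested, bound)
    print(f"requested {requested} merges, at most {effective} can be applied")
print("word types (size of a word-level vocab):", n_types)
```

Once the requested value exceeds this bound, asking for more merges changes nothing, and the resulting vocab approaches the plain word-level vocab.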

Well, the Sogou 2017 article for WMT says: ‘we used bpe segmentation to process both source and target data. 300K subword system symbols are used for the source side’. What does 300K mean? Thanks.

I cannot tell for the Chinese side because it depends on their word segmentation, but for the English, it is meaningless.

OK, thank you for your advice. For the English side, they also use a large number like 150K, so I’m confused.

I’m confused too :) I just read their paper, and they give no clue as to why they chose such parameters or what difference it made compared to a configuration without BPE. Maybe the best is to ask them directly…

Well, thanks again for your time.

I have one question regarding the size of the vocabulary. I’m using a BPE tokenization model trained with 32k operations. When processing the corpus, it produces more distinct tokens than BPE operations (say 35.5k). Is this OK? Shouldn’t we have a closed vocabulary of 32k tokens?