BPE, vocab size

(Xiang Zhu0718) #1

For an English-Chinese system, after using BPE preprocessing on the English source text, what's the usual size of the source vocab? My source text has 10M sentences, and the size of the source vocab is 2544. Is that correct?

(Guillaume Klein) #2

What version of the code are you using?

(Xiang Zhu0718) #3

Hello. I first use learn_bpe.lua to learn the BPE model and tokenize.lua for the preprocessing, and then use preprocess.py from the PyTorch version to generate the vocab.

(Guillaume Klein) #4

Could you post the exact commit you are using on the Lua version?

cd OpenNMT
git rev-parse HEAD

(Xiang Zhu0718) #5

1f64569c9b0e5b7b262066228db2984392e37125 ?

(Guillaume Klein) #6

Thanks. You should update to the latest commit and re-learn your BPE model.

(Xiang Zhu0718) #7

It's not the newest? And what's the difference? Thank you.

(Guillaume Klein) #8

There was an issue when loading a BPE model.

Actually, you can just update the code and not re-learn the BPE model if you trained it with this commit 1f64569c9b0e5b7b262066228db2984392e37125 precisely.

(Xiang Zhu0718) #9

Sorry, I still don't quite follow. Which code do you mean I need to update: tokenize.lua or learn_bpe.lua?

(Guillaume Klein) #10

Just update OpenNMT.

cd OpenNMT
git pull

(Xiang Zhu0718) #11

After running 'git pull', I see the changes are in BPE.lua. Is it used by both the learn_bpe model and tokenize.lua?
What I'm curious about is the usual vocab size for an English corpus after BPE.

(jean.senellart) #12

the size of the vocab is generally close to the number of merge rules in BPE: i.e. if you select a 30000 parameter for BPE, your vocabulary will usually be 30000-32000.
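To see why the vocab size tracks the merge count, here is a minimal Sennrich-style BPE learner. This is a toy sketch, not the actual OpenNMT BPE.lua implementation, and the corpus and merge counts are made-up examples. The key property: each merge rule introduces at most one new symbol, so the final vocabulary is bounded by the base character set plus the number of merges, and with 30K merges on a real corpus the ~100-200 base characters are what account for the small gap between the two.

```python
from collections import Counter


def bpe_vocab(word_freqs, num_merges):
    """Minimal Sennrich-style BPE learner (illustration only).

    Repeatedly merges the most frequent adjacent symbol pair
    and returns the resulting subword vocabulary.
    """
    # Start from a character-level segmentation of each word type.
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    for _ in range(num_merges):
        pairs = Counter()
        for syms, f in corpus.items():
            for p in zip(syms, syms[1:]):
                pairs[p] += f
        if not pairs:
            break  # corpus exhausted: nothing left to merge
        a, b = max(pairs, key=pairs.get)
        new_corpus = {}
        for syms, f in corpus.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == (a, b):
                    out.append(a + b)  # apply the learned merge rule
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_corpus[tuple(out)] = f
        corpus = new_corpus
    return {s for syms in corpus for s in syms}


# Made-up toy corpus (word type -> frequency).
words = Counter({"low": 7, "lower": 5, "newest": 6, "widest": 3})
base = bpe_vocab(words, 0)  # with zero merges: just the distinct characters
for k in (1, 2, 3):
    # Each merge adds at most one new symbol to the vocabulary.
    assert len(bpe_vocab(words, k)) <= len(base) + k
```

On a 10M-sentence English corpus, a vocab of 2544 after BPE would therefore suggest a very small merge count (or a problem in the pipeline), since the vocab should sit near the merge parameter.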

(Xiang Zhu0718) #13

Thank you, Jean.
I tried the 100000 and 300000 parameters for BPE and the vocabulary is the same; I'm confused.

(jean.senellart) #14

100000 and 300000 are far too big; the usual values are between 16K and 30K. With such a parameter, you simply disable the BPE effect.
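The "disables BPE" point can be seen concretely with the same kind of toy merge loop (a sketch over made-up data, not the actual learn_bpe.lua code): once the merge budget exceeds what the corpus supports, merging simply stops, every frequent word has been glued back into a single symbol, and segmentation degenerates to plain word level. That is also why 100000 and 300000 gave the same vocabulary: both budgets ran past the point where useful merges exist.

```python
from collections import Counter


def bpe_segments(word_freqs, num_merges):
    """Toy BPE merge loop; returns (vocabulary, merges actually performed)."""
    corpus = {tuple(w): f for w, f in word_freqs.items()}
    done = 0
    for _ in range(num_merges):
        pairs = Counter()
        for syms, f in corpus.items():
            for p in zip(syms, syms[1:]):
                pairs[p] += f
        if not pairs:
            break  # merge budget exceeds what the corpus can use
        a, b = max(pairs, key=pairs.get)
        new_corpus = {}
        for syms, f in corpus.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_corpus[tuple(out)] = f
        corpus = new_corpus
        done += 1
    return {s for syms in corpus for s in syms}, done


# Made-up toy corpus; the budget is absurdly large on purpose.
words = Counter({"low": 7, "lower": 5, "newest": 6, "widest": 3})
vocab, done = bpe_segments(words, 300_000)
# Every word type collapses back into a single whole-word symbol,
# and merging stops long before the budget is spent.
assert vocab == {"low", "lower", "newest", "widest"}
```

So past a corpus-dependent point, raising the merge parameter changes nothing: the vocabulary is just the word vocabulary.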

(Xiang Zhu0718) #15

Well, in Sogou's 2017 WMT paper, the article says 'we used BPE segmentation to process both source and target data. 300K subword system symbols are used for the source side'. What does 300K mean? Thanks.

(jean.senellart) #16

I cannot tell for the Chinese side because it depends on their word segmentation, but for English, it is meaningless.

(Xiang Zhu0718) #17

OK, thank you for your advice. For the English side, it also uses a large number like 150K, so I'm confused.

(jean.senellart) #18

I'm confused also :slight_smile:. I just read their paper, and they don't give any clue why they chose such parameters or what difference it made compared to a configuration without BPE. Maybe the best is to ask them directly…

(Xiang Zhu0718) #19

Well, thanks again for your time.