BPE, vocab size

For an English-Chinese system, after using BPE preprocessing on the English source text, what's the usual size of the source vocab? My source text is 10M sentences and the size of the source vocab is 2544. Is that correct?

Hello,

What version of the code are you using?

Hello, I first use learn_bpe.lua to learn the BPE model and tokenize.lua for the preprocessing, and then use the preprocess.py of the PyTorch version to generate the vocab.
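Roughly like this (file names are just examples and the option names are from memory, so they may not match your version exactly):

# learn a BPE model with 30000 merge operations
th tools/learn_bpe.lua -size 30000 -save_bpe en.bpe30k < train.raw.en
# apply it while tokenizing the English side
th tools/tokenize.lua -bpe_model en.bpe30k < train.raw.en > train.bpe.en
# build the vocabs with the PyTorch preprocess script
python preprocess.py -train_src train.bpe.en -train_tgt train.tok.zh -valid_src valid.bpe.en -valid_tgt valid.tok.zh -save_data data/enzh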

Could you post the exact commit you are using on the Lua version?

cd OpenNMT
git rev-parse HEAD

1f64569c9b0e5b7b262066228db2984392e37125 ?

Thanks. You should update to the latest commit and re-learn your BPE model.

It’s not the newest? And what’s the difference? Thank you.

There was an issue when loading a BPE model.

Actually, you can just update the code and not re-learn the BPE model, provided you trained it with exactly this commit, 1f64569c9b0e5b7b262066228db2984392e37125.

Sorry, I still don’t quite understand. Which code do you mean I need to update? tokenize.lua or learn_bpe.lua?

Just update OpenNMT.

cd OpenNMT
git pull

After the command ‘git pull’, I see changes in BPE.lua. Is it used by learn_bpe.lua and tokenize.lua?
What I am curious about is the usual size of the vocab for an English corpus after BPE.

The size of the vocab is generally close to the number of rules in BPE: i.e. if you select 30000 as the BPE parameter, your vocabulary will usually be 30000-32000.
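You can verify this on your own data by counting the distinct tokens in the BPE-segmented output (assuming your processed English file is called train.bpe.en):

# count distinct token types in the BPE-segmented file
tr ' ' '\n' < train.bpe.en | grep -v '^$' | sort -u | wc -l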

Thank you, Jean.
I tried 100000 and 300000 as the BPE parameter and the vocabulary is the same; I’m confused.

100000 and 300000 are far too big - the usual values are between 16K and 30K. With such a large parameter, you simply disable the BPE effect.
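Roughly speaking, you can compare the parameter against the number of distinct word types in your raw corpus (file name is just an example):

# count distinct whole words in the raw English corpus
tr ' ' '\n' < train.raw.en | grep -v '^$' | sort -u | wc -l

Once the number of merge operations is on that order, most frequent words end up merged back into single tokens and the segmentation barely differs from plain word tokenization, which is why the vocabulary stops changing.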

Well, in the Sogou 2017 WMT submission, the article says ‘we used BPE segmentation to process both source and target data. 300K subword system symbols are used for the source side’. What does 300K mean? Thanks.

I cannot tell for the Chinese side because it depends on their word segmentation, but for the English, it is meaningless.

OK, thank you for your advice. For the English side, they also use a large number like 150K, so I’m confused.

I’m confused also :slight_smile:. I just read their paper here - and they don’t give any clue why they chose such parameters or what difference it made compared to a configuration without BPE. Maybe the best is to ask them directly…

Well, thanks again for your time.

Hi,
I have one question regarding the size of the vocabulary. I’m using a BPE tokenization model trained with 32k operations. When processing the corpus, I end up with more vocabulary entries than BPE operations (say 35.5k). Is this OK? Shouldn’t we have a closed vocabulary of 32k tokens?
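For reference, this is roughly how I count the distinct token types in the BPE output and look at the rarest ones (file name is just an example):

# list the least frequent token types in the BPE-segmented corpus
tr ' ' '\n' < corpus.bpe.en | grep -v '^$' | sort | uniq -c | sort -n | head -40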