BPE options handling in learn_bpe.lua and tokenizer.lua

Hello dear BPE users,

As the recent change in the way BPE-related options are handled in learn_bpe.lua and tokenizer.lua has caused some issues within the OpenNMT community, we decided to create this topic to inform those concerned by this change about the current situation.

There have been 3 versions of OpenNMT’s BPE models:

  • V1; models with header type: “true;true;true;conservative”

    • The original purpose of adding this header was to ensure consistency between learning BPE models and applying them to raw text.
    • Still supported in tokenizer.lua: with these models, the following command-line options are overridden: -bpe_mode, -bpe_case_insensitive. However, the 4th field ‘conservative’ is ignored, so anyone still using these v1 models has to specify the option ‘-mode’ with the appropriate value on tokenize.lua’s command line to make it work with the current master of OpenNMT.
    • Deprecated in learn_bpe.lua because it caused confusion in practice.
  • V2; models with no header

    • Current master version
    • All tokenizer options and BPE options must be given explicitly on the tokenize.lua command line.
    • The tokenizer and BPE options passed to tokenize.lua should be the same as those used for learn_bpe.lua, except for joiner-related options (see the sketch after this list).
  • V3; models with header type: “v3;true;true;true;<w>;</w>” (upcoming version)

    • Only the BPE-related options are stored in the header: -bpe_mode, -bpe_case_insensitive, -bpe_BOT_marker, -bpe_EOT_marker.
    • No need to specify BPE options for tokenize.lua when applying BPE: all options are loaded from the BPE model, and any BPE-related command-line options are overridden.
    • Users should keep the tokenizer options consistent between learn_bpe.lua and tokenizer.lua.
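For a V2 model, a minimal sketch of a consistent workflow could look like the following (the file names, vocabulary size, and aggressive mode are placeholders, and -bpe_model is assumed to be the option that points tokenize.lua at the learned codes; the point is simply that the BPE options given to learn_bpe.lua are repeated explicitly for tokenize.lua):

# pre-tokenize the raw training text (placeholder paths and options)
th tools/tokenize.lua -mode aggressive < data/train.raw > data/train.rawtok
# learn the BPE codes; a V2 model has no header, so remember these options
th tools/learn_bpe.lua -size 30000 -bpe_mode both -save_bpe data/train.bpe30000 < data/train.rawtok
# apply BPE: repeat the same tokenizer options and the same -bpe_mode explicitly
th tools/tokenize.lua -mode aggressive -joiner_annotate -bpe_model data/train.bpe30000 -bpe_mode both < data/train.raw > data/train.tok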

All your suggestions are welcome.


Before making a comment, can you please explain what was wrong with this usage (based on current master):

th tools/tokenize.lua -mode aggressive -nparallel 6 < $f > data/$file.rawtok
cat data/*.rawtok | th tools/learn_bpe.lua -tok_case_feature \
  -size $bpe_size -bpe_mode both -save_bpe data/train-$sl$tl.bpe$bpe_size
th tools/tokenize.lua -case_feat -segment_case -mode aggressive -joiner_annotate -nparallel 6 \
  $bpe_model < $f > data/$file.tok

It was not working (it generated too-small vocabularies with build_vocab.lua).

Am I missing -bpe_mode in the tokenize command line?

EDIT: indeed, this was the option missing from my tokenize command line.
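For reference, a hedged sketch of what the fixed tokenization step could look like, keeping the options from the commands above and adding the missing -bpe_mode (the $bpe_model and file variables are the placeholders from that example, and -case_feature is assumed to be the full spelling of the case-feature option):

th tools/tokenize.lua -case_feature -segment_case -mode aggressive -joiner_annotate -bpe_mode both -nparallel 6 \
  $bpe_model < $f > data/$file.tok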

IMO:
you should forget about v2: it has been online for only 2 weeks, so very few users were impacted.

v3 is unnecessary.
We could stick to the old way, since many models were trained that way, and the new way could be just slightly modified: make your standard markers the default ones, and allow them to be forced by options.

Thanks for your response; you confirm the need for -bpe_mode both.

As a suggestion for your current usage, it would be better to keep the same tokenizer options when preparing the rawtok files (for learn_bpe.lua) and the tok files (in your case, -segment_case is only used for the tok files but not for the rawtok files). That said, you can keep the current usage, as the cases affected by -segment_case may not occur very often in your data.

You can also pass the same tokenizer options directly to learn_bpe.lua to learn BPE models straight from the raw text, as sketched below.
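A minimal sketch of that direct usage, assuming the tokenizer options in learn_bpe.lua take a tok_ prefix as -tok_case_feature does (the mode, size, and output path are placeholders):

# learn BPE directly from raw text; tokenizer options are passed with the tok_ prefix
cat data/*.raw | th tools/learn_bpe.lua -tok_mode aggressive -tok_case_feature -tok_segment_case \
  -size 30000 -bpe_mode both -save_bpe data/train.bpe30000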

Some remarks on v3, i.e. models with header type “v3;true;true;true;<w>;</w>” (upcoming version):

  • We will update the master to v3 for internal reasons and to make the handling of BPE options cleaner
  • The v1 models will still be supported
  • The only instruction: leave what is BPE to BPE, and keep the same tokenizer options everywhere, except for joiner-related options (see the sketch below)
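To illustrate the apply step with a v3 model, a hedged sketch (the model path and tokenizer options are placeholders; -bpe_model is assumed to be the option pointing at the learned codes):

# the BPE options (-bpe_mode, -bpe_case_insensitive, -bpe_BOT_marker, -bpe_EOT_marker)
# are read from the "v3;..." header of the model, so only tokenizer options appear here
th tools/tokenize.lua -mode aggressive -joiner_annotate -bpe_model data/train.v3.bpe < data/train.raw > data/train.tok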