Hello dear BPE users,
Since the recent change in how BPE-related options are handled in learn_bpe.lua and tokenizer.lua has caused some issues within the OpenNMT community, we have decided to create this topic to inform those concerned by this change about the current situation.
There have been 3 versions of OpenNMT’s BPE models:
V1: models with the header “true;true;true;conservative”
- The original purpose of this header was to ensure consistency between learning a BPE model and applying it to raw text.
- Still supported in tokenizer.lua: with these models, the following command-line options are overridden: -bpe_mode and -bpe_case_insensitive. However, the 4th field, ‘conservative’, is ignored, so anyone still using a v1 model has to specify ‘-mode’ with the appropriate value on tokenize.lua's command line to make it work with the current master of OpenNMT.
- Deprecated in learn_bpe.lua because it caused confusion in practice.
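To make the v1 behaviour concrete, here is a minimal sketch of how such a header could be split up. It is written in Python purely for illustration (the actual implementation is in Lua), and the function name and field roles are assumptions; the key point it shows is that the 4th field is present in the file but ignored by the tokenizer:

```python
def parse_v1_header(header):
    """Split a v1 BPE model header like 'true;true;true;conservative'.

    The tokenizer honours the BPE-related fields but ignores the 4th
    one (the tokenization mode), which is why '-mode' must still be
    passed explicitly on tokenize.lua's command line.
    """
    fields = header.strip().split(";")
    if len(fields) != 4:
        raise ValueError("not a v1 header: %r" % header)
    honoured, ignored = fields[:3], fields[3]
    return honoured, ignored
```

For example, parsing “true;true;true;conservative” yields three honoured fields and the ignored mode field “conservative”.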
V2: models with no header
- The format produced by the current master version.
- All tokenizer options and BPE options must be given explicitly on the tokenize.lua command line.
- The tokenizer and BPE options passed to tokenize.lua should be the same as those used for learn_bpe.lua, except for joiner-related options.
V3: models with the header “v3;true;true;true;<w>;</w>” (upcoming version)
- Only BPE-related options are stored in the header: -bpe_mode, -bpe_case_insensitive, -bpe_BOT_marker, -bpe_EOT_marker.
- No BPE options need to be specified when applying BPE with tokenize.lua: they are all loaded from the BPE model, and any BPE-related command-line options are overridden.
- Users should still make sure that the tokenizer options given to learn_bpe.lua and tokenizer.lua are compatible with each other.
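Summarising the three formats, the model version can in principle be guessed from the first line of the model file. The sketch below (Python for illustration; the real logic lives in the Lua tokenizer, and these heuristics are assumptions, not the actual detection code) captures the rules described above:

```python
def detect_bpe_model_version(first_line):
    """Guess the BPE model format from the first line of the model file.

    - v3 headers start with the literal 'v3' marker and carry all BPE
      options (mode, case handling, BOT/EOT markers), e.g.
      'v3;true;true;true;<w>;</w>'.
    - v1 headers are 4 ';'-separated fields ending in a tokenization
      mode, e.g. 'true;true;true;conservative'.
    - v2 models have no header: the first line is already a merge
      rule, and every option must come from the command line.
    """
    fields = first_line.strip().split(";")
    if fields[0] == "v3" and len(fields) == 6:
        return "v3"  # all BPE command-line options are overridden
    if len(fields) == 4:
        return "v1"  # -bpe_mode/-bpe_case_insensitive overridden, -mode not
    return "v2"      # no header: nothing overridden
```

With a v3 model, the header wins over the command line; with a v2 model, the command line is the only source of options.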
All your suggestions are welcome.