The recent change in the way BPE-related options are handled in learn_bpe.lua and tokenize.lua has caused some issues within the OpenNMT community, so we are adding this topic to inform those concerned by the change about the current situation.
There have been 3 versions of OpenNMT’s BPE models:
V1: models with the header “true;true;true;conservative”
The original purpose of this header was to ensure compatibility between learning a BPE model and applying it to raw text.
Still supported in tokenize.lua: with these models, the following command-line options are overridden: -bpe_mode and -bpe_case_insensitive. However, the 4th header field, ‘conservative’, is ignored, so anyone still using a v1 model has to pass the ‘-mode’ option with the appropriate value on the tokenize.lua command line to make it work with the current master of OpenNMT (see the example below).
Deprecated in learn_bpe.lua because it caused confusion in practice.
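For instance, applying such a v1 model with the current master could look roughly like this (a sketch only; the file names are placeholders and the ‘conservative’ value is taken from the header above):

    th tools/tokenize.lua -mode conservative -bpe_model old_v1.bpe < corpus.raw > corpus.tok

Without the explicit -mode conservative, the tokenization may no longer match the one used when the BPE model was learned.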
V2: models with no header
Current master version
All tokenizer and BPE options must be given explicitly on the tokenize.lua command line.
The tokenizer and BPE options passed to tokenize.lua should be the same as those used for learn_bpe.lua, except for the joiner-related options (see the sketch below).
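A minimal sketch of a consistent v2 workflow, assuming the usual -size/-save_bpe flags of learn_bpe.lua and assuming it accepts the tokenizer options directly (as mentioned later in this thread); the flag values and file names are illustrative, not prescriptive:

    th tools/learn_bpe.lua -size 30000 -save_bpe bpe_codes -mode aggressive -bpe_mode suffix < corpus.raw
    th tools/tokenize.lua -mode aggressive -bpe_mode suffix -bpe_model bpe_codes -joiner_annotate < corpus.raw > corpus.tok

Only the joiner-related options (here -joiner_annotate) are allowed to differ between the two command lines.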
V3: models with the header “v3;true;true;true;<w>;</w>” (upcoming version)
Only the BPE-related options are stored in the header: -bpe_mode, -bpe_case_insensitive, -bpe_BOT_marker, -bpe_EOT_marker.
There is no need to specify BPE options for tokenize.lua when applying BPE: they are all loaded from the BPE model, and any BPE-related command-line options are overridden.
Users still need to make sure the tokenizer options are consistent between learn_bpe.lua and tokenize.lua (see the sketch below).
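With a v3 model, the v2 sketch above would reduce to something like this (again with illustrative file names):

    th tools/tokenize.lua -mode aggressive -bpe_model bpe_codes -joiner_annotate < corpus.raw > corpus.tok

The -bpe_mode, -bpe_case_insensitive, -bpe_BOT_marker and -bpe_EOT_marker values come from the model header; only the non-BPE tokenizer options (such as -mode) still have to match the ones used with learn_bpe.lua.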
It was not working (it generated vocabularies that were too small with build_vocab.lua).
Am I missing -bpe_mode on the tokenize.lua command line?
EDIT: indeed, this was the option missing from my tokenize.lua command line.
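For the record, the fix was along these lines (a sketch; the ‘suffix’ value and file names are assumptions and should match whatever was used with learn_bpe.lua):

    th tools/tokenize.lua -bpe_model bpe_codes -bpe_mode suffix < corpus.raw > corpus.tok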
IMO:
you should forget about v2; it has been online for only 2 weeks, so very few users were impacted.
v3 is unnecessary.
We could stick to the old way, since many models were trained that way, and the new way could be only slightly modified: make your standard markers the defaults, and allow them to be forced through options.
Thanks for your response; you confirm that -bpe_mode is needed in both learn_bpe.lua and tokenize.lua.
As a suggestion for your current usage, it would be better to keep the same tokenizer options when preparing the rawtok files (for learn_bpe.lua) and the tok files (in your case, -segment_case is only used for the tok files but not for the rawtok ones). But you can keep your current usage, as the cases impacted by -segment_case may not occur very often in your data.
You can also pass the same tokenizer options to learn_bpe.lua to learn the BPE model directly from the raw text (see the sketch below).
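A minimal sketch of that last suggestion, reusing the hypothetical flags from above (-size, -save_bpe and the file names are assumptions):

    th tools/learn_bpe.lua -size 30000 -save_bpe bpe_codes -mode aggressive -segment_case < corpus.raw
    th tools/tokenize.lua -mode aggressive -segment_case -bpe_model bpe_codes -joiner_annotate < corpus.raw > corpus.tok

This way the text is tokenized the same way for learning the BPE codes and for producing the final tok files.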