Before making a comment, can you please explain what was wrong in this usage (based on current master)?
```
th tools/tokenize.lua -mode aggressive -nparallel 6 < $f > data/$file.rawtok

cat data/*.rawtok | th tools/learn_bpe.lua -tok_case_feature -size $bpe_size -bpe_mode both -save_bpe data/train-$sl$tl.bpe$bpe_size

th tools/tokenize.lua -case_feat -segment_case -mode aggressive -joiner_annotate -nparallel 6 $bpe_model < $f > data/$file.tok
```
It was not working: build_vocab.lua was generating vocabularies that were too small.
Am I missing -bpe_mode in the tokenize command line?
EDIT: indeed, this was the option missing from my tokenize command line.
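For future readers, a working final tokenization step would presumably look something like the sketch below (same variables as in my commands above; the key point is that `-bpe_mode both` must match the value passed to learn_bpe, otherwise the BPE model is applied inconsistently and downstream vocabularies come out too small):

```shell
# Sketch only: $f, $file and $bpe_model are the same shell variables as above.
# -bpe_mode both matches the learn_bpe step that trained the BPE model.
th tools/tokenize.lua -case_feat -segment_case -mode aggressive \
  -joiner_annotate -bpe_mode both -nparallel 6 \
  $bpe_model < $f > data/$file.tok
```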
You should forget about v2; it has been online for only two weeks, so very few users were impacted.
v3 is unnecessary.
We could stick to the old way, since many models were trained that way, and only slightly modify the new way: make your standard markers the default ones, and allow them to be overridden by options.