BytePairEncoding

Implement BPE (looks like it is almost done)

tokenizer.lua can now apply a BPE model as follows:

echo "Les chaussettes de l'archiduchesse sont-elles sèches ?" | 
    th tools/tokenize.lua -mode aggressive -bpe_model test/tokenization/fr500.bpe -joiner_annotate
Les ch■ au■ ss■ et■ tes de l ■'■ ar■ ch■ i■ du■ ch■ es■ se sont ■-■ elles s■ è■ ch■ es ?

■ is the marker showing the “split”.
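
For intuition, here is a minimal sketch of how a list of BPE merge operations can be applied to a single word. This is a hypothetical illustration, not the actual tokenizer.lua code: it assumes an ASCII-only word, ignores the end-of-word marker that real BPE models use, and reads a model file with one merge pair per line, highest priority first.

-- Hypothetical sketch of applying BPE merges to one word (not tokenizer.lua).
local function loadMerges(path)
  -- one "a b" pair per line; earlier lines have higher merge priority
  local merges, rank = {}, 0
  for line in io.lines(path) do
    local a, b = line:match("^(%S+) (%S+)$")
    if a then
      rank = rank + 1
      merges[a .. " " .. b] = rank
    end
  end
  return merges
end

local function applyBPE(word, merges)
  -- start from single characters (bytes here, so ASCII only)
  local parts = {}
  for c in word:gmatch(".") do parts[#parts + 1] = c end
  while true do
    -- find the adjacent pair with the highest priority (lowest rank)
    local bestRank, bestIdx
    for i = 1, #parts - 1 do
      local r = merges[parts[i] .. " " .. parts[i + 1]]
      if r and (bestRank == nil or r < bestRank) then
        bestRank, bestIdx = r, i
      end
    end
    if not bestIdx then break end
    -- merge the winning pair and repeat
    parts[bestIdx] = parts[bestIdx] .. parts[bestIdx + 1]
    table.remove(parts, bestIdx + 1)
  end
  return table.concat(parts, " ")
end

-- e.g. applyBPE("chaussettes", ...) would yield pieces like "ch au ss et tes";
-- the ■ joiner annotation is then added on top of that split.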

This option is compatible with the case feature option:

echo "Les chaussettes de l'archiduchesse sont-elles sèches ?" | 
   th tools/tokenize.lua -case_feature -mode aggressive -bpe_model test/tokenization/fr500.bpe -joiner_annotate
les│C ch■│L au■│L ss■│L et■│L tes│L de│L l│L ■'■│N ar■│L ch■│L i■│L du■│L ch■│L es■│L se│L sont│L ■-■│N elles│L s■│L è■│L ch■│L es│L ?│N

And since the tokenization is reversible, the following regenerates the original sentence:

echo "Les chaussettes de l'archiduchesse sont-elles sèches ?" | 
   th tools/tokenize.lua -case_feature -mode aggressive -bpe_model test/tokenization/fr500.bpe -joiner_annotate | 
   th tools/detokenize.lua -case_feature
Les chaussettes de l'archiduchesse sont-elles sèches ?
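
The reversibility comes entirely from the joiner: a ■ glued to a token means “attach this token to its neighbor on that side”. Ignoring the case feature (which detokenize.lua restores separately), the core rule can be sketched as:

-- Simplified sketch of joiner removal, not the actual detokenize.lua code:
-- "x■ y" and "x ■y" both mean "glue x and y back together".
local function detokenize(line)
  return (line:gsub("■ ", ""):gsub(" ■", ""))
end

print(detokenize("Les ch■ au■ ss■ et■ tes de l ■'■ ar■ ch■ i■ du■ ch■ es■ se sont ■-■ elles s■ è■ ch■ es ?"))
-- Les chaussettes de l'archiduchesse sont-elles sèches ?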

See here for details about the tokenizer.

Can I actually train a BPE model with OpenNMT or only use an existing one?

For the moment, you first have to train the BPE model with Sennrich’s Subword Neural Machine Translation scripts here. We are working on a BPE model trainer with some new features (for instance, support for using BPE with Chinese) that will be released soon.
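
For reference, the core idea behind those scripts (learn_bpe.py reads a corpus on stdin and writes the merge operations to stdout) is simple: repeatedly find the most frequent adjacent symbol pair and record it as a merge. Below is a hypothetical sketch of one such step, assuming a vocab table that maps space-separated symbol sequences to their corpus frequencies.

-- Hypothetical sketch of one BPE training step, not the real learn_bpe.py.
-- vocab maps symbol sequences to frequencies, e.g. "c h a t" -> 12.
local function mostFrequentPair(vocab)
  local counts, best, bestCount = {}, nil, 0
  for word, freq in pairs(vocab) do
    local syms = {}
    for s in word:gmatch("%S+") do syms[#syms + 1] = s end
    for i = 1, #syms - 1 do
      local p = syms[i] .. " " .. syms[i + 1]
      counts[p] = (counts[p] or 0) + freq
      if counts[p] > bestCount then best, bestCount = p, counts[p] end
    end
  end
  return best  -- becomes the next line of the .bpe model file
end

Each selected pair is merged throughout the vocab before the next iteration; repeating this N times gives an N-operation model such as fr500.bpe (presumably 500 merges).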

Great! Thanks for your quick reply!