BytePairEncoding


(srush) #1

Implement BPE (looks like it is almost done)


(jean.senellart) #3

tokenizer.lua can now apply a BPE model as follows:

echo "Les chaussettes de l'archiduchesse sont-elles sèches ?" | 
    th tools/tokenize.lua -mode aggressive -bpe_model test/tokenization/fr500.bpe -joiner_annotate
Les ch■ au■ ss■ et■ tes de l ■'■ ar■ ch■ i■ du■ ch■ es■ se sont ■-■ elles s■ è■ ch■ es ?

■ is the marker showing the split points.
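To illustrate what the tokenizer is doing, here is a minimal Python sketch (not OpenNMT's actual Lua implementation) of applying a ranked list of BPE merges to one word and then marking the splits with the ■ joiner, as -joiner_annotate does. The merge table below is a toy example, not the contents of fr500.bpe.

```python
def apply_bpe(word, merges):
    """Greedily apply ranked merge operations to a word's character list,
    then annotate every subword except the last with a trailing joiner."""
    symbols = list(word)
    # Lower index in the merge list = learned earlier = higher priority.
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        candidates = [(ranks.get(pair, float("inf")), i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no learned merge applies any more
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return [s + "■" for s in symbols[:-1]] + [symbols[-1]]

# Toy merge table (hypothetical; a real BPE model lists thousands of pairs).
merges = [("c", "h"), ("e", "s"), ("ch", "es")]
print(apply_bpe("seches", merges))  # → ['s■', 'e■', 'ches']
```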

The option is compatible with the case feature (-case_feature):

echo "Les chaussettes de l'archiduchesse sont-elles sèches ?" | 
   th tools/tokenize.lua -case_feature -mode aggressive -bpe_model test/tokenization/fr500.bpe -joiner_annotate
les│C ch■│L au■│L ss■│L et■│L tes│L de│L l│L ■'■│N ar■│L ch■│L i■│L du■│L ch■│L es■│L se│L sont│L ■-■│N elles│L s■│L è■│L ch■│L es│L ?│N

And since the tokenization is reversible, the following pipeline regenerates the original sentence:

echo "Les chaussettes de l'archiduchesse sont-elles sèches ?" | 
   th tools/tokenize.lua -case_feature -mode aggressive -bpe_model test/tokenization/fr500.bpe -joiner_annotate | 
   th tools/detokenize.lua -case_feature
Les chaussettes de l'archiduchesse sont-elles sèches ?
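The reversibility comes entirely from the joiner annotation: a ■ at a token edge means "attach to the neighbouring token". A minimal Python sketch of that rule (illustration only; th tools/detokenize.lua is the real implementation):

```python
def detokenize(tokens, joiner="■"):
    """Rejoin annotated subwords: a joiner at a token edge means the token
    glues to its neighbour instead of being separated by a space."""
    out = []
    attach = False  # set when the previous token ended with a joiner
    for tok in tokens:
        glue_left = tok.startswith(joiner)
        core = tok.strip(joiner)
        if out and (attach or glue_left):
            out[-1] += core  # glue to the previous word
        else:
            out.append(core)  # start a new word
        attach = tok.endswith(joiner)
    return " ".join(out)

bpe_output = ("Les ch■ au■ ss■ et■ tes de l ■'■ ar■ ch■ i■ du■ ch■ es■ se "
              "sont ■-■ elles s■ è■ ch■ es ?")
print(detokenize(bpe_output.split()))
# → Les chaussettes de l'archiduchesse sont-elles sèches ?
```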

See here for details about the tokenizer.


#4

Can I actually train a BPE model with OpenNMT or only use an existing one?


(jean.senellart) #5

For the moment, you first have to train the BPE model with Sennrich’s Subword Neural Machine Translation scripts here. We are working on a BPE model trainer with some new features (for instance, support for using BPE on Chinese) that will be released soon.
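For intuition, training a BPE model boils down to repeatedly merging the most frequent adjacent symbol pair in a word-frequency dictionary. A hedged Python sketch of that Sennrich-style loop (toy word frequencies, not the announced OpenNMT trainer):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """word_freqs: {word: count}. Returns the ordered list of learned merges."""
    # Represent each word as a tuple of symbols, initially its characters.
    vocab = {tuple(w): c for w, c in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged everywhere.
        merged = best[0] + best[1]
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Toy corpus counts (hypothetical).
print(learn_bpe({"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}, 3))
# → [('u', 'g'), ('u', 'n'), ('h', 'ug')]
```

The merges are emitted in learned order, which is exactly the priority order that apply-time tokenization uses.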


#6

Great! Thanks for your quick reply!