BytePairEncoding

Implement BPE (looks like it is almost done)

tokenizer.lua can now apply a BPE model as follows:

echo "Les chaussettes de l'archiduchesse sont-elles sèches ?" | 
    th tools/tokenize.lua -mode aggressive -bpe_model test/tokenization/fr500.bpe -joiner_annotate
Les ch■ au■ ss■ et■ tes de l ■'■ ar■ ch■ i■ du■ ch■ es■ se sont ■-■ elles s■ è■ ch■ es ?

■ is the marker showing the “split”.
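
For intuition, here is a minimal sketch of how a list of BPE merge operations can be applied to a single word. This is a hypothetical illustration, not the actual tokenizer.lua code: it assumes an ASCII-only word, ignores the end-of-word marker that real BPE models use, and reads a model file with one merge pair per line, highest priority first.

-- Hypothetical sketch of applying BPE merges to one word (not tokenizer.lua).
local function loadMerges(path)
  -- one "a b" pair per line; earlier lines have higher merge priority
  local merges, rank = {}, 0
  for line in io.lines(path) do
    local a, b = line:match("^(%S+) (%S+)$")
    if a then
      rank = rank + 1
      merges[a .. " " .. b] = rank
    end
  end
  return merges
end

local function applyBPE(word, merges)
  -- start from single characters (bytes here, so ASCII only)
  local parts = {}
  for c in word:gmatch(".") do parts[#parts + 1] = c end
  while true do
    -- find the adjacent pair with the highest priority (lowest rank)
    local bestRank, bestIdx
    for i = 1, #parts - 1 do
      local r = merges[parts[i] .. " " .. parts[i + 1]]
      if r and (bestRank == nil or r < bestRank) then
        bestRank, bestIdx = r, i
      end
    end
    if not bestIdx then break end
    -- merge the winning pair and repeat
    parts[bestIdx] = parts[bestIdx] .. parts[bestIdx + 1]
    table.remove(parts, bestIdx + 1)
  end
  return table.concat(parts, " ")
end

-- e.g. applyBPE("chaussettes", ...) would yield pieces like "ch au ss et tes";
-- the ■ joiner annotation is then added on top of that split.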

This option is compatible with the case feature option:

echo "Les chaussettes de l'archiduchesse sont-elles sèches ?" | 
   th tools/tokenize.lua -case_feature -mode aggressive -bpe_model test/tokenization/fr500.bpe -joiner_annotate
les│C ch■│L au■│L ss■│L et■│L tes│L de│L l│L ■'■│N ar■│L ch■│L i■│L du■│L ch■│L es■│L se│L sont│L ■-■│N elles│L s■│L è■│L ch■│L es│L ?│N

And since the tokenization is reversible, the following regenerates the original sentence:

echo "Les chaussettes de l'archiduchesse sont-elles sèches ?" | 
   th tools/tokenize.lua -case_feature -mode aggressive -bpe_model test/tokenization/fr500.bpe -joiner_annotate | 
   th tools/detokenize.lua -case_feature
Les chaussettes de l'archiduchesse sont-elles sèches ?
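
The reversibility comes entirely from the joiner: a ■ glued to a token means “attach this token to its neighbor on that side”. Ignoring the case feature (which detokenize.lua restores separately), the core rule can be sketched as:

-- Simplified sketch of joiner removal, not the actual detokenize.lua code:
-- "x■ y" and "x ■y" both mean "glue x and y back together".
local function detokenize(line)
  return (line:gsub("■ ", ""):gsub(" ■", ""))
end

print(detokenize("Les ch■ au■ ss■ et■ tes de l ■'■ ar■ ch■ i■ du■ ch■ es■ se sont ■-■ elles s■ è■ ch■ es ?"))
-- Les chaussettes de l'archiduchesse sont-elles sèches ?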

See here for details about the tokenizer.

Can I actually train a BPE model with OpenNMT or only use an existing one?

For the moment, you first have to train the BPE model with Sennrich’s Subword Neural Machine Translation scripts here. We are working on a BPE model trainer with some new features (for instance, support for using BPE with Chinese) that will be released soon.
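
For reference, the core idea behind those scripts (learn_bpe.py reads a corpus on stdin and writes the merge operations to stdout) is simple: repeatedly find the most frequent adjacent symbol pair and record it as a merge. Below is a hypothetical sketch of one such step, assuming a vocab table that maps space-separated symbol sequences to their corpus frequencies.

-- Hypothetical sketch of one BPE training step, not the real learn_bpe.py.
-- vocab maps symbol sequences to frequencies, e.g. "c h a t" -> 12.
local function mostFrequentPair(vocab)
  local counts, best, bestCount = {}, nil, 0
  for word, freq in pairs(vocab) do
    local syms = {}
    for s in word:gmatch("%S+") do syms[#syms + 1] = s end
    for i = 1, #syms - 1 do
      local p = syms[i] .. " " .. syms[i + 1]
      counts[p] = (counts[p] or 0) + freq
      if counts[p] > bestCount then best, bestCount = p, counts[p] end
    end
  end
  return best  -- becomes the next line of the .bpe model file
end

Each selected pair is merged throughout the vocab before the next iteration; repeating this N times gives an N-operation model such as fr500.bpe (presumably 500 merges).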

Great! Thanks for your quick reply!