Is the BPE model applied before tokenization?

I am running some experiments with a BPE model. My training corpus is originally truecased and not tokenized.

To build the translation engine, I first train the BPE model on the original corpus.
Then I tokenize the training, dev and test sets using the case_feature and bpe_model options.

The problem is that, because the BPE model is trained on the original corpus, it contains entries such as “t .”, “s ,”, “you ’” etc. In other words, some entries include punctuation marks.

After training, I test my model using the following command:
th tools/rest_translation_server.lua -gpuid 1 -model _epoch19_3.45.t7 -case_feature -port 9000 -host -joiner_annotate -bpe_model bpe_config.en -segment_numbers

The tokenization is different in the following examples:
cats,dogs,frogs -> cat■│L s■│L ,■│N dog■│L s■│L ,■│N frogs│L
cats, dogs, frogs -> cats│L ■,│N dogs│L ■,│N frogs│L

The tokenization changes depending on whether there is a space after the comma. It seems that tokenizer.lua first applies the BPE model and only afterwards performs the usual whitespace tokenization. Shouldn’t it be the other way around: tokenize first and apply the BPE model afterwards?

Should I perhaps train the BPE model on a tokenized, lowercased corpus to avoid entries such as “t .”?


Hi @Sasanita!
The BPE model learns the most frequent character combinations so that it can later segment a given word.
If you want punctuation marks such as ! . , ; : ’ to be treated as isolated words, because you don’t want BPE segments like ‘t,’ or ‘s.’, you need to tokenize your data first and learn the BPE model afterwards.
As I see it, it makes more sense to first tokenize, truecase, etc. your data and then learn/apply the BPE model.
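
To illustrate why the order matters when learning BPE, here is a minimal Python sketch. It is not OpenNMT's or subword-nmt's implementation: `pair_counts` and `toy_tokenize` are hypothetical helpers that only count the adjacent character pairs a BPE learner would consider merging, and the corpus is made up.

```python
import re
from collections import Counter


def pair_counts(words):
    """Count adjacent character pairs inside each word (the merge candidates)."""
    counts = Counter()
    for word in words:
        symbols = list(word)
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += 1
    return counts


def toy_tokenize(text):
    """Hypothetical tokenizer: every punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up corpus, for illustration only.
corpus = "I can't. You can't. It's his, it's hers."

raw_pairs = pair_counts(corpus.split())        # BPE statistics on untokenized text
tok_pairs = pair_counts(toy_tokenize(corpus))  # BPE statistics on tokenized text

print(raw_pairs[("t", ".")])  # 2: "t ." is a merge candidate on raw text
print(tok_pairs[("t", ".")])  # 0: punctuation is isolated, so no such pair exists
```

On the untokenized text, pairs such as “t .” and “s ,” are counted and can become merges. After tokenization each punctuation mark is a one-character token, so it never takes part in a pair.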

Anyway, the important thing here is to apply tokenization, truecasing and BPE at translation time in the same order as you did for your training data.
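
A small Python sketch of that point, under stated assumptions: `split_punct` and `toy_segment` are hypothetical stand-ins for tokenization and BPE segmentation (they are not OpenNMT functions), and `toy_segment` only splits a plural “s” off purely alphabetic tokens. The same steps in a different order give a different result, so the model would see input it was not trained on.

```python
import re


def split_punct(tokens):
    """Stand-in for tokenization: isolate every punctuation mark."""
    out = []
    for token in tokens:
        out.extend(re.findall(r"\w+|[^\w\s]", token))
    return out


def toy_segment(tokens):
    """Stand-in for BPE: split a trailing "s" off purely alphabetic tokens."""
    out = []
    for token in tokens:
        if token.isalpha() and len(token) > 2 and token.endswith("s"):
            out.extend([token[:-1], "s"])
        else:
            out.append(token)
    return out


def run(line, steps):
    """Apply the preprocessing steps to a line, in the given order."""
    tokens = line.split()
    for step in steps:
        tokens = step(tokens)
    return tokens


# Define the order once and reuse it for training data and for translation input.
TRAIN_STEPS = (split_punct, toy_segment)

same_order = run("cats,dogs,frogs", TRAIN_STEPS)
swapped = run("cats,dogs,frogs", tuple(reversed(TRAIN_STEPS)))

print(same_order)  # ['cat', 's', ',', 'dog', 's', ',', 'frog', 's']
print(swapped)     # ['cats', ',', 'dogs', ',', 'frogs']
```

Keeping the step list in one place and reusing it for both training and inference is one simple way to avoid such a mismatch.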

Looking at the rest_translation_server.lua code, it seems that it first tokenizes the input and then applies BPE to the tokenized input.

However, some time ago @icortes reported some problems with the tokenization process when using rest_translation_server.lua here:

Maybe that post can help you too.



The default tokenization mode (“conservative” mode) does not split “cats,dogs,frogs”:

$ echo "cats,dogs,frogs" | th tools/tokenize.lua
cats,dogs,frogs
Tokenization completed in 0.001 seconds - 1 sentences

This is to keep tokens like “10,000”.

When using BPE, it seems common to set the “aggressive” tokenization mode instead:

$ echo "cats,dogs,frogs" | th tools/tokenize.lua -mode aggressive
cats , dogs , frogs
Tokenization completed in 0.001 seconds - 1 sentences
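
The difference between the two modes can be approximated with a short Python sketch. This is a toy approximation and not the actual rules of tools/tokenize.lua: `toy_conservative` and `toy_aggressive` are hypothetical names, and joiner markers and case features are not modeled.

```python
import re


def toy_conservative(text):
    """Split on whitespace and peel punctuation off the edges of each chunk only.

    Punctuation between letters or digits stays attached, so "10,000" survives.
    """
    tokens = []
    for chunk in text.split():
        lead, core, trail = re.match(
            r"^([^\w\s]*)(.*?)([^\w\s]*)$", chunk
        ).groups()
        tokens.extend(lead)   # each leading punctuation mark is its own token
        if core:
            tokens.append(core)
        tokens.extend(trail)  # each trailing punctuation mark is its own token
    return tokens


def toy_aggressive(text):
    """Keep only runs of letters/digits together; every other symbol is a token."""
    return re.findall(r"\w+|[^\w\s]", text)


print(toy_conservative("cats,dogs,frogs"))    # ['cats,dogs,frogs']
print(toy_conservative("cats, dogs, frogs"))  # ['cats', ',', 'dogs', ',', 'frogs']
print(toy_conservative("10,000"))             # ['10,000']
print(toy_aggressive("cats,dogs,frogs"))      # ['cats', ',', 'dogs', ',', 'frogs']
print(toy_aggressive("10,000"))               # ['10', ',', '000']
```

In this sketch the aggressive mode gives the same tokens whether or not a space follows the comma, which is the consistency you want before BPE is applied. The cost is that numbers like “10,000” are split as well.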