Tokenization: separator on suffix or prefix

The supplied BPE tokenizer in the PyTorch version adds the separator at the end of non-last subword units. However, depending on the language, you may be linguistically more inclined to add the separator to the front of non-first subword units, or even to use both prefixes and suffixes. I see that the Lua tokenizer allows some freedom in the prefix/suffix distinction, but that's about it.
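To make the distinction concrete, here is a tiny sketch (not the actual apply_bpe.py logic, just an illustration of the two placements on already-split pieces):

```python
# Hypothetical illustration of separator placement; not the real apply_bpe.py code.
def join_suffix(pieces, sep="@@"):
    # Separator at the end of every non-last piece (current PyTorch behaviour).
    return [p + sep for p in pieces[:-1]] + [pieces[-1]]

def join_prefix(pieces, sep="@@"):
    # Separator at the front of every non-first piece.
    return [pieces[0]] + [sep + p for p in pieces[1:]]

pieces = ["hel", "lo"]
print(join_suffix(pieces))  # ['hel@@', 'lo']
print(join_prefix(pieces))  # ['hel', '@@lo']
```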

My question is not necessarily whether these options will also be ported to the apply_bpe.py scripts (although that would be interesting), but rather whether any research has been done on the position of the separator. Is performance related to the position? Is it language-dependent? Etc.

We made some brief tests a long time ago with prepended separators for a few languages (De, Hu) and got slightly worse BLEU scores. But I would not consider that conclusive. We have not gone into the issue any further.

Thanks for the reply! Is there any theoretical reason to choose one over the other?

Well, I don’t really know. Initially we had the vague idea that if you prepend the separator, then in morphologically rich languages like Hungarian you might avoid a lot of stems being added to the vocabulary in two forms (‘stem’ and ‘stem@@’), because the separator would be attached to the suffixes: you would only have ‘stem’ + ‘@@suf’ in the vocabulary instead of ‘stem@@’ + ‘suf’ plus the standalone ‘stem’. And since you don’t normally have ‘suf’ as a separate word in the language, with prepended separators you might end up with a more compact vocabulary. But in the end it didn’t seem to help.
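To make the counting concrete, a toy sketch (the words and segmentations are just illustrative, this is not how a BPE vocabulary is actually learned):

```python
# Toy illustration of the vocabulary-size argument with made-up segmentations.
words = [["ház"], ["ház", "ban"], ["ház", "ból"], ["kert"], ["kert", "ben"]]

def vocab(words, prepend):
    entries = set()
    for pieces in words:
        for i, p in enumerate(pieces):
            if prepend:
                entries.add("@@" + p if i > 0 else p)                # 'stem', '@@suf'
            else:
                entries.add(p + "@@" if i < len(pieces) - 1 else p)  # 'stem@@', 'suf'
    return entries

print(sorted(vocab(words, prepend=False)))
# ['ban', 'ben', 'ból', 'ház', 'ház@@', 'kert', 'kert@@']  -> each stem appears in two forms
print(sorted(vocab(words, prepend=True)))
# ['@@ban', '@@ben', '@@ból', 'ház', 'kert']               -> one form per stem
```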

That’s a fair assumption, although “suffixes” can surely be present in the vocabulary if you have a compounding language (I am not familiar with Hungarian, so I’m not sure whether it has agglutinative compounds).

Sure, compounds happily complicate the issue further. So there might not be a theoretical justification for either of the approaches, and you can’t avoid some experiments in concrete cases to find a specific optimal setup.


The OpenNMT Tokenizer also has an option to detach the separator from the tokens:

https://github.com/OpenNMT/Tokenizer/blob/master/docs/options.md#joiner_new-boolean-default-false

This solves the vocabulary issue but increases the length of the sequences.
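Roughly, the idea is that the joiner becomes its own token, so neither ‘stem@@’ nor ‘@@suf’ variants enter the vocabulary, at the cost of one extra token per joint. A hypothetical sketch of the difference (not the Tokenizer’s actual implementation):

```python
# Hypothetical sketch of attached vs detached joiner; the real option is
# joiner_new in OpenNMT's Tokenizer, and "￭" is its default joiner marker.
def join_attached(pieces, sep="￭"):
    # Joiner attached to the preceding piece: ['ház￭', 'ban']
    return [p + sep for p in pieces[:-1]] + [pieces[-1]]

def join_detached(pieces, sep="￭"):
    # Joiner as a standalone token: ['ház', '￭', 'ban']
    out = []
    for i, p in enumerate(pieces):
        if i > 0:
            out.append(sep)
        out.append(p)
    return out

pieces = ["ház", "ban"]
print(join_attached(pieces))  # 2 tokens, but 'ház￭' is a separate vocabulary entry
print(join_detached(pieces))  # 3 tokens, but 'ház' keeps a single vocabulary form
```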