Tokenization: separator on suffix or prefix

The supplied BPE tokenizer in the PyTorch version adds the separator at the end of non-last subword units. However, depending on the language, you may be linguistically more inclined to add the separator to the front of non-first subword units, or even to use both prefixes and suffixes. I see that the Lua tokenizer allows some freedom in the prefix/suffix distinction, but that's about it.
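To make the distinction concrete, here is a tiny sketch (not the actual apply_bpe.py logic, just an illustration of the two placements on already-split pieces):

```python
# Hypothetical illustration of separator placement; not the real apply_bpe.py code.
def join_suffix(pieces, sep="@@"):
    # Separator at the end of every non-last piece (current PyTorch behaviour).
    return [p + sep for p in pieces[:-1]] + [pieces[-1]]

def join_prefix(pieces, sep="@@"):
    # Separator at the front of every non-first piece.
    return [pieces[0]] + [sep + p for p in pieces[1:]]

pieces = ["hel", "lo"]
print(join_suffix(pieces))  # ['hel@@', 'lo']
print(join_prefix(pieces))  # ['hel', '@@lo']
```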

My question is not necessarily whether these options will also be ported to the apply_bpe.py scripts (although that would be interesting), but rather whether any research has been done on the position of the separator. Is performance related to the position? Is it language-dependent? Etc.

We made some brief tests a long time ago with prepended separators for a few languages (De, Hu) and got slightly worse BLEU scores. But I would not consider that conclusive. We have not gone into the issue any further.

Thanks for the reply! Is there any theoretical reason to choose one over the other?

Well, I don’t really know. Initially we had the vague idea that if you prepend the separator, then in morphologically rich languages like Hungarian you might avoid a lot of stems being added to the vocabulary in two forms (‘stem’ and ‘stem@@’), because the separator would be attached to the suffixes: you would only have ‘stem’ + ‘@@suf’ in the vocabulary instead of ‘stem@@’ + ‘suf’ plus the standalone ‘stem’. And since you don’t normally have ‘suf’ as a separate word in the language, with prepended separators you might end up with a more compact vocabulary. But in the end it didn’t seem to help.
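To make the counting concrete, a toy sketch (the words and segmentations are just illustrative, this is not how a BPE vocabulary is actually learned):

```python
# Toy illustration of the vocabulary-size argument with made-up segmentations.
words = [["ház"], ["ház", "ban"], ["ház", "ból"], ["kert"], ["kert", "ben"]]

def vocab(words, prepend):
    entries = set()
    for pieces in words:
        for i, p in enumerate(pieces):
            if prepend:
                entries.add("@@" + p if i > 0 else p)                # 'stem', '@@suf'
            else:
                entries.add(p + "@@" if i < len(pieces) - 1 else p)  # 'stem@@', 'suf'
    return entries

print(sorted(vocab(words, prepend=False)))
# ['ban', 'ben', 'ból', 'ház', 'ház@@', 'kert', 'kert@@']  -> each stem appears in two forms
print(sorted(vocab(words, prepend=True)))
# ['@@ban', '@@ben', '@@ból', 'ház', 'kert']               -> one form per stem
```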

That’s a fair assumption, although “suffixes” can surely be present in the vocabulary if you have a compounding language (I am not familiar with Hungarian, so I’m not sure whether it has agglutinative compounds).

Sure, compounds happily complicate the issue further. So there might not be a theoretical justification for either of the approaches, and you can’t avoid some experiments in concrete cases to find a specific optimal setup.


The OpenNMT Tokenizer also has an option to detach the separator from the tokens:

https://github.com/OpenNMT/Tokenizer/blob/master/docs/options.md#joiner_new-boolean-default-false

This solves the vocabulary issue but increases the length of the sequences.
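Roughly, the idea is that the joiner becomes its own token, so neither ‘stem@@’ nor ‘@@suf’ variants enter the vocabulary, at the cost of one extra token per joint. A hypothetical sketch of the difference (not the Tokenizer’s actual implementation):

```python
# Hypothetical sketch of attached vs detached joiner; the real option is
# joiner_new in OpenNMT's Tokenizer, and "￭" is its default joiner marker.
def join_attached(pieces, sep="￭"):
    # Joiner attached to the preceding piece: ['ház￭', 'ban']
    return [p + sep for p in pieces[:-1]] + [pieces[-1]]

def join_detached(pieces, sep="￭"):
    # Joiner as a standalone token: ['ház', '￭', 'ban']
    out = []
    for i, p in enumerate(pieces):
        if i > 0:
            out.append(sep)
        out.append(p)
    return out

pieces = ["ház", "ban"]
print(join_attached(pieces))  # 2 tokens, but 'ház￭' is a separate vocabulary entry
print(join_detached(pieces))  # 3 tokens, but 'ház' keeps a single vocabulary form
```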