The supplied BPE tokenizer in the PyTorch version adds the separator at the end of non-final subword units. However, depending on the language, you may be linguistically more inclined to add the separator to the front of non-initial subword units, or even to use both prefixes and suffixes. I see that the Lua tokenizer allows more freedom in the prefix/suffix distinction, but that's about it.
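To illustrate what I mean (this is just my own sketch, not the actual apply_bpe.py code, and the helper names are invented), here is the difference between the two conventions for a single segmented word:

```python
def mark_suffix(pieces, sep="@@"):
    # separator appended to every non-final piece: ['hous', 'es'] -> ['hous@@', 'es']
    return [p + sep for p in pieces[:-1]] + [pieces[-1]]

def mark_prefix(pieces, sep="@@"):
    # separator prepended to every non-initial piece: ['hous', 'es'] -> ['hous', '@@es']
    return [pieces[0]] + [sep + p for p in pieces[1:]]

pieces = ["hous", "es"]
print(mark_suffix(pieces))  # ['hous@@', 'es']
print(mark_prefix(pieces))  # ['hous', '@@es']
```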
My question is not necessarily whether these options will also be ported to the apply_bpe.py script (although that would be interesting), but rather whether any research has been done on the position of the separator. Is performance related to the position? Is it language-dependent? Etc.
We made some brief tests a long time ago with prepended separators for a few languages (De, Hu) and got slightly worse BLEU scores, but I would not consider that conclusive. We have not gone into the issue any further.
Well, I don’t really know. Initially we had the vague idea that if you prepend the separator, then in morphologically rich languages like Hungarian you might avoid a lot of stems being added to the vocabulary in two forms (‘stem’ and ‘stem@@’), because the separator would be attached to the suffixes: the vocabulary only needs ‘stem’ and ‘@@suf’, instead of ‘stem@@’ and ‘suf’ plus the bare ‘stem’. And since you don’t normally have ‘suf’ as a separate word in the language, with prepended separators you might end up with a more compact vocabulary. But in the end it didn’t seem to help.
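As a toy sketch of that argument (the Hungarian-like segmentations and the helper functions are my own made-up illustration, not the output of a real BPE model), the prefix convention stores the stem only once:

```python
def mark_suffix(pieces, sep="@@"):
    # separator appended to every non-final piece
    return [p + sep for p in pieces[:-1]] + [pieces[-1]]

def mark_prefix(pieces, sep="@@"):
    # separator prepended to every non-initial piece
    return [pieces[0]] + [sep + p for p in pieces[1:]]

# each word given by its assumed subword pieces; the bare stem also occurs on its own
corpus = [
    ["ház"],          # 'house'
    ["ház", "ban"],   # 'házban', 'in the house'
    ["ház", "ak"],    # 'házak', 'houses'
]

suffix_vocab = {piece for word in corpus for piece in mark_suffix(word)}
prefix_vocab = {piece for word in corpus for piece in mark_prefix(word)}

print(sorted(suffix_vocab))  # ['ak', 'ban', 'ház', 'ház@@'] -> stem stored twice
print(sorted(prefix_vocab))  # ['@@ak', '@@ban', 'ház']      -> stem stored once
```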
That’s a fair assumption, although “suffixes” can surely be present in the vocabulary as standalone words if you have a compounding language (I am not familiar with Hungarian, so I’m not sure whether it has agglutinative compounds).
Sure, compounds happily complicate the issue further. So there might not be a clear theoretical justification for either approach, and you can’t avoid running some experiments in concrete cases to find the optimal setup.