The supplied BPE tokenizer in the PyTorch version appends the separator to the end of non-last subword units. However, depending on the language, you may be linguistically more inclined to prepend the separator to non-first subword units, or even to use both prefixes and suffixes. I see that the Lua tokenizer allows more freedom in the prefix/suffix distinction, but that's about it.
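To make the distinction concrete, here is a minimal sketch of the three conventions I mean (the function name, the `@@` marker, and the example segmentation are just for illustration, not taken from apply_bpe.py):

```python
def mark_subwords(pieces, marker="@@", mode="suffix"):
    """Attach a separator to the subword units of one word.

    mode="suffix": marker on all non-last pieces (current behaviour).
    mode="prefix": marker on all non-first pieces.
    mode="both":   suffix on non-last AND prefix on non-first pieces.
    """
    out = []
    last = len(pieces) - 1
    for i, piece in enumerate(pieces):
        if mode in ("suffix", "both") and i != last:
            piece = piece + marker
        if mode in ("prefix", "both") and i != 0:
            piece = marker + piece
        out.append(piece)
    return out

pieces = ["un", "believ", "able"]
print(mark_subwords(pieces, mode="suffix"))  # ['un@@', 'believ@@', 'able']
print(mark_subwords(pieces, mode="prefix"))  # ['un', '@@believ', '@@able']
print(mark_subwords(pieces, mode="both"))    # ['un@@', '@@believ@@', '@@able']
```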
My question is not necessarily whether these options will also be ported to the apply_bpe.py script (although that would be interesting), but rather whether any research has been done on the position of the separator. Is performance related to the position? Is it language-dependent? Etc.