Thanks very much for your replies. Actually, I’m not using SentencePiece at all. What I do is pretokenise first and then tokenise using the Sennrich-based BPE model learner that the OpenNMT API provides.
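For context, the pipeline is essentially something like this (a simplified sketch using pyonmttok; the option values here are just placeholders, not my exact settings):

```python
import pyonmttok

# Base tokenisation: conservative mode with joiner annotation
# (simplified; my real setup applies its own pretokenisation first).
tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)

# Sennrich-style BPE learner exposed by the OpenNMT tokenizer API.
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
learner.ingest_file("train.txt")

# learn() writes the merge table and returns a tokenizer that applies it.
bpe_tokenizer = learner.learn("bpe-32k.model")
tokens, _features = bpe_tokenizer.tokenize("page 2")
```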
In any case, I agree with you, @miguelknals, that this is rather a formatting issue. However, hard spaces are closely linked to the language (for example, some end punctuation marks should be separated this way in French, and sometimes the thousands separator as well), and I’m pretty sure models can learn where to place/apply them. Actually, this could be quite beneficial for post-editing scenarios, since hard spaces are often difficult to check and correct (at least, more so than other, more evident in-line formatting tags).
That said, I’m not sure whether there is anything we can do with subword tokenisation, which does its job by interpreting spaces as word boundaries. On second thought, I would say this is rather a normalisation/pretokenisation issue, with the potential risk you mention of having an inconsistent representation of the same meaning (e.g. “page 2” versus “page ￭((nbsp)) ￭2”). But I suppose models could learn both patterns without major problems if there are enough samples.
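For instance, if one treated it purely as a normalisation matter, a pretokenisation step could simply collapse hard spaces into regular spaces before anything else (a minimal, purely illustrative sketch; it keeps the representation consistent at the cost of discarding exactly the information we are talking about):

```python
# Hypothetical normalisation step run before (pre)tokenisation: it maps
# non-breaking spaces (U+00A0, U+202F) to regular spaces so that
# "page 2" and "page\u00a02" end up with the same token sequence.
HARD_SPACE_CHARS = {"\u00a0", "\u202f"}

def normalise_hard_spaces(text: str) -> str:
    return "".join(" " if ch in HARD_SPACE_CHARS else ch for ch in text)

assert normalise_hard_spaces("page\u00a02") == "page 2"
```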
Maybe it does not make much sense, but it would be nice if there were a way to encode hard spaces in some special manner that retains the extra information they convey, while avoiding a special token surrounded by joiners to represent them. For example, a special mark that preserves the space (so models can generalise well), but translates into a hard space instead of a regular space when detokenising. I suppose this has implications, but I wonder if this could even be considered a potential improvement/option for the OpenNMT tokeniser…
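To make the idea a bit more concrete, here is one possible realisation sketched as plain pre/post-processing outside the tokeniser (the flag-per-word scheme is purely hypothetical, not an existing OpenNMT option): the hard space is turned into a regular space, so the token sequence is identical to the plain-space version, and a parallel flag sequence remembers where the hard spaces were so they can be restored at detokenisation time.

```python
import re
from typing import List, Tuple

HARD_SPACE_RE = re.compile(r"[\u00a0\u202f]")  # NBSP and narrow NBSP

def encode_hard_spaces(text: str) -> Tuple[str, List[bool]]:
    """Replace hard spaces with regular spaces and return, for each word,
    a flag saying whether the separator before it was a hard space."""
    parts = re.split(r"([ \u00a0\u202f]+)", text)  # keep the separators
    words, flags = [], []
    pending_hard = False
    for part in parts:
        if not part:
            continue
        if re.fullmatch(r"[ \u00a0\u202f]+", part):
            pending_hard = bool(HARD_SPACE_RE.search(part))
        else:
            words.append(part)
            flags.append(pending_hard)
            pending_hard = False
    return " ".join(words), flags

def restore_hard_spaces(detok: str, flags: List[bool]) -> str:
    """Re-insert hard spaces in front of the words that were flagged."""
    words = detok.split(" ")
    out = []
    for word, hard in zip(words, flags):
        sep = "\u00a0" if (hard and out) else (" " if out else "")
        out.append(sep + word)
    return "".join(out)

text = "page\u00a02"
plain, flags = encode_hard_spaces(text)
assert plain == "page 2" and flags == [False, True]
assert restore_hard_spaces(plain, flags) == text
```

Of course, this only round-trips on the same sentence; in translation the flags would have to be carried or predicted on the target side (e.g. as some kind of word feature), which is exactly where it stops being a pre/post-processing trick and becomes a tokeniser/model question.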