Hard spaces lost when tokenizing

Hi everyone,

I just noticed that hard spaces (i.e. non-breaking spaces) are lost when tokenising the data using the OpenNMT Tokenizer and BPE, which is a bit of a problem for my use case. The reason seems to be that the vocabularies created via BPE normalise those hard spaces to normal spaces, so hard spaces do not appear in the vocabulary like other tokens do. I know I could protect/restore them with pre- and postprocessing scripts, but I would like to know whether a more straightforward solution exists, and if not, whether you see any undesirable side effect in protecting them as special tokens (e.g. the model being confused by longer token sequences without normal spaces).

Many thanks in advance.

Hello, in this case I guess you will have to ask directly on the SentencePiece GitHub, as the problem concerns SentencePiece.

I think the OpenNMT Tokenizer and SentencePiece are different options, so I’m not sure this is just a SentencePiece issue.

In any case, the non-breaking space is more of a formatting issue than a translation issue; in the end it is just a “blank”. Also, a non-breaking space that does not behave as a blank can create issues. For instance, “Page 2” with a non-breaking space that is not treated as a blank will remain a single “Page 2” token, which is strange behaviour when there will probably also be “Page 2” instances with plain blanks that are (correctly) segmented into “Page” and “2”.

I think SentencePiece has an option that allows some custom normalization (maybe to replace your non-breaking spaces with something else, for instance a symbol). If not, probably a custom script.
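
If SentencePiece were used here, I believe the relevant option is normalization_rule_tsv, which takes a custom normalization table. A rough, untested sketch of what that could look like (file names and the chosen replacement codepoint are just examples; as far as I can tell, a custom TSV replaces the default nmt_nfkc rules rather than extending them, so you may want to start from the nmt_nfkc.tsv file shipped in the SentencePiece repository):

    # Sketch: train a SentencePiece model that maps U+00A0 to a visible
    # placeholder instead of a plain space. All names below are examples.
    import sentencepiece as spm

    with open("nbsp_rule.tsv", "w", encoding="utf-8") as f:
        # format: source codepoint(s) in hex, a tab, then target codepoint(s) in hex
        f.write("00A0\t2423\n")  # map the no-break space to U+2423 (open box)

    spm.SentencePieceTrainer.train(
        input="train.txt",               # hypothetical training corpus
        model_prefix="spm_nbsp",
        vocab_size=32000,
        normalization_rule_tsv="nbsp_rule.tsv",
    )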

Thanks very much for your replies. Actually, I’m not using SentencePiece at all. What I do is pretokenise first, then tokenise using the Sennrich-based BPE model learner that the OpenNMT API provides.

In any case, I agree with you, @miguelknals, that this is rather a formatting issue. However, hard spaces are closely linked to the language (for example, some end punctuation marks should be separated this way in French, and sometimes the thousands separator as well), and I’m pretty sure models can learn where to place/apply them. Actually, this could be quite beneficial for post-editing scenarios, since hard spaces are often difficult to check and correct (at least, more so than other, more evident in-line formatting tags).

That said, I’m not sure whether there is anything we can do with subword tokenisation, which does its job interpreting spaces as word boundaries. On second thought, I would say this is rather a normalisation/pretokenisation issue, with the potential risk you mention of having inconsistent representations of the same meaning (e.g. “page 2” versus “page ■((nbsp)) ■2”). But I suppose models could learn both patterns without major problems if there are enough samples.

Maybe it does not make much sense, but it would be nice if there were a way to encode hard spaces in some special manner that retains the extra information they convey while avoiding the use of a special token surrounded by joiners to represent them. For example, a special mark that preserves the space (so models can generalise well) but translates into a hard space instead of a regular space when detokenising. I suppose this would have implications, but I wonder whether this could even be considered as a potential improvement/option for the OpenNMT tokeniser…

Non-breaking spaces are mostly formatting details, so they should be treated differently in the source and the target data.

In the source, I think you always want to ignore them, as they don’t convey information. Otherwise the model would have to learn multiple representations of the same sentence, as you pointed out.

In the target, it makes sense to handle these characters so that the output is formatted correctly. You could replace these characters with special tokens and then restore them in the output. But that means you would have to check that your target training data always uses non-breaking spaces correctly, otherwise they would not be generated consistently. That’s why it’s generally easier to ignore them.
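
A minimal sketch of that placeholder approach (the token name and the pre/postprocessing are only examples; you would also need to make sure the placeholder is not split by the subword model, e.g. by registering it as a protected/special token):

    # Sketch only: protect hard spaces in the target with an arbitrary
    # placeholder token, then restore them after detokenization.
    NBSP = "\u00a0"
    PLACEHOLDER = "⟨nbsp⟩"  # arbitrary name; must be kept whole by the subword model

    def preprocess_source(text):
        # source side: treat hard spaces as plain spaces
        return text.replace(NBSP, " ")

    def preprocess_target(text):
        # target side: keep the information as a separate token
        return text.replace(NBSP, " " + PLACEHOLDER + " ")

    def postprocess_target(text):
        # after detokenization: turn the placeholder back into a hard space
        text = text.replace(" " + PLACEHOLDER + " ", NBSP)
        # fallback in case the spacing around the placeholder was altered
        return text.replace(PLACEHOLDER, NBSP)

    print(preprocess_target("Page\u00a02"))     # Page ⟨nbsp⟩ 2
    print(postprocess_target("Page ⟨nbsp⟩ 2"))  # Page 2 (with a hard space)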

Alternatively, you could write a few postprocessing rules to convert standard spaces to non-breaking spaces when required (at least for French, you could easily cover most cases with a few rules).
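
A rough sketch of what such rules could look like for French (the rule set and the regular expressions are only illustrative, not an exhaustive treatment of French typography; they only convert plain spaces already present in the output):

    import re

    NBSP = "\u00a0"

    def french_hard_spaces(text):
        # hard space before "double" punctuation marks and closing guillemets
        text = re.sub(r" ([:;!?»])", NBSP + r"\1", text)
        # hard space after opening guillemets
        text = re.sub(r"« ", "«" + NBSP, text)
        # hard space as thousands separator (e.g. "10 000")
        text = re.sub(r"(?<=\d) (?=\d{3}\b)", NBSP, text)
        return text

    print(french_hard_spaces("Voulez-vous « un exemple » ? Il y a 10 000 cas."))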

Thanks a lot, @guillaumekln. I will try your suggestions.