Unicode normalization type

Hello there.

While preparing dataset pairs, I ran into a question about Unicode normalization of UTF-8 text. Which normalization form is preferred for a tokenizer (like sentencesplitter): NFC, NFD, or something else? Maybe there is some research on this.



Did you mean SentencePiece?

Note that SentencePiece already applies NFKC normalization by default: sentencepiece/normalization.md at master · google/sentencepiece · GitHub
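
In case it helps to see how the forms differ, here is a minimal sketch using Python's standard `unicodedata` module. It only illustrates the four standard normalization forms and is not SentencePiece's own implementation; the helper name `normalize_all` and the sample string are made up for the example.

```python
import unicodedata

def normalize_all(text):
    """Return the text under each of the four standard Unicode normalization forms."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, text) for form in forms}

# Sample (hypothetical): precomposed "é" (U+00E9) followed by the "ﬁ" ligature (U+FB01)
sample = "\u00e9\ufb01"
forms = normalize_all(sample)

# Print the code points so that visually identical strings can be told apart
for name, value in forms.items():
    print(name, [hex(ord(ch)) for ch in value])
```

NFC and NFD only compose or decompose canonically equivalent sequences, so the ligature survives both. NFKC and NFKD also apply compatibility mappings, so the ligature becomes plain "fi". That folding of compatibility variants is lossy, which is worth keeping in mind if the exact original characters matter in your data.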



Thanks for the answer.