Unicode normalization type

relyer · October 10, 2022, 2:08pm

Hello there.

A question came up while I was preparing paired datasets: which Unicode normalization form is preferred for a tokenizer (like sentencesplitter)? NFC, NFD, or something else? Maybe there is some research on this.

Thanks.

guillaumekln · October 11, 2022, 7:59am

Hi,

Did you mean SentencePiece?

Note that SentencePiece already applies NFKC normalization by default, so you normally do not need to normalize the text yourself beforehand. See normalization.md in the google/sentencepiece repository on GitHub (master branch).
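To make the difference between the forms concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It does not call SentencePiece; the helper name `normalize_forms` and the sample string are made up for illustration.

```python
import unicodedata

def normalize_forms(text):
    """Return the text under each of the four standard Unicode normalization forms."""
    # normalize_forms is a hypothetical helper for this illustration only.
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {form: unicodedata.normalize(form, text) for form in forms}

# Sample (an assumption, chosen to show the differences):
#   U+FB01 LATIN SMALL LIGATURE FI, U+00E9 precomposed e-acute, U+FF11 FULLWIDTH DIGIT ONE
sample = "\ufb01\u00e9\uff11"

results = normalize_forms(sample)
for form, value in results.items():
    # Print the code points so that visually identical strings can be told apart.
    print(form, [f"U+{ord(ch):04X}" for ch in value])
```

NFC and NFD only compose or decompose accents, so the ligature and the fullwidth digit survive. NFKC and NFKD also fold compatibility characters, so the ligature becomes plain "fi" and the fullwidth digit becomes "1". That folding is usually what you want before building a subword vocabulary, since otherwise visually equivalent strings end up as different tokens.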

relyer · October 12, 2022, 9:46am

Sure.

Thanks for the answer.