How Much Does Tokenization Affect Neural Machine Translation?

Hi everyone, I came across this paper today, which shows a large increase in BLEU score when using the Moses tokenizer (and MeCab for Japanese) over SentencePiece or the built-in OpenNMT implementation.

This was surprising to me as I didn’t think it would make such a big difference. Has anyone here benchmarked the various tokenizers?

I hope you find the paper informative.


You can find other comparisons here:

Contrary to the paper, this page shows little difference between the tokenization methods. In particular, you could check the second experiment, where “BPE(MosesPretok)” should correspond to the “Moses tokenizer” in the paper, and “Unigram(WsPretok)” to “SentencePiece”.
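For anyone unclear on what BPE-style subword segmentation (as in “BPE(MosesPretok)”) actually does compared with a rule-based tokenizer like Moses, here is a toy sketch of the BPE merge-learning loop in plain Python. The corpus, the end-of-word marker `_`, and the merge count are invented for illustration; real systems use the `subword-nmt` or `sentencepiece` implementations rather than anything like this.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn byte-pair-encoding merges on a toy corpus.

    `words` maps each word (a tuple of symbols) to its frequency.
    Returns the learned merge list and the final segmented vocab.
    """
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the vocab.
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges, vocab

# Made-up whitespace-pre-tokenized corpus, with "_" marking end of word.
corpus = {
    tuple("low") + ("_",): 5,
    tuple("lower") + ("_",): 2,
    tuple("newest") + ("_",): 6,
    tuple("widest") + ("_",): 3,
}
merges, vocab = bpe_merges(corpus, num_merges=10)
print(merges[:3])  # the most frequent pairs get merged first, e.g. ('e', 's')
```

The key contrast: Moses applies fixed, language-specific rules (splitting punctuation, handling clitics), while BPE/Unigram learn the segmentation from corpus statistics, so the two can produce quite different units for the same sentence.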