How Much Does Tokenization Affect Neural Machine Translation?

JptoEn · June 20, 2021, 9:58pm

Hi everyone, I came across this paper today, which shows a huge increase in BLEU score when using the Moses tokenizer (and MeCab for Japanese) over SentencePiece or the built-in OpenNMT implementation.

This was surprising to me as I didn’t think it would make such a big difference. Has anyone here benchmarked the various tokenizers?

I hope this paper is informative to you.

guillaumekln · June 21, 2021, 5:40pm

Hi,

You can find other comparisons here:

github.com/google/sentencepiece/blob/master/doc/experiments.md

# SentencePiece Experiments

## Experiments 1 (subword vs word-based model)
### Experimental settings

*   Segmentation algorithms:
    *   **SentencePiece**: SentencePiece with a language-model based segmentation. (`--model_type=unigram`)
    *   **SentencePiece(BPE)**: SentencePiece with Byte Pair Encoding. [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] (`--model_type=bpe`)
    *   **Moses**: [Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) for English.
    *   **KyTea**: [KyTea](http://www.phontron.com/kytea/) for Japanese.
    *   **MeCab**: [MeCab](http://taku910.github.io/mecab/) for Japanese.
    *   **neologd**: [MeCab with neologd](https://github.com/neologd/mecab-ipadic-neologd) for Japanese.
    *   **(Moses/KyTea)+SentencePiece**: Apply SentencePiece (Unigram) to pre-tokenized sentences. We have several variants with different tokenizers, e.g., **(Moses/MeCab)+SentencePiece**, **(MeCab/Moses)+SentencePiece**.
    *   **char**: Segments sentences by characters.

*   Data sets:
    *   [KFTT](http://www.phontron.com/kftt/index.html)

*   NMT parameters: ([Google’s Neural Machine Translation System](https://arxiv.org/pdf/1609.08144.pdf) is applied for all experiments.)
    *   Dropout prob: 0.2

Contrary to the paper, this page shows little difference between the tokenization methods. In particular, you can check the second experiment, where “BPE(MosesPretok)” should correspond to the “Moses tokenizer” in the paper, and “Unigram(WsPretok)” to “SentencePiece”.
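
To make labels such as “Unigram(WsPretok)” and “BPE(MosesPretok)” more concrete, here is a minimal sketch of what the pre-tokenization step changes before any subword model is trained. It uses only the Python standard library. The function names are hypothetical, the punctuation-splitting regex is only a rough stand-in for a rule-based tokenizer (it is not the Moses rule set), and dropping spaces in the character segmentation is an assumption on my part, not something the experiments page specifies.

```python
import re
from typing import List


def whitespace_pretokenize(sentence: str) -> List[str]:
    """Split on runs of whitespace only (the idea behind a whitespace pre-tokenization)."""
    return sentence.split()


def punctuation_pretokenize(sentence: str) -> List[str]:
    """Rough stand-in for a rule-based tokenizer: also split punctuation off words.

    Illustration only; the real Moses tokenizer applies many more rules.
    """
    return re.findall(r"\w+|[^\w\s]", sentence)


def char_segment(sentence: str) -> List[str]:
    """Segment by characters, dropping whitespace (an assumption for this sketch)."""
    return [ch for ch in sentence if not ch.isspace()]


sentence = "Hello, world!"
ws_tokens = whitespace_pretokenize(sentence)
punct_tokens = punctuation_pretokenize(sentence)
char_tokens = char_segment(sentence)

print(ws_tokens)         # ['Hello,', 'world!']
print(punct_tokens)      # ['Hello', ',', 'world', '!']
print(len(char_tokens))  # 12
```

The same sentence yields 2, 4, or 12 units depending on the segmentation, so the subword model trained afterwards sees different input in each setting.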