SentencePiece vs. BPE

Through the new hook mechanism:

I have added a hook for Google’s SentencePiece. Integration with the training and inference workflow should normally be seamless; see the documentation here.

SentencePiece is an alternative tokenisation model that works at the sentence level, described here:

The author reports interesting results: competitive with BPE for several languages, especially Chinese, Japanese, and Korean. This tokenisation scheme can be combined with the normal OpenNMT tokenisation options, including BPE…
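
For anyone wondering why a sentence-level scheme suits languages without word boundaries, here is a minimal sketch of the whitespace handling only. It assumes the documented SentencePiece convention of escaping spaces with the meta symbol "▁" (U+2581). The fixed-length chunking in `to_pieces` is a hypothetical stand-in for a trained model, not the real segmentation algorithm, and neither function is part of the SentencePiece or OpenNMT API.

```python
# Illustrative sketch only: mimics SentencePiece's whitespace escaping,
# not its learned segmentation model.
WS = "\u2581"  # the meta symbol SentencePiece uses to mark spaces


def to_pieces(sentence, piece_len=3):
    """Hypothetical stand-in for a trained model.

    Escapes spaces with the meta symbol (adding one at the start),
    then cuts the raw string into fixed-size chunks.
    """
    escaped = WS + sentence.replace(" ", WS)
    return [escaped[i:i + piece_len] for i in range(0, len(escaped), piece_len)]


def from_pieces(pieces):
    """Detokenise: join the pieces, restore spaces, drop the leading one."""
    return "".join(pieces).replace(WS, " ")[1:]


if __name__ == "__main__":
    # No pre-tokenisation step is needed, with or without spaces in the input.
    for text in ["Hello world.", "こんにちは世界"]:
        pieces = to_pieces(text)
        assert from_pieces(pieces) == text  # round trip is lossless
        print(pieces)
```

Because spaces are kept as ordinary symbols inside the pieces, detokenisation is just a join and a replace, whatever the language.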

Please share if you have interesting results.
