Could I submit a pull request for a tokenization hook for Korean and Japanese?


(Jeongwon Hwang) #1

We changed the tokenizer on the same data and our BLEU score increased significantly.

We are working on Korean (ko) and English (en) NMT and found a quite remarkable result.
We used a corpus of 500,000 en-ko sentence pairs and trained a dbrnn model.

Using OpenNMT tokenization only:

en2ko : 17.77   +/-5.03 BLEU = 18.33, 55.6/27.2/13.2/6.0 (BP=0.985, ratio=0.985, hyp_len=783, ref_len=795)
ko2en : 23.82   +/-3.87 BLEU = 23.74, 56.7/31.7/18.0/9.8 (BP=1.000, ratio=1.018, hyp_len=1012, ref_len=994)

Using Mecab-ko tokenization after OpenNMT tokenization:

en2ko : 29.27   +/-4.80 BLEU = 29.35, 66.0/37.1/22.1/13.7 (BP=1.000, ratio=1.008, hyp_len=839, ref_len=832)
ko2en : 41.07   +/-6.81 BLEU = 41.66, 72.6/50.2/34.3/24.1 (BP=1.000, ratio=1.011, hyp_len=1048, ref_len=1037)

Since MeCab is originally a Japanese tokenizer, we expect a similar effect when it is used for Japanese.
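
For anyone who wants to try this, here is a minimal sketch of the pre-tokenization step in Python. This is not the exact hook from the PR; it assumes mecab-python3 is installed with a built mecab-ko-dic dictionary, and the dictionary path is only an example:

```python
# Minimal sketch: segment Korean text into morphemes with MeCab (mecab-ko-dic)
# before passing it to the OpenNMT tokenizer.
# Assumption: mecab-python3 is installed and mecab-ko-dic is built;
# the dictionary path below is an example and must be adjusted to your system.
import MeCab

# -Owakati makes MeCab output space-separated morphemes only.
tagger = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ko-dic -Owakati")

def pretokenize(line: str) -> str:
    """Return the line split into morphemes, separated by single spaces."""
    return tagger.parse(line).strip()

if __name__ == "__main__":
    # Example sentence; the exact segmentation depends on the dictionary version.
    print(pretokenize("나는 학교에 간다"))
```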

Could I submit the pull request?


(Guillaume Klein) #2

Thanks for the PR! The review will happen on GitHub:


(jean.senellart) #3

Thanks a lot. To extend this experiment, it would be interesting to also compare with standard subword tokenization (BPE or SentencePiece), which is also integrated. Is your corpus open, so that we can run other comparative trainings?
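
For reference, a SentencePiece/BPE comparison run could look roughly like the sketch below (file names, vocabulary size, and the example sentence are placeholders, not the setup from this thread):

```python
# Rough sketch of a SentencePiece BPE baseline for the Korean side.
# File names and vocab_size are placeholders, not the actual experiment settings.
import sentencepiece as spm

# Train a BPE model on the raw Korean training data (one sentence per line).
spm.SentencePieceTrainer.train(
    input="train.ko",
    model_prefix="ko_bpe",
    vocab_size=32000,
    model_type="bpe",
)

# Apply the model to produce subword-segmented text for NMT training.
sp = spm.SentencePieceProcessor(model_file="ko_bpe.model")
print(sp.encode("나는 학교에 간다", out_type=str))
```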


(Jeongwon Hwang) #4

Sorry, my boss does not allow us to release the corpus.
We have already tried these, but they did not have a big effect.
But if you want other trainings, we will run them and release the models.

Using BPE suffix mode for ko and en:

ko2en : 26.16	+/-6.02	BLEU = 26.05, 58.4/33.2/20.0/11.9 (BP=1.000, ratio=1.016, hyp_len=1135, ref_len=1117)