Could I submit a pull request for a tokenization hook for Korean and Japanese?

We changed the tokenizer on the same data and the BLEU score increased significantly.

We are studying Korean (ko) and English (en) NMT and found a remarkable improvement.
We use a 500,000 sentence pair en-ko corpus and train a dbrnn model.

Using OpenNMT tokenization only

en2ko : 17.77   +/-5.03 BLEU = 18.33, 55.6/27.2/13.2/6.0 (BP=0.985, ratio=0.985, hyp_len=783, ref_len=795)
ko2en : 23.82   +/-3.87 BLEU = 23.74, 56.7/31.7/18.0/9.8 (BP=1.000, ratio=1.018, hyp_len=1012, ref_len=994)

Using Mecab-ko tokenization after OpenNMT tokenization

en2ko : 29.27   +/-4.80 BLEU = 29.35, 66.0/37.1/22.1/13.7 (BP=1.000, ratio=1.008, hyp_len=839, ref_len=832)
ko2en : 41.07   +/-6.81 BLEU = 41.66, 72.6/50.2/34.3/24.1 (BP=1.000, ratio=1.011, hyp_len=1048, ref_len=1037)
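For reference, this is roughly the kind of extra segmentation step we mean: running the tokenized Korean side through MeCab with the mecab-ko-dic dictionary in wakati (space-separated morpheme) mode. This is only a minimal sketch; the dictionary path and the corpus filenames are assumptions that depend on the local installation, not the exact script we used.

```python
import MeCab

# mecab-ko with the mecab-ko-dic dictionary; the -d path is an assumption
# and depends on where the dictionary is installed locally.
tagger = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ko-dic -Owakati")

def segment(line):
    # -Owakati returns the morphemes of the line joined by single spaces.
    return tagger.parse(line).strip()

# Hypothetical filenames: re-segment an already OpenNMT-tokenized corpus.
with open("train.ko.tok") as src, open("train.ko.tok.mecab", "w") as out:
    for line in src:
        out.write(segment(line) + "\n")
```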

Since MeCab is originally a Japanese tokenizer, we expect a similar improvement when it is used for Japanese.

Could I submit the pull request?


Thanks for the PR! The review will happen on GitHub:

Thanks a lot. To extend this experiment, it would be interesting to also compare with standard subword tokenization, BPE or SentencePiece, which are also integrated. Is your corpus open, so that we can run other comparative trainings?

Sorry, my boss did not allow us to open the corpus.
We have already run those trainings, but they did not have a big effect.
But if you want other trainings, we will run them and release the models.

Using BPE suffix mode for ko and en

ko2en : 26.16	+/-6.02	BLEU = 26.05, 58.4/33.2/20.0/11.9 (BP=1.000, ratio=1.016, hyp_len=1135, ref_len=1117)
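For the SentencePiece side of the comparison suggested above, a minimal sketch of training and applying a BPE model is shown below. The filenames, vocabulary size, and example sentence are illustrative assumptions, not the exact setup behind the numbers above.

```python
import sentencepiece as spm

# Train a BPE SentencePiece model on the raw Korean side of the corpus.
# "train.ko", the model prefix, and the vocab size are assumptions.
spm.SentencePieceTrainer.train(
    input="train.ko",
    model_prefix="spm_ko",
    vocab_size=32000,
    model_type="bpe",
)

# Load the trained model and segment a sample sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="spm_ko.model")
print(sp.encode("예문입니다.", out_type=str))
```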