We changed the tokenizer on the same data for our Korean (ko) ↔ English (en) NMT experiments, and the BLEU score increased significantly. The improvement is quite remarkable.
We used a corpus of 500,000 en-ko sentence pairs and trained a dbrnn model.
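For reference, here is a minimal sketch of that training setup, assuming the Lua OpenNMT preprocess.lua/train.lua scripts; the file names (train.en, train.ko, etc.) are placeholders, not our actual paths:

```
# Build vocabularies and training tensors from the tokenized 500k pairs.
th preprocess.lua -train_src train.en -train_tgt train.ko \
                  -valid_src valid.en -valid_tgt valid.ko \
                  -save_data data/enko

# Train with a deep bidirectional RNN encoder.
th train.lua -data data/enko-train.t7 -save_model enko_dbrnn \
             -encoder_type dbrnn
```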
With OpenNMT tokenization only:
```
en2ko : 17.77 +/-5.03 BLEU = 18.33, 55.6/27.2/13.2/6.0 (BP=0.985, ratio=0.985, hyp_len=783, ref_len=795)
ko2en : 23.82 +/-3.87 BLEU = 23.74, 56.7/31.7/18.0/9.8 (BP=1.000, ratio=1.018, hyp_len=1012, ref_len=994)
```
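The OpenNMT tokenization step was done with tools/tokenize.lua; a sketch follows, where conservative mode and joiner annotation are our assumption rather than confirmed settings:

```
# Apply OpenNMT tokenization to each side of the corpus.
th tools/tokenize.lua -mode conservative -joiner_annotate true \
    < train.raw.en > train.tok.en
th tools/tokenize.lua -mode conservative -joiner_annotate true \
    < train.raw.ko > train.tok.ko
```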
With Mecab-ko tokenization applied after OpenNMT tokenization:
```
en2ko : 29.27 +/-4.80 BLEU = 29.35, 66.0/37.1/22.1/13.7 (BP=1.000, ratio=1.008, hyp_len=839, ref_len=832)
ko2en : 41.07 +/-6.81 BLEU = 41.66, 72.6/50.2/34.3/24.1 (BP=1.000, ratio=1.011, hyp_len=1048, ref_len=1037)
```
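The extra Korean morpheme segmentation looks roughly like this; mecab-ko keeps MeCab's command-line interface, and the dictionary path below is a placeholder for wherever mecab-ko-dic is installed:

```
# Re-segment the Korean side into morphemes with mecab-ko.
# -O wakati prints space-separated surface forms only.
mecab -d /usr/local/lib/mecab/dic/mecab-ko-dic -O wakati \
    < train.tok.ko > train.tok.mecab.ko
```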
Since MeCab was originally developed as a Japanese tokenizer, we expect a similar effect when it is applied to Japanese.
Could I submit a pull request?