Difference between BPE and character-level tokenization for BLEU score

Hi, I have a question. I'm studying an English-to-Korean NMT model and tested several tokenization methods, for example BPE 16K (English) / character tokens (Korean) and BPE 16K (English) / BPE 16K (Korean). The best BLEU score came from using character-level tokenization on the Korean (target) side, so I concluded that character-level tokenization is best for Korean. However, the outputs of the two models looked similar, and the only difference was the tokenization method. So I decoded the output of the BPE 16K (English) / BPE 16K (Korean) model, re-tokenized it at the character level, and evaluated BLEU against the Korean test data tokenized at the character level. The BLEU score then improved by about 10 points. Why does this happen? Did I miscalculate the BLEU score?

Snippet of character-level tokens (Korean):

아 까 해 운 대 에 서 찍 은 사 진 좀 보 내 줄 수 있 어 ?
또 올 해 복 학 하 기 전 에 보 라 카 이 를 다 녀 왔 어 요 .
인 터 넷 으 로 인 해 우 리 의 삶 은 많 이 변 했 어 요 .

Snippet of BPE 16K tokens (Korean):

▁아까 ▁해운대 에서 ▁찍은 ▁사진 ▁좀 ▁보내줄 ▁수 ▁있어 ?
▁또 ▁올해 ▁복 학 하기 ▁전에 ▁보라카이 를 ▁다녀왔어요 .
▁인터넷으로 ▁인해 ▁우리의 ▁삶은 ▁많이 ▁변 했어요 .
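
For reference, the re-tokenization step described above (decode the BPE output, then split it into characters) could look roughly like the sketch below. It assumes SentencePiece-style pieces where "▁" marks the start of a word, and that the character-level format drops the original spaces, as in the first snippet. The function names `pieces_to_text` and `to_char_tokens` are made up for illustration and are not part of any library.

```python
def pieces_to_text(pieces: str) -> str:
    """Undo BPE segmentation: join the pieces, then turn "▁" back into spaces."""
    return "".join(pieces.split()).replace("▁", " ").strip()


def to_char_tokens(text: str) -> str:
    """Emit one token per character, dropping whitespace (assumed format)."""
    return " ".join(ch for ch in text if not ch.isspace())


bpe_line = "▁아까 ▁해운대 에서 ▁찍은 ▁사진 ▁좀 ▁보내줄 ▁수 ▁있어 ?"
plain = pieces_to_text(bpe_line)
chars = to_char_tokens(plain)
print(plain)
print(chars)
```

Applied to the first BPE line, this reproduces the first line of the character-level snippet.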


This thread seems relevant:

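TL;DR: you should not compare BLEU scores computed on different tokenizations. BLEU counts matching n-grams of tokens, so changing the token unit changes what counts as a match, and character n-grams are much easier to match than BPE n-grams.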
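
To make this concrete, here is a small sketch that scores the same hypothesis against the same reference twice, once on BPE pieces and once on characters. The `sentence_bleu` function is a simplified, unsmoothed single-sentence BLEU written for illustration only (it is not the script used in the question), and the hypothesis sentence is invented: it differs from the reference by one piece.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp, ref, max_n=4):
    """Simplified BLEU: clipped n-gram precisions (n=1..max_n) times brevity penalty."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        total = sum(h.values())
        matched = sum(min(count, r[gram]) for gram, count in h.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_prec += math.log(matched / total) / max_n
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)


def to_chars(pieces):
    """Join BPE pieces and split into characters, dropping the "▁" word markers."""
    return [ch for ch in "".join(pieces) if ch != "▁"]


ref_pieces = "▁아까 ▁해운대 에서 ▁찍은 ▁사진 ▁좀 ▁보내줄 ▁수 ▁있어 ?".split()
hyp_pieces = "▁아까 ▁해운대 에서 ▁찍은 ▁사진 을 ▁보내줄 ▁수 ▁있어 ?".split()  # invented output

piece_score = sentence_bleu(hyp_pieces, ref_pieces)
char_score = sentence_bleu(to_chars(hyp_pieces), to_chars(ref_pieces))
print(f"BLEU on BPE pieces: {piece_score:.3f}")
print(f"BLEU on characters: {char_score:.3f}")
```

The translation is identical in both cases, yet the character-level score comes out higher, because a single wrong piece spoils a smaller share of the character n-grams. A jump in BLEU after re-tokenizing therefore does not by itself mean the score was miscalculated; the two numbers are just on different scales.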

Thank you!!!