BPE for Chinese characters?

My Chinese sentences have already been segmented with spaces.

I tried to use BPE with Chinese, but when I ran learn_bpe.lua, only the English words were encoded; the Chinese characters were left unchanged.

Thanks!

I have also tried Google's SentencePiece; the results are the same as with learn_bpe.lua.

Hello @netxiao

I am running a first experiment on English to Chinese.

Why are you trying BPE on already segmented text?

Do you expect some gain from using subwords?

By the way, if you have an existing experiment translating into Chinese, what kind of PPL (perplexity) do you get at convergence?

Thanks

My last model's PPL is 5.89; I used BPE for the first time. :grinning:

In my last test, BPE with Chinese worked fine.

Dear @netxiao

I would suggest using character-based encoding on the Chinese side.
You can apply BPE to the English side, but for Chinese I think it is better to limit the dictionary.

Cheers,
Dimitar
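
For readers who want to try the character-based suggestion, here is a minimal Python sketch of one way to do it. It is not part of OpenNMT or learn_bpe.lua; the function name to_char_level is hypothetical, and the sketch assumes that only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) needs splitting, so extension blocks and CJK punctuation are left as they are. Non-Chinese tokens such as embedded English words are kept whole.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block is treated as Chinese.
# Extension blocks and CJK punctuation are not covered by this pattern.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")


def to_char_level(line: str) -> str:
    """Put every CJK ideograph in its own token; keep other runs of text whole."""
    out = []
    for token in line.split():
        buf = ""  # accumulates a run of non-CJK characters inside the token
        for ch in token:
            if CJK_CHAR.match(ch):
                if buf:
                    out.append(buf)
                    buf = ""
                out.append(ch)
            else:
                buf += ch
        if buf:
            out.append(buf)
    return " ".join(out)


if __name__ == "__main__":
    # Word-segmented input becomes one character per token.
    print(to_char_level("我 喜欢 机器 翻译"))
    # Latin-script tokens are left intact.
    print(to_char_level("使用 BPE 模型"))
```

Applied to the Chinese side of the corpus before building the vocabulary, this keeps the dictionary limited to the set of distinct characters. Note that the original word boundaries are discarded, so keep the segmented file if you still need them.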

Thanks for your answer. You are right; I will test character-based encoding on the Chinese side.