English to Chinese training

(LM) #1


I have tried several ways to tokenize Chinese content during preprocessing. However, because I used a space to separate Chinese tokens, my translated file contains spaces. I'd like to remove the spaces that were inserted during tokenization.

I didn’t find sufficient information on how to use detokenize.lua.

Does anyone have any solution?



(Eva) #2

Hi Lily,
you can tokenize using a joiner marker '@@' in addition to the space ' '; that way your translation model will produce '@@' at the boundaries where spaces were inserted, and you will be able to use detokenize.lua with the -joiner @@ option to reconstruct your output.

Note that you can use '@@' as the joiner token, or any other token that is not seen in your data; for instance, you can use the tokenizer/detokenizer default marker '■'.
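To make the joiner idea concrete, here is a minimal Python sketch of the reconstruction logic (this is an illustration of the principle, not the actual detokenize.lua code): a token carrying the joiner marker on one side is glued to its neighbor on that side, while unmarked tokens keep a real space between them.

```python
def detokenize(tokens, joiner="@@"):
    """Rejoin tokens, removing spaces wherever the joiner marker appears."""
    out = []
    glue_next = False  # True if the previous token ended with the joiner
    for tok in tokens:
        glue_prev = tok.startswith(joiner)   # attach to the previous token
        glue_after = tok.endswith(joiner)    # attach to the next token
        core = tok
        if glue_prev:
            core = core[len(joiner):]
        if glue_after:
            core = core[:-len(joiner)]
        if out and not (glue_prev or glue_next):
            out.append(" ")  # a genuine word boundary
        out.append(core)
        glue_next = glue_after
    return "".join(out)

# Chinese characters tokenized with the joiner carry no real spaces:
print(detokenize(["你@@", "好@@", "世@@", "界"]))   # -> 你好世界
# A prefixed joiner glues punctuation to the word before it:
print(detokenize(["Hello", "@@,", "world"]))        # -> Hello, world
```

This is why the marker must be a token that never occurs naturally in your data: the detokenizer treats every occurrence as a glue instruction.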

(LM) #3

Hi Eva,

Thank you so much for your help.

It seems that if I use OpenNMT's tokenizer, the detokenizer works fine.

I found that the tokenizer from jieba (https://github.com/fxsjy/jieba) works better than the one offered in OpenNMT, so I used jieba to tokenize my input first. However, OpenNMT's detokenizer didn't work on the files tokenized with jieba.

Do you know why this is the case?


(Eva) #4

Hi Lily,

the issue is that you must detokenize with the same tool you used to tokenize.
If you tokenize with jieba, the OpenNMT detokenizer will not work properly, because it expects a tokenization produced by the OpenNMT tokenizer (including its joiner markers). The same will happen the other way round.

In your case, you should use the jieba tokenizer to tokenize and to detokenize.
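Since jieba emits plain space-separated tokens with no joiner markers, detokenize.lua has no way to know which spaces it should remove. For purely Chinese text, though, reversing jieba's segmentation is trivial, because written Chinese uses no spaces between words; a minimal sketch (an assumption for Chinese-only lines, not a jieba API):

```python
def detokenize_jieba(line):
    """Undo jieba-style space segmentation for Chinese-only text."""
    return line.replace(" ", "")

# jieba's README example: jieba.cut("我来到北京清华大学") in precise mode,
# joined with spaces, gives the segmented line below.
segmented = "我 来到 北京 清华大学"
print(detokenize_jieba(segmented))  # -> 我来到北京清华大学
```

Note the caveat: if your lines mix in Latin words or numbers, blindly deleting every space will also merge those, so a real pipeline would need a rule that keeps spaces between non-CJK tokens.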