English to Chinese training

Hi,

I have tried several ways to tokenize Chinese content during preprocessing. However, because I used spaces to separate the Chinese tokens, my translated file contains spaces. I'd like to remove the spaces that were inserted during tokenization.

I didn’t find sufficient information on how to use detokenize.lua.

Does anyone have any solution?

Thanks,

Lily

Hi Lily,
You can tokenize using '@@ ' instead of only a space ' '. That way your translation model will produce '@@ ' instead of plain spaces, and you will be able to use detokenize.lua with the -joiner @@ option to reconstruct your output.

Note that you can use '@@' as the separator token, or any other token that does not appear in your data; for instance, you can use the tokenizer/detokenizer default marker '■'.
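
To make the idea concrete, here is a toy Python sketch of the joiner mechanism. This is only an illustration of the concept, not OpenNMT's actual implementation; the JOINER value and the function names are made up for the example.

```python
# Toy illustration of the joiner idea, not OpenNMT's actual code.
JOINER = "@@"  # any marker that never occurs in your data, e.g. the default "■"

def annotate(tokens):
    # Every boundary here was inserted by tokenization (Chinese has no spaces),
    # so mark each one with the joiner so it can be undone later.
    return [tok + JOINER for tok in tokens[:-1]] + tokens[-1:]

def detokenize(annotated_line):
    out = []
    for tok in annotated_line.split():
        if tok.endswith(JOINER):
            out.append(tok[: -len(JOINER)])  # artificial boundary: no space
        else:
            out.append(tok + " ")            # genuine space: keep it
    return "".join(out).strip()

tokens = ["今天", "天气", "很好"]
line = " ".join(annotate(tokens))   # "今天@@ 天气@@ 很好"
print(detokenize(line))             # "今天天气很好"
```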


Hi Eva,

Thank you so much for your help.

It seems that if I use OpenNMT's tokenizer, the detokenizer works fine.

I found that the tokenizer from jieba (https://github.com/fxsjy/jieba) works better than the ones offered in OpenNMT. So what I did was use jieba to tokenize my input content first. However, OpenNMT's detokenizer didn't work on the files I tokenized with jieba.
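
For reference, this is roughly what I run (the file names are placeholders, and the segmentation you get depends on jieba's dictionary):

```python
import jieba

with open("input.zh", encoding="utf-8") as fin, \
     open("input.zh.tok", "w", encoding="utf-8") as fout:
    for line in fin:
        # jieba.cut returns a generator of word segments;
        # joining them with spaces gives space-separated tokens
        fout.write(" ".join(jieba.cut(line.strip())) + "\n")
```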

Do you know why this is the case?

Thanks!

Hi Lily,

The thing is that you should use the same tool to tokenize and to detokenize.
If you tokenize with jieba, the OpenNMT detokenizer will not work properly because it expects a tokenization produced by the OpenNMT tokenizer, and the same happens the other way round.

In your case, since you tokenize with jieba, you should also handle the detokenization in a way that matches jieba's output (see the sketch below).
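
As far as I know, jieba itself does not ship a detokenizer, but since the spaces in Chinese output are all artificial, a simple post-processing step can remove them, keeping a space only between adjacent ASCII tokens such as numbers or English words. A rough sketch (the regex and the function name are just an example):

```python
import re

ASCII_TOKEN = re.compile(r"^[A-Za-z0-9]+$")

def detokenize_zh(line):
    tokens = line.split()
    out = []
    for i, tok in enumerate(tokens):
        # keep a space only between two consecutive ASCII tokens
        # (e.g. "OpenNMT 0.9"); glue Chinese tokens back together
        if i > 0 and ASCII_TOKEN.match(tokens[i - 1]) and ASCII_TOKEN.match(tok):
            out.append(" ")
        out.append(tok)
    return "".join(out)

print(detokenize_zh("今天 天气 很好"))  # -> 今天天气很好
```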

Eva