Chinese and English translation

Hello everyone,

I am currently trying to use OpenNMT to translate between English and Chinese, but I am running into many problems.

My setup:

- Chinese corpus: 2,000 word-segmented sentences (for example: 今年 是 维多利亚 的 秘密 时尚 秀 10 周年纪念 而 如今 它 却 有着 20 亿 的 观众 收看 还有 模特 的 数量)
- English corpus: the corresponding 2,000 sentences (for example: It is the 10th anniversary of the Victoria ’ s Secret Fashion Show . but now it ’ s been watched by more than 2 billion people the numbers of models we got .)
- Validation data: two files of 200 sentences each, drawn from the Chinese and English corpora respectively.

I then followed the OpenNMT tutorial: preprocessing > training > translation. Finally, I took 30 sentences from the Chinese corpus as test data. The results are very bad: only a few isolated words are recognized, and words repeat over and over. Thank you for any guidance and ideas.

Torch >>>>> translation log:
[04/03/18 10:05:33 INFO] SENT 232: 浣??搴 浣姣 ?浜瑙?濂?
[04/03/18 10:05:33 INFO] PRED 232: I watched watched watched .
[04/03/18 10:05:33 INFO] PRED SCORE: -10.66
[04/03/18 10:05:33 INFO]
[04/03/18 10:05:33 INFO] Translated 1759 words, src unk count: 232, coverage: 13.1%, tgt words: 1373 words, tgt unk count: 0, coverage: 0%,
[04/03/18 10:05:33 INFO] PRED AVG SCORE: -2.30, PRED PPL: 9.94

PyTorch >>>>> translation log:
root@iZj6c3p7zpoi634e71k0cuZ:/workspace/OpenNMT-py# python -model …/zh-cn-py/ -src data/src-test.txt -output pred.txt -replace_unk -verbose
You are using pip version 9.0.1, however version 9.0.3 is available.
You should consider upgrading via the ‘pip install --upgrade pip’ command.
Loading model parameters.
average src size 8.666666666666666 9
/workspace/OpenNMT-py/onmt/modules/ UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
align_vectors = *targetL, sourceL))
/root/python/lib/python3.6/site-packages/torch/nn/modules/ UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
input = module(input)

SENT 5: ('浣?, ‘寤’, '彖?, ‘浠’, ‘锛’, '浣?, ‘珑’, ‘浠’, '姝?, '?')
PRED 5: I watched watched watched . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
PRED SCORE: -3.2823

PRED AVG SCORE: -0.5602, PRED PPL: 1.7510

Traceback (most recent call last):
  File "", line 152, in
  File "", line 137, in main
    _report_score('PRED', pred_score_total, pred_words_total)
  File "", line 30, in _report_score
    name, score_total / words_total,
ZeroDivisionError: float division by zero

Hi @MrWanghui,

One of your main problems is that 2,000 sentences is far too little training data to build a good translation model.

Also, you have to be careful with the encoding and segmentation of Chinese. What do you use as the translation unit: words, BPE tokens, or characters? Changing that can help you obtain better results too.
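For example, character-level units sidestep Chinese word-segmentation errors entirely, at the cost of longer sequences. A minimal Python sketch of that fallback (the `char_tokenize` helper here is just for illustration, not part of OpenNMT):

```python
def char_tokenize(sentence: str) -> str:
    """Split a sentence into space-separated characters.

    Existing spaces (e.g. from a previous word segmentation step)
    are dropped, so pre-segmented and raw text come out the same.
    """
    return " ".join(ch for ch in sentence if not ch.isspace())

# Word-segmented input, as in the corpus above:
print(char_tokenize("今年 是 维多利亚 的 秘密 时尚 秀"))
# → 今 年 是 维 多 利 亚 的 秘 密 时 尚 秀
```

Note that multi-character tokens such as the number 10 also get split into single characters, which may or may not be what you want.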

Good luck!

1 Like

@emartinezVic Thank you very much for your advice. To verify this, I repeated the bilingual experiment with just 3 sentence pairs, training 500 times; in the end the model reproduced the input sentences satisfactorily. This confirms that my problem was simply not enough training data.

I wish you a happy life and a smooth work!

1 Like

Looks like some of the Chinese characters on your platform are broken/corrupt… As emartinezVic suggested, there could be something wrong with the encoding of these characters. And yes, 2K sentences is a very small corpus for building a translation model. I'm glad I got access to a sample corpus from Ace Chinese Translation; they were kind enough to support my project, as they believe my work will to some degree help improve their productivity.
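Incidentally, the specific garbage in the logs (浣, 锛, and friends) is the classic signature of UTF-8 bytes being decoded with a legacy Chinese codec. A minimal Python sketch that reproduces it (GBK is my guess at the codec involved; other legacy codecs give similar results):

```python
text = "你好"                        # well-formed source text
raw = text.encode("utf-8")           # b'\xe4\xbd\xa0\xe5\xa5\xbd'
garbled = raw.decode("gbk", errors="replace")
print(garbled)
# The first byte pair e4 bd decodes to 浣 -- the same character
# that keeps appearing in the logs above.
```

If that matches what you see, making sure the corpus and test files are read and written as UTF-8 end to end should fix the display problem.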

Hi, I am currently working on translating English to Chinese using OpenNMT, and I face exactly the same problem as you. Could you please give some advice about the process?

I built a small dataset based on the quick-start dataset to confirm that OpenNMT works for me. My training corpus is 10,000 sentences and my evaluation corpus is 3,000 sentences. My only change is using the SentencePiece tokenizer instead of the default one, configured in the config file as well. But the results are not good: when I take some sentences from the training file and run them through my model, I again get repeated words, and the translated sentences are unrelated to the source.
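For reference, the SentencePiece part of my config looks roughly like this (field names follow the OpenNMT-py quick-start as far as I can tell; the paths are placeholders for my own files):

```yaml
# Assumed OpenNMT-py 2.x options -- check the quick-start docs
# for your version before copying.
data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
transforms: [sentencepiece]
src_subword_model: spm/src.model
tgt_subword_model: spm/tgt.model
```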

Thanks a lot,

10,000 sentences is probably too little data; ideally you would have 1,000,000+ sentences. OPUS can be a good source of data if you need more.