Chinese and English translation

pytorch

(wanghui) #1

Hi everyone,
I am currently trying to use OpenNMT to translate between English and Chinese, but I am running into many problems.
My Chinese corpus has 2,000 word-segmented sentences (for example: 今年 是 维多利亚 的 秘密 时尚 秀 10 周年纪念 而 如今 它 却 有着 20 亿 的 观众 收看 还有 模特 的 数量). The English corpus contains the corresponding 2,000 sentences (for example: It is the 10th anniversary of the Victoria ’ s Secret Fashion Show . but now it ’ s been watched by more than 2 billion people the numbers of models we got .). I also have two validation files of 200 sentences each, drawn from the Chinese and English corpora. I then followed the OpenNMT tutorial: preprocessing > training > translation. Finally, I took 30 sentences from the Chinese corpus as test data. The results are very bad: only a few single words are recognized, and words keep repeating. Thank you for any guidance and ideas.
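For reference, the preprocessing > training > translation pipeline from the tutorial looks roughly like this with the legacy OpenNMT-py scripts (file and model names here are placeholders, not the poster's actual files):

```shell
# Build vocabularies and binarized datasets from parallel, pre-tokenized files
python preprocess.py -train_src data/src-train.txt -train_tgt data/tgt-train.txt \
    -valid_src data/src-val.txt -valid_tgt data/tgt-val.txt -save_data data/demo

# Train a default seq2seq model on the preprocessed data
python train.py -data data/demo -save_model demo-model

# Translate held-out source sentences with a saved checkpoint
python translate.py -model demo-model_acc_XX.XX_ppl_XX.XX_eN.pt \
    -src data/src-test.txt -output pred.txt -replace_unk -verbose
```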

Torch >>>>> translation log:
[04/03/18 10:05:33 INFO] SENT 232: 浣??搴 浣姣 ?浜瑙?濂?
[04/03/18 10:05:33 INFO] PRED 232: I watched watched watched .
[04/03/18 10:05:33 INFO] PRED SCORE: -10.66
[04/03/18 10:05:33 INFO]
[04/03/18 10:05:33 INFO] Translated 1759 words, src unk count: 232, coverage: 13.1%, tgt words: 1373 words, tgt unk count: 0, coverage: 0%,
[04/03/18 10:05:33 INFO] PRED AVG SCORE: -2.30, PRED PPL: 9.94
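The garbled characters in the SENT lines above look like an encoding mismatch, e.g. UTF-8 bytes being rendered under a different codec such as GBK. A minimal sketch of how that kind of mojibake arises (the string and codec here are illustrative, not taken from the poster's setup):

```python
# UTF-8 bytes misread under a different codec produce mojibake like the log above
text = "今年 是 维多利亚"                       # a correctly encoded Chinese string
raw = text.encode("utf-8")                      # the bytes actually written to the file
garbled = raw.decode("gbk", errors="replace")   # what a GBK reader would display
print(garbled)                                  # unreadable characters, as in the SENT lines
```

If the corpus files and the terminal disagree on encoding, the model still trains on the real bytes, but every printed source sentence looks like the SENT lines above.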

PyTorch >>>>> translation log:
root@iZj6c3p7zpoi634e71k0cuZ:/workspace/OpenNMT-py# python translate.py -model …/zh-cn-py/myfile-model_acc_14.46_ppl_256.20_e6.pt -src data/src-test.txt -output pred.txt -replace_unk -verbose
You are using pip version 9.0.1, however version 9.0.3 is available.
You should consider upgrading via the ‘pip install --upgrade pip’ command.
Loading model parameters.
average src size 8.666666666666666 9
/workspace/OpenNMT-py/onmt/modules/GlobalAttention.py:176: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
align_vectors = self.sm(align.view(batch*targetL, sourceL))
/root/python/lib/python3.6/site-packages/torch/nn/modules/container.py:67: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
input = module(input)

SENT 5: ('浣?, ‘寤’, '彖?, ‘浠’, ‘锛’, '浣?, ‘珑’, ‘浠’, '姝?, '?')
PRED 5: I watched watched watched . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
PRED SCORE: -3.2823

PRED AVG SCORE: -0.5602, PRED PPL: 1.7510

Traceback (most recent call last):
  File "translate.py", line 152, in <module>
    main()
  File "translate.py", line 137, in main
    _report_score('PRED', pred_score_total, pred_words_total)
  File "translate.py", line 30, in _report_score
    name, score_total / words_total,
ZeroDivisionError: float division by zero
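The ZeroDivisionError at the end simply means `words_total` was 0, i.e. no target words were counted before the reporting step. A defensive version of that step (a sketch in the spirit of `_report_score`, not the actual OpenNMT-py code) would guard the division:

```python
import math

def report_score(name, score_total, words_total):
    # Guard against an empty translation run, which the stock script does not
    if words_total == 0:
        return "%s: no words produced" % name
    avg = score_total / words_total
    # Perplexity is exp(-average log-probability per word)
    return "%s AVG SCORE: %.4f, %s PPL: %.4f" % (name, avg, name, math.exp(-avg))
```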


(Eva) #2

Hi @MrWanghui,

One of the main problems is that 2,000 sentences is far too little training data to build a good translation model.

Also, you have to be careful with the encoding and tokenization of Chinese: what do you use as the translation unit? Words, BPE tokens, or characters? Changing that can help you obtain better results too.
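To illustrate the difference between the translation units mentioned here: character-level segmentation of Chinese needs no word segmenter at all and keeps the vocabulary small (a minimal sketch; the example sentence is illustrative):

```python
def char_tokenize(sentence):
    # Split a Chinese sentence into single characters, dropping whitespace;
    # this sidesteps word segmentation and shrinks the vocabulary a lot
    return [ch for ch in sentence if not ch.isspace()]

tokens = char_tokenize("今年 是 维多利亚")
print(" ".join(tokens))  # 今 年 是 维 多 利 亚
```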

Good luck!
Eva


(wanghui) #3

@emartinezVic Thank you very much for your advice. I repeated the bilingual experiment with just 3 sentence pairs, each repeated 500 times, and the model finally reproduced the input sentences well. This confirms that my training data was not sufficient.
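That sanity check, repeating a handful of sentence pairs until the model can memorize them, can be scripted like this (the sentence pairs and file names here are illustrative, not the poster's actual data):

```python
# Build a tiny overfitting corpus: 3 sentence pairs repeated 500 times each
src = ["今年 是 周年纪念", "观众 收看", "模特 的 数量"]
tgt = ["it is the anniversary", "watched by people", "the numbers of models"]

pairs = list(zip(src, tgt)) * 500   # 1500 training pairs in total

# Write parallel files in the layout OpenNMT's preprocess step expects
with open("src-train.txt", "w", encoding="utf-8") as fs, \
     open("tgt-train.txt", "w", encoding="utf-8") as ft:
    for s, t in pairs:
        fs.write(s + "\n")
        ft.write(t + "\n")
```

If a model cannot memorize even this corpus, the problem is in the pipeline (encoding, tokenization, options) rather than in the data volume.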

I wish you a happy life and smooth work!