English to Chinese

Hello everyone, I am trying to translate from English to Chinese. I have 14,005,403 training sentence pairs and 44,298 validation sentence pairs. Here is my process.

1. Tokenize the English with nltk.word_tokenize and segment the Chinese with Jieba.
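A minimal sketch of this tokenization step, for reference. The helper names (`tokenize_pair` and the two private wrappers) are hypothetical, and lowercasing the English side is an assumption based on the sample sentences in this post. `nltk.word_tokenize` needs the NLTK tokenizer data to be downloaded first.

```python
from typing import Callable, List, Tuple

Tokenizer = Callable[[str], List[str]]


def _nltk_tokenize(text: str) -> List[str]:
    # Imported lazily so the module loads even when nltk is not installed.
    import nltk
    return nltk.word_tokenize(text)


def _jieba_tokenize(text: str) -> List[str]:
    # jieba.cut yields segments, including whitespace ones, which we drop.
    import jieba
    return [tok for tok in jieba.cut(text) if tok.strip()]


def tokenize_pair(
    en: str,
    zh: str,
    en_tok: Tokenizer = _nltk_tokenize,
    zh_tok: Tokenizer = _jieba_tokenize,
) -> Tuple[str, str]:
    """Return both sides as space-separated token strings."""
    en_tokens = en_tok(en.lower())  # assumption: English is lowercased
    zh_tokens = zh_tok(zh)
    return " ".join(en_tokens), " ".join(zh_tokens)
```

The tokenizers are passed in as arguments so that either side can be swapped out (for example, for a subword tokenizer) without touching the rest of the pipeline.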

2. Use fast_align for word alignment and delete Chinese-English sentence pairs with an alignment ratio below 1/2.
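A rough sketch of such a filter. The post does not define "alignment ratio", so this assumes it means the fraction of tokens covered by at least one alignment link, taking the smaller of the two sides. The function names are hypothetical; the only thing relied on from fast_align is its output format of `i-j` index pairs (0-based) per sentence.

```python
from typing import Iterable, List, Tuple


def alignment_ratio(src: str, tgt: str, links_line: str) -> float:
    """Fraction of tokens that have a link, taking the worse of the two sides."""
    src_tokens, tgt_tokens = src.split(), tgt.split()
    if not src_tokens or not tgt_tokens:
        return 0.0
    links = [tuple(int(x) for x in pair.split("-")) for pair in links_line.split()]
    src_covered = {i for i, _ in links}
    tgt_covered = {j for _, j in links}
    return min(len(src_covered) / len(src_tokens),
               len(tgt_covered) / len(tgt_tokens))


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    alignments: Iterable[str],
    threshold: float = 0.5,
) -> List[Tuple[str, str]]:
    """Keep sentence pairs whose alignment ratio is at least `threshold`."""
    return [
        (src, tgt)
        for (src, tgt), links in zip(pairs, alignments)
        if alignment_ratio(src, tgt, links) >= threshold
    ]
```

If the ratio was computed differently (for example, on the source side only), only `alignment_ratio` needs to change.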

Some of the data after processing are as follows:
in recent years , the frequency and magnitude of major disasters , whether of a natural , technological or ecological origin , have made the world community aware of the immense loss of human life and economic resources that are regularly caused by such calamities .
15 . 最近 几年 , 由 自然 、 技术 或 生态 造成 的 灾害 的 频繁 程度 和 规模 使 国际 社会意识 到 由于 此类 灾害 经常 对 人 的 生命 和 经济 资源 造成 的 巨大损失 。

particularly hard hit are developing countries , for which the magnitude of disasters frequently outstrips the ability of the society to cope with them .
发展中国家 特别 深受其害 , 对 它们 而言 , 灾害 的 规模 经常 超出 了 它们 应付 的 能力 。

it was stated that this was due to the fact that 95 per cent of all disasters occurred in developing countries .
据称 这 是因为 在 所有 的 灾害 中 95% 发生 在 发展中国家 。

3. Finally, OpenNMT-py is used to preprocess and train.
These are my training parameters:
python train.py -data $model -save_model ./demo-model -batch_size=32 -learning_rate=0.1 -train_steps=600000 -gpu_ranks 0

4. Translation result

The translation results are not very satisfactory. Is there a better approach? Thank you for your comments.

In addition, if there is a good Chinese-English model, I would like to buy it.


You should probably try to train a larger model, see for example:


@guillaumekln Thank you for your reply. I got a good model through training. However, I found that the model is the same size whether I train on a 5 million or a 14 million sentence corpus. Should I increase the number of encoder/decoder layers and the size of the RNN hidden state? Could you suggest training parameters for a large corpus? Thanks again.

The size of the model does not depend on the size of the data.
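To make this concrete, here is a rough parameter count for an LSTM encoder-decoder. It is a simplified sketch, not OpenNMT-py's exact count: it ignores attention, input feeding and bidirectionality, and uses one bias vector per gate. The function names and default values are assumptions for illustration. The point is that the corpus size is not an input, only the vocabulary size and the layer dimensions.

```python
def lstm_stack_params(input_size: int, hidden: int, layers: int) -> int:
    """Weights and biases of a stacked, unidirectional LSTM."""
    total = 0
    for layer in range(layers):
        layer_input = input_size if layer == 0 else hidden
        # 4 gates, each with input weights, recurrent weights and a bias.
        total += 4 * hidden * (layer_input + hidden + 1)
    return total


def rough_model_params(
    vocab_size: int = 50000,   # assumed vocabulary cap, same on both sides
    word_vec_size: int = 500,  # assumed embedding size
    rnn_size: int = 500,       # assumed hidden size
    layers: int = 2,           # assumed number of layers
) -> int:
    embeddings = 2 * vocab_size * word_vec_size  # source + target
    encoder = lstm_stack_params(word_vec_size, rnn_size, layers)
    decoder = lstm_stack_params(word_vec_size, rnn_size, layers)
    generator = rnn_size * vocab_size + vocab_size  # output projection
    return embeddings + encoder + decoder + generator


print(rough_model_params())               # same for 5M or 14M sentence pairs
print(rough_model_params(rnn_size=1000))  # grows only when dimensions grow
```

More data changes the model size only indirectly, if it leads to a larger vocabulary, and that does not happen when the vocabulary is capped at a fixed size.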

If you have lots of data, you can indeed increase the hidden dimensions. Based on the link above, you could double the values of:

  • -word_vec_size
  • -rnn_size
  • -transformer_ff
  • -heads

to train a “big” Transformer model.
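As an illustration, the doubling could be scripted as below. The base values are an assumption (common "base" Transformer settings), since the linked configuration is not reproduced in this thread, and the helper names are hypothetical. Only the four option names come from the list above.

```python
# Assumed "base" Transformer values; check them against the configuration you start from.
BASE_OPTIONS = {
    "word_vec_size": 512,
    "rnn_size": 512,
    "transformer_ff": 2048,
    "heads": 8,
}


def double_options(base: dict) -> dict:
    """Double every value, checking that the hidden size still splits across heads."""
    big = {name: 2 * value for name, value in base.items()}
    if big["rnn_size"] % big["heads"] != 0:
        raise ValueError("rnn_size must be divisible by heads")
    return big


def to_flags(options: dict) -> str:
    """Format options as command-line flags for train.py."""
    return " ".join(f"-{name} {value}" for name, value in options.items())


BIG_OPTIONS = double_options(BASE_OPTIONS)
print(to_flags(BIG_OPTIONS))
# -word_vec_size 1024 -rnn_size 1024 -transformer_ff 4096 -heads 16
```

The resulting flags would be added to the rest of the Transformer training command. A model this size needs noticeably more GPU memory, so the batch size may have to come down.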