English to Chinese

Jie · October 8, 2019, 1:45am

Hello everyone, I am trying to translate from English to Chinese. I have 14,005,403 training sentence pairs and 44,298 validation sentence pairs. Here is my process.

1. Tokenize the English text with nltk.word_tokenize and segment the Chinese text with Jieba.

2. Use fast_align for word alignment and delete English-Chinese sentence pairs whose alignment ratio is less than 1/2.
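For readers who want to reproduce the filtering in step 2, here is a minimal sketch. The post does not define "alignment ratio", so this assumes it means the fraction of tokens on each side that have at least one alignment link, and it takes the smaller of the two sides. The function names `alignment_ratio` and `filter_pairs` are hypothetical. The alignment strings follow fast_align's `i-j` output format, where `i` indexes a source token and `j` a target token.

```python
def alignment_ratio(src_tokens, tgt_tokens, alignment_line):
    """Fraction of tokens covered by at least one link (assumed definition)."""
    if not src_tokens or not tgt_tokens:
        return 0.0
    src_aligned, tgt_aligned = set(), set()
    for link in alignment_line.split():
        i, j = link.split("-")
        src_aligned.add(int(i))
        tgt_aligned.add(int(j))
    # Use the less covered side so a pair is kept only if both sides align well.
    return min(len(src_aligned) / len(src_tokens),
               len(tgt_aligned) / len(tgt_tokens))


def filter_pairs(pairs, alignments, threshold=0.5):
    """Keep tokenized (src, tgt) pairs whose alignment ratio is >= threshold."""
    kept = []
    for (src, tgt), align in zip(pairs, alignments):
        if alignment_ratio(src.split(), tgt.split(), align) >= threshold:
            kept.append((src, tgt))
    return kept


# Toy data: both sides are already tokenized and space-separated.
pairs = [("a b c d", "x y z w"), ("a b c d", "x y z w")]
alignments = ["0-0 1-1 2-2", "0-0"]
kept = filter_pairs(pairs, alignments)
```

In the toy data the first pair has three of four tokens aligned on each side and is kept, while the second has only one of four and is dropped.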

Some of the data after processing are as follows:
in recent years , the frequency and magnitude of major disasters , whether of a natural , technological or ecological origin , have made the world community aware of the immense loss of human life and economic resources that are regularly caused by such calamities .
15 . 最近几年，由自然、技术或生态造成的灾害的频繁程度和规模使国际社会意识到由于此类灾害经常对人的生命和经济资源造成的巨大损失。

particularly hard hit are developing countries , for which the magnitude of disasters frequently outstrips the ability of the society to cope with them .
发展中国家特别深受其害，对它们而言，灾害的规模经常超出了它们应付的能力。

it was stated that this was due to the fact that 95 per cent of all disasters occurred in developing countries .
据称这是因为在所有的灾害中 95% 发生在发展中国家。

3. Finally, I use OpenNMT-py to preprocess and train.
These are my training parameters:
python train.py -data $model -save_model ./demo-model -batch_size=32 -learning_rate=0.1 -train_steps=600000 -gpu_ranks 0

4. Translation results
[image: translation results]

The translation results are not very satisfactory. Is there a better approach? Thank you for your comments.

In addition, if there is a good Chinese-English model, I would like to buy it.

guillaumekln · October 8, 2019, 8:04am

You should probably try to train a larger model, see for example:

http://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-the-transformer-model

Jie · November 25, 2019, 2:33am

@guillaumekln Thank you for your reply. I got a good model through training. However, I found that the model is the same size whether I train on the 5 million or the 14 million sentence corpus. Should I increase the number of encoder/decoder layers and the size of the RNN hidden state? Could you suggest training parameters for a large corpus? Thanks again.

guillaumekln · November 25, 2019, 8:40am

The size of the model does not depend on the size of the data.
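A rough parameter count makes this concrete. The sketch below is an approximation for a standard Transformer encoder-decoder, not OpenNMT-py's exact accounting, and the function name `transformer_param_count` is hypothetical. The point is that the inputs are the hidden dimensions, the number of layers, and the vocabulary sizes. The number of training sentences is not an argument at all. The vocabulary size of 50,000 used here is an assumed value.

```python
def transformer_param_count(d_model, d_ff, layers, src_vocab, tgt_vocab):
    """Approximate parameter count of a Transformer encoder-decoder."""
    # Query, key, value and output projections, each with a bias.
    attn = 4 * (d_model * d_model + d_model)
    # Two linear layers of the position-wise feed-forward block.
    ffn = d_model * d_ff + d_ff + d_ff * d_model + d_model
    # Layer norm has a gain and a bias per dimension.
    norm = 2 * d_model
    enc_layer = attn + ffn + 2 * norm
    # A decoder layer adds a second attention block over the encoder output.
    dec_layer = 2 * attn + ffn + 3 * norm
    embeddings = (src_vocab + tgt_vocab) * d_model
    generator = d_model * tgt_vocab + tgt_vocab
    return layers * (enc_layer + dec_layer) + embeddings + generator


# Assumed settings: 6 layers, 50,000-word vocabularies on both sides.
base = transformer_param_count(512, 2048, 6, 50000, 50000)
big = transformer_param_count(1024, 4096, 6, 50000, 50000)
print(f"base: {base:,} parameters, big: {big:,} parameters")
```

The count changes only when the dimensions, depth, or vocabulary change, which is why the 5 million and 14 million sentence runs produce checkpoints of the same size when the options and vocabulary size are the same.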

If you have lots of data, you can indeed increase the hidden dimensions. Based on the link above, you could double the values of:

-word_vec_size
-rnn_size
-transformer_ff
-heads

to train a “big” Transformer model.
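As a small illustration of the doubling, the sketch below starts from the values I recall the linked FAQ using for its base Transformer (512, 512, 2048, and 8 heads). Treat those base numbers as an assumption and check them against the FAQ page. The helper names are hypothetical. The snippet only builds the option string and does not run training.

```python
# Base values as recalled from the linked FAQ (assumption, verify before use).
BASE = {
    "word_vec_size": 512,
    "rnn_size": 512,
    "transformer_ff": 2048,
    "heads": 8,
}


def double_values(options):
    """Return a copy of the options with every value doubled."""
    return {name: value * 2 for name, value in options.items()}


def to_flags(options):
    """Render options as command-line flags in train.py's -name value style."""
    return " ".join(f"-{name} {value}" for name, value in options.items())


big = double_values(BASE)
# The hidden size must stay divisible by the number of attention heads.
assert big["rnn_size"] % big["heads"] == 0
print(to_flags(big))
```

The printed flags would replace the corresponding ones in the FAQ's Transformer command. The other options in that command stay unchanged.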