OpenNMT Forum

Training Chinese-English



Hi @emartinezVic. I am training a Chinese-English model with the following settings:
-batch_size 8 -layers 4 -rnn_size 1000 -learning_rate 0.001 -opt adam -train_steps 105000
Everything else is left at the default configuration.
My data is split into 840000 for training, 105000 for validation, and 105000 for test.
What improvements can I make?

(Eva) #2

Hi @pankitbhat,

there are many different things that you can do :slight_smile:

Here are some hints:

  • Amount of data: people usually use 1000-3000 sentences each for the test and development sets. About 1000000 sentences for training seems enough to obtain a good model (depending on the quality of the data).

  • Kind and quality of the data: in general, data are preprocessed before building any NLP model: tokenize, remove bilingual duplicates, perhaps remove sentences that are too long or too short, etc. After that, to keep the vocabulary under control, the data are segmented using morphologically motivated methods (Morfessor, FlatCat) or frequency-based methods like Byte Pair Encoding (BPE). I think for Chinese people usually use other kinds of segmentation methods, or even work at the character level. I suggest you do a little research on this :wink:

  • It is important that the development set is related to your test set. The development set is used to evaluate the model during training and to choose which checkpoint to keep, so the model you end up with is tuned to this kind of data. Thus, if you translate texts from another domain, you may not get a translation as good as you expected.

  • Regarding training parameters, increasing the batch size (8 is quite small) will usually speed up your training. Also, I would monitor the loss on both the development and training sets to get an idea of how the model's learning is going.
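
To make the preprocessing hint a bit more concrete, here is a minimal sketch in plain Python. It is not part of OpenNMT; the function names `clean_parallel` and `to_characters` and the length limits are assumptions chosen for illustration, and it expects sentence pairs whose tokens are already separated by spaces.

```python
def to_characters(text):
    """Character-level segmentation: put a space between all non-space characters."""
    return " ".join(ch for ch in text if not ch.isspace())


def clean_parallel(pairs, min_len=1, max_len=100):
    """Drop bilingual duplicates and pairs that are too short or too long.

    pairs: iterable of (source, target) strings with space-separated tokens.
    min_len and max_len are illustrative thresholds, not recommended values.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if (src, tgt) in seen:
            continue  # exact duplicate of a pair we already kept or rejected
        seen.add((src, tgt))
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if not (min_len <= n_src <= max_len):
            continue  # source side too short or too long
        if not (min_len <= n_tgt <= max_len):
            continue  # target side too short or too long
        kept.append((src, tgt))
    return kept
```

If you go for character-level Chinese, you would probably apply `to_characters` to the Chinese side first, so that the length filter counts characters rather than unsegmented chunks.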

And I think those are the general ideas I can tell you without knowing more of your particular translation scenario :wink:
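
For the data-size hint, one possible way to re-split a corpus is sketched below. It assumes all sentence pairs can be pooled and shuffled, which may not hold if your test set is a fixed benchmark; `split_corpus`, the sizes, and the seed are hypothetical names and values for illustration only.

```python
import random


def split_corpus(pairs, dev_size=2000, test_size=2000, seed=1234):
    """Shuffle sentence pairs and hold out small dev and test sets.

    Returns (train, dev, test). The default sizes follow the usual
    1000-3000 range; they are not tuned for any particular corpus.
    """
    pairs = list(pairs)
    if dev_size + test_size >= len(pairs):
        raise ValueError("held-out sets would use up the whole corpus")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(pairs)
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test
```

With the numbers from the question (840000 + 105000 + 105000 = 1050000 pairs in total), holding out 2000 pairs each for development and test would leave 1046000 pairs for training.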

Nevertheless, I suggest you do some research and read papers that explain how successful Chinese-English NMT systems were set up. They will surely give you better ideas for improvement :wink:
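
Coming back to the point about monitoring the loss: a simple way to use the development loss is to keep the checkpoint where it was lowest and stop once it has not improved for a while. The sketch below assumes you have collected one development loss per evaluation step yourself; `best_checkpoint` and `patience` are hypothetical names, not OpenNMT options.

```python
def best_checkpoint(dev_losses, patience=3):
    """Find the evaluation step with the lowest development loss.

    Returns (best_index, stopped_early). stopped_early is True when the
    loss failed to improve for `patience` consecutive evaluations.
    """
    best_idx = None
    best_loss = float("inf")
    for i, loss in enumerate(dev_losses):
        if best_idx is None or loss < best_loss:
            best_loss, best_idx = loss, i  # new best checkpoint
        elif i - best_idx >= patience:
            return best_idx, True  # no improvement for `patience` steps
    return best_idx, False
```

If the training loss keeps going down while the development loss goes up, the model is most likely overfitting, and the checkpoint at `best_index` is the one worth keeping.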

Good luck!


Thank you so much @emartinezVic. I will look into it. Meanwhile, is there any contact you could share so that I can reach you with further questions?