The difference between parameters

toadzhou · December 29, 2016, 9:05am

train_src and valid_src
train_tgt and valid_tgt

Could you please tell me the difference between them? I want to prepare some test data, but I don’t know how. Could you give some examples for Chinese to English? Even the smallest dataset is OK. Thank you.

jinyeqiong · December 29, 2016, 12:31pm

“train” means the training corpus, and “valid” means the development (validation) corpus, which is used to tune parameters. You can hold out part of the training corpus as the validation corpus, but it’s best that the training corpus doesn’t overlap with the validation corpus. In other words, the training, validation and test corpora should all be different. I’m not sure my understanding is right, because I’m also a beginner, but I hope it’s useful for you.

guillaumekln · December 29, 2016, 1:39pm

For more details about training/validation/test sets, see here:

toadzhou · December 30, 2016, 1:58am

Thanks for the solution!
Would you mind answering a few more questions?
For example, I now have 1 billion sentences. Should I take 500 million for training, 250 million for validation and 250 million for testing?
In addition, for big data, would you have any suggestions for distributed deployment on a cluster?

guillaumekln · December 30, 2016, 3:14pm

You don’t need that many sentences to train a strong NMT model (if that is the task you are planning to do). You could use 5 to 10 million sentences for training, 2000 for validation and 2000 for testing.
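As a rough illustration of that kind of split, here is a minimal Python sketch that shuffles a parallel corpus and holds out a fixed number of sentence pairs for validation and testing. The function name `split_parallel` and the toy Chinese/English pairs are made up for this example and are not part of OpenNMT; the resulting lists would still need to be written to whatever files you pass as train_src, train_tgt, valid_src and valid_tgt.

```python
import random


def split_parallel(src_lines, tgt_lines, n_valid=2000, n_test=2000, seed=1234):
    """Shuffle aligned sentence pairs and split them into train/valid/test.

    Hypothetical helper for illustration only. The three sets are disjoint
    by position; if the corpus contains duplicate sentences, the same text
    can still end up in more than one set.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if n_valid + n_test >= len(src_lines):
        raise ValueError("corpus is too small for the requested held-out sizes")

    # Keep source and target together so the alignment survives the shuffle.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)

    return {
        "valid": pairs[:n_valid],
        "test": pairs[n_valid:n_valid + n_test],
        "train": pairs[n_valid + n_test:],
    }


# Toy Chinese-to-English data, invented for this example.
toy_src = ["你好", "谢谢", "早上好", "再见", "我很好"]
toy_tgt = ["hello", "thank you", "good morning", "goodbye", "I am fine"]
toy_split = split_parallel(toy_src, toy_tgt, n_valid=1, n_test=1)
```

With real data you would keep the default 2000/2000 held-out sizes and let everything else go to training.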

For clustering, do you mean for training or for handling translation requests?

toadzhou · January 3, 2017, 2:13am

Now I have 10+ billion sentences and 10 servers (8 GPUs per server).
The machines all use the same SAN storage.
I want to know how to use all 8 * 10 GPUs for training at the same time.

toadzhou · January 3, 2017, 3:03am

Does “translation requests” mean testing?

guillaumekln · January 3, 2017, 11:17am

We don’t support distributed training on multiple servers. However, we do support multi-GPU training. See:

http://opennmt.net//Guide/#parallel-training

How much CPU memory do you have? 10 billion is a lot of data. We usually work with corpora containing 1 million to 20 million sentences.
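If you want to cut a very large corpus down to that range without loading it all into memory, one possible approach is reservoir sampling, which reads the data once and keeps a uniform random subset of fixed size. This is only a sketch: the name `reservoir_sample` and the file names in the usage comment are assumptions for illustration, not something OpenNMT provides.

```python
import random


def reservoir_sample(items, k, seed=1234):
    """Return k items chosen uniformly at random from an iterable.

    Hypothetical helper (Algorithm R). Only k items are held in memory,
    so `items` can be a stream that is too large to load at once.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Usage sketch with assumed file names; zip keeps source and target aligned:
# with open("all.zh", encoding="utf-8") as src, open("all.en", encoding="utf-8") as tgt:
#     subset = reservoir_sample(zip(src, tgt), k=10_000_000)
```

The sampled pairs come back in no particular order, which is fine here because training data is normally shuffled anyway.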

toadzhou · January 4, 2017, 12:50am

Hardware configuration: 10 x (2x E5-2660 v3, 256 GB memory, 8x Tesla K40)
The sentences come from different industries.
The number of sentences is very large so that the model can learn better and improve its accuracy.