The difference between parameters

(Toadzhou) #1

train_src and valid_src
train_tgt and valid_tgt

Could you please tell me the difference between them? I want to prepare some test data, but I don’t know how to do it. Could you give some examples for Chinese to English? Even the smallest data set is fine. Thank you.


“train” refers to the training corpus, and “valid” refers to the validation (development) corpus, which is used to adjust parameters. You can hold out part of the training corpus as the validation corpus, but it is best that the two do not overlap. In other words, the training corpus, the validation corpus and the test corpus should all be different. I am not sure my understanding is right, because I’m also a beginner, but I hope it’s useful for you.
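
To make this concrete, here is a minimal sketch in plain Python. The four lists stand in for the four files (one sentence per line, source and target aligned line by line). The Chinese to English pairs are toy examples invented for illustration, and `check_aligned` and `overlap` are hypothetical helper names, not part of any toolkit.

```python
# Toy Chinese-to-English pairs, invented for illustration only.
# Line i of a *_src list is the source of line i of the matching *_tgt list.
train_src = ["你好", "谢谢", "早上好"]
train_tgt = ["Hello", "Thank you", "Good morning"]
valid_src = ["再见"]
valid_tgt = ["Goodbye"]


def check_aligned(src, tgt):
    """Return True if source and target have the same number of lines."""
    return len(src) == len(tgt)


def overlap(train_src, train_tgt, valid_src, valid_tgt):
    """Return the sentence pairs that appear in both training and validation."""
    train_pairs = set(zip(train_src, train_tgt))
    return [pair for pair in zip(valid_src, valid_tgt) if pair in train_pairs]
```

If `overlap(...)` returns an empty list, the validation pairs do not appear in the training pairs, which is the situation described above.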

(Guillaume Klein) #3

For more details about training/validation/test sets, see here:

(Toadzhou) #4

Thanks for the solution!
Would you mind answering a few more questions?
For example, I now have 1 billion sentences. Should I use 500 million for training, 250 million for validation and 250 million for testing?
In addition, for data this large, do you have any suggestions for distributed deployment on a cluster?

(Guillaume Klein) #5

You don’t need that many sentences to train a strong NMT model (if that is the task you are planning to do). You could use 5 to 10 million sentences for training, 2000 for validation and 2000 for testing.
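
As a rough sketch of such a split, assuming the corpus fits in memory as two aligned lists of sentences: shuffle the line indices once, then cut off fixed-size validation and test sets and keep the rest for training. The function name `split_parallel` and its parameters are hypothetical, and this is plain Python rather than any toolkit's API.

```python
import random


def split_parallel(src, tgt, n_valid=2000, n_test=2000, seed=0):
    """Split aligned source/target lists into disjoint train/valid/test parts.

    Returns a dict mapping "train", "valid" and "test" to a
    (source_lines, target_lines) tuple.
    """
    if len(src) != len(tgt):
        raise ValueError("source and target must have the same number of lines")
    if n_valid + n_test >= len(src):
        raise ValueError("not enough sentences left for training")

    # Shuffle indices (not the lists) so source and target stay aligned.
    indices = list(range(len(src)))
    random.Random(seed).shuffle(indices)

    parts = {
        "valid": indices[:n_valid],
        "test": indices[n_valid:n_valid + n_test],
        "train": indices[n_valid + n_test:],
    }
    return {
        name: ([src[i] for i in ids], [tgt[i] for i in ids])
        for name, ids in parts.items()
    }
```

Each sentence pair lands in exactly one of the three parts, so the sets do not overlap as long as the corpus itself contains no duplicate pairs.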

For clustering, do you mean it for training or handling translation requests?

(Toadzhou) #6

I now have 10+ billion sentences and 10 servers (8 GPUs per server).
The machines all use the same SAN storage.
I want to know how to use all 8 × 10 GPUs for training at the same time.

(Toadzhou) #7

Does “handling translation requests” mean testing?

(Guillaume Klein) #8

We don’t support distributed training on multiple servers. However, we do support multi-GPU training. See:

How much CPU memory do you have? 10 billion is a lot of data. We usually work with corpora containing 1 million to 20 million sentences.

(Toadzhou) #9

Hardware configuration: 10 × (2 × E5-2660 v3, 256 GB memory, 8 × Tesla K40)
The sentences come from different industries.
The number of sentences is very large so that the model can learn better and improve its precision.