You need a publicly available parallel corpus that is sentence-aligned, normalized and tokenized (you can do these steps yourself if you want).
You can use the corpora from WMT-2017 (http://www.statmt.org/wmt17/), the Europarl Corpus (http://www.statmt.org/europarl/), or WMT'14 (http://www.statmt.org/wmt14/translation-task.html).
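If your corpus is not already normalized and tokenized, a minimal sketch of those two steps is shown below. This is a naive illustration using only the Python standard library; real pipelines typically use a proper tokenizer such as the Moses scripts or the sacremoses package instead.

```python
import re
import unicodedata

def normalize(line):
    # Unicode NFC normalization plus whitespace cleanup.
    line = unicodedata.normalize("NFC", line)
    return re.sub(r"\s+", " ", line).strip()

def tokenize(line):
    # Naive tokenizer: split common punctuation off from words.
    # A real pipeline would use the Moses tokenizer / sacremoses here.
    return re.sub(r'([.,!?;:()"])', r" \1 ", line).split()

print(tokenize(normalize("Hello,  world!")))  # ['Hello', ',', 'world', '!']
```

You would run every line of both the source and target files through the same normalization so the two sides stay consistent.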
For training, src-train.txt should contain the training data in the source language, and tgt-train.txt the training data in the target language. For a decent NMT model, this data should have at least 2.5 million sentence pairs.
The validation texts, src-valid.txt and tgt-valid.txt, should typically consist of around 500 to 3000 sentences; they are used to evaluate the convergence of the model. You can create validation texts on your own if they are not available in your corpus, but make sure the sentences are parallel-aligned, normalized and tokenized.
The test data could be around 2500 sentences.
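If your corpus comes as one big pair of files, the train/valid/test split described above can be sketched as follows. The file names match the text; `split_corpus` and its parameters are illustrative, not part of any toolkit. The key point is shuffling the source and target lines together so sentence pairs stay aligned.

```python
import random

def split_corpus(src_lines, tgt_lines, n_valid=2000, n_test=2500, seed=13):
    # Shuffle source and target together so sentence pairs stay aligned,
    # then carve off validation and test sets; the rest is training data.
    assert len(src_lines) == len(tgt_lines), "corpus must be parallel"
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test

# Hypothetical usage, writing out the files named in the text:
# train, valid, test = split_corpus(open("src.txt").readlines(),
#                                   open("tgt.txt").readlines())
# for name, data in [("train", train), ("valid", valid), ("test", test)]:
#     with open(f"src-{name}.txt", "w") as fs, open(f"tgt-{name}.txt", "w") as ft:
#         for s, t in data:
#             fs.write(s)
#             ft.write(t)
```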