Best Practice for Splitting Datasets

What is the best practice for combining disparate datasets and splitting them into training, validation, and testing sets for NMT? What I have been doing is concatenating multiple datasets with the cat command and then splitting the result into the three files I need with the split command.
I suspect it would be better to put everything into a Python dataframe and then use sklearn.model_selection.train_test_split twice to divide it up properly and randomly. How do people here go about combining and splitting their datasets? Is there a tool or script that makes this more straightforward?
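The two-step train_test_split idea could look like the following minimal sketch (the sentence pairs here are placeholders; an 80/10/10 ratio is assumed):

```python
# Sketch: split a combined corpus 80/10/10 with two calls to train_test_split.
from sklearn.model_selection import train_test_split

# Combined (source, target) pairs; in practice, read these from your corpus files.
pairs = [(f"src {i}", f"tgt {i}") for i in range(100)]

# First split: 80% train, 20% held out.
train, heldout = train_test_split(pairs, test_size=0.2, random_state=42)
# Second split: divide the held-out 20% evenly into validation and test.
valid, test = train_test_split(heldout, test_size=0.5, random_state=42)

print(len(train), len(valid), len(test))  # 80 10 10
```

Fixing random_state keeps the split reproducible across runs.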

The only problem with “cat | split” is that the data might be imbalanced among your various input files. You want to randomize the entire set; that is a big part of what train_test_split does. Also, you don’t need a dataframe — plain arrays (or lists) work fine.

You may have a look at this.
We created the multiple dataset option for this purpose.

I agree completely that you can end up with a poorly mixed validation set that way. How are you loading your data, if not through dataframes/read_csv? Are you reading it manually with a file read function? I always seem to get errors with read_csv that I can only fix by making it skip lines, which causes a ton of alignment issues, so I have just been sticking with the cat/split solution.
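One way to sidestep read_csv parsing problems entirely is to read the two sides of the corpus line by line and drop bad pairs together, so the source and target files can never fall out of alignment. A minimal sketch (the helper name and the inline sample data are made up for illustration):

```python
# Sketch: load a parallel corpus line by line so that filtering
# drops source and target together, preserving alignment.
def load_parallel(src_lines, tgt_lines):
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # drop a pair if either side is blank
            pairs.append((src, tgt))
    return pairs

# In practice the two iterables would be open file handles.
pairs = load_parallel(["hello\n", "\n", "world\n"],
                      ["bonjour\n", "salut\n", "monde\n"])
print(pairs)  # [('hello', 'bonjour'), ('world', 'monde')]
```

Because each filter decision applies to the pair as a whole, skipping a line never shifts one file relative to the other.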

Ah! I understand about dirty data. It is also possible to shuffle datasets with Unix command-line tools: on Linux there is the ‘shuf’ program.
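One caveat with shuf on parallel data: shuffling the source and target files separately destroys the alignment. A common trick is to paste the two sides together, shuffle the pairs, then cut them apart again (a sketch; the file names are placeholders):

```shell
# Sketch: shuffle a parallel corpus while keeping source/target lines aligned.
printf 'a\nb\nc\n' > src.txt
printf 'A\nB\nC\n' > tgt.txt

# Join the two sides, shuffle whole pairs, then split them back apart.
paste src.txt tgt.txt | shuf > shuffled.tsv
cut -f1 shuffled.tsv > src.shuf
cut -f2 shuffled.tsv > tgt.shuf

wc -l < src.shuf   # still 3 lines, same pairs in a new order
```

The same paste/cut trick works before feeding the data to split, so each output chunk stays parallel.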

In the paper Data Ordering Patterns for Neural Machine Translation: An Empirical Study, perplexity-based ordering gains +1.8 BLEU on the IWSLT2015 English-Vietnamese dataset. The pretrained XLM-R could be perfect for preprocessing the training data for large-scale NMT. And there is already a function in the XLM codebase which calculates the perplexity:

def evaluate_clm(self, scores, data_set, lang1, lang2):
    """
    Evaluate perplexity and next word prediction accuracy.
    """
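For reference, perplexity itself is just the exponential of the average per-token negative log-likelihood (cross-entropy in nats), so it is easy to compute outside any framework. A minimal sketch (the function name and token probabilities are made up for illustration):

```python
import math

# Sketch: perplexity from per-token model probabilities.
def perplexity(token_probs):
    # Average negative log-likelihood over the tokens, then exponentiate.
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that assigns uniform probability over 4 choices has perplexity 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
```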