OpenNMT Forum

Best Practice for Splitting Datasets

What is the best practice for combining disparate datasets and splitting them into training, validation, and test sets for NMT? What I have been doing is using the cat command to combine multiple datasets and then using the split command to divide the result into the three files I need.
I suspect it would be better to put everything into a pandas DataFrame and then apply sklearn.model_selection.train_test_split twice to divide it up properly and randomly. How do people here go about combining and splitting their datasets? Is there a tool or script that makes this more straightforward?
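To make that concrete, here is roughly what I have in mind (just a sketch; the file names are made up, and I am assuming one sentence per line with line i of the source file aligned to line i of the target file):

import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up file names; one sentence per line, source and target aligned by line.
with open("combined.src", encoding="utf-8") as f:
    src = f.read().splitlines()
with open("combined.tgt", encoding="utf-8") as f:
    tgt = f.read().splitlines()

df = pd.DataFrame({"src": src, "tgt": tgt})

# 80/10/10: carve out the training set first, then split the
# remainder evenly into validation and test. Shuffling is on by default.
train, rest = train_test_split(df, test_size=0.2, random_state=42)
valid, test = train_test_split(rest, test_size=0.5, random_state=42)

# Write each split back out as plain line-aligned text files.
for name, part in [("train", train), ("valid", valid), ("test", test)]:
    with open(f"{name}.src", "w", encoding="utf-8") as fs, \
         open(f"{name}.tgt", "w", encoding="utf-8") as ft:
        fs.write("\n".join(part["src"]) + "\n")
        ft.write("\n".join(part["tgt"]) + "\n")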

The only problem with “cat | split” is that the data might be imbalanced across your various input files. You want to randomize the entire set; that is a big part of what train_test_split does. Also, you don’t need a DataFrame: plain arrays (or lists) work fine.
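A minimal sketch of that with toy data: train_test_split accepts plain lists, and giving it the source and target sides together keeps them aligned.

from sklearn.model_selection import train_test_split

# Toy parallel data; in practice these come straight from the corpus files.
src = ["hello", "good morning", "thank you", "see you", "welcome", "no"]
tgt = ["hallo", "guten Morgen", "danke", "bis bald", "willkommen", "nein"]

# Passing both lists in a single call applies the same random
# permutation to each, so source/target alignment is preserved.
src_train, src_rest, tgt_train, tgt_rest = train_test_split(
    src, tgt, test_size=2, random_state=0)

Splitting src_rest/tgt_rest once more the same way gives you the validation and test sets.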

You may want to have a look at https://github.com/OpenNMT/OpenNMT-py/pull/1413. We created the multiple-dataset option for exactly this purpose.
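For reference, in recent OpenNMT-py versions this is driven by a YAML config along these lines (a sketch only; the corpus names, paths, and weights are invented, and the exact options in the linked PR may differ):

data:
    corpus_a:
        path_src: data/corpus_a.src
        path_tgt: data/corpus_a.tgt
        weight: 2
    corpus_b:
        path_src: data/corpus_b.src
        path_tgt: data/corpus_b.tgt
        weight: 1
    valid:
        path_src: data/valid.src
        path_tgt: data/valid.tgt

Training then samples from the corpora in proportion to their weights, so you never have to physically concatenate the files.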

@LanceNorskog
I agree completely that you can end up with a poorly mixed validation set if you do it that way. How are you loading your data, if not through DataFrames/read_csv? Are you reading the files manually with a plain file-read function? I always seem to get errors with read_csv that I can only fix by telling it to skip lines, which causes a ton of alignment issues, so I have just been sticking with the cat/split solution.
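For what it’s worth, the failure mode is that read_csv treats quote and delimiter characters inside the sentences as structure. Reading the two sides line by line avoids that entirely (a sketch; the file names are made up):

with open("combined.src", encoding="utf-8") as f:
    src = [line.rstrip("\n") for line in f]
with open("combined.tgt", encoding="utf-8") as f:
    tgt = [line.rstrip("\n") for line in f]

# A length mismatch here is exactly the misalignment that skipped
# lines would otherwise hide, so fail loudly instead.
assert len(src) == len(tgt), f"{len(src)} source vs {len(tgt)} target lines"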

Ah! I understand about dirty data. It is also possible to shuffle datasets with Unix command-line tools: on Linux there is a program called shuf.
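One caveat: running shuf on the source and target files separately destroys the alignment. A common trick is to paste the two files together, shuffle once, and cut them apart again (a sketch with made-up file names, assuming the sentences contain no tab characters):

paste combined.src combined.tgt | shuf > shuffled.tsv
cut -f1 shuffled.tsv > shuffled.src
cut -f2 shuffled.tsv > shuffled.tgt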

In the paper Data Ordering Patterns for Neural Machine Translation: An Empirical Study, perplexity-based ordering gains +1.8 BLEU on the IWSLT2015 English-Vietnamese dataset. The pretrained XLM-R model could be a good fit for preprocessing the training data of a large-scale NMT system, and there is already a function in evaluator.py that calculates perplexity:

def evaluate_clm(self, scores, data_set, lang1, lang2):
    """
    Evaluate perplexity and next word prediction accuracy.
    """