Best Practice for Splitting Datasets

BaruchG · November 5, 2019, 8:18pm

What is the best practice for combining disparate datasets and splitting them into training, validation and test sets for NMT? So far I have been using the cat command to concatenate multiple datasets and then the split command to cut the result into the three files I need.
I guess it would be better to put everything into a pandas DataFrame, on which I could call sklearn.model_selection.train_test_split twice to divide it up properly and randomly. How do people here go about combining and splitting their datasets? Is there a tool or script that makes this more straightforward?

LanceNorskog · November 6, 2019, 11:28pm

The only problem with “cat | split” is that the data might be imbalanced among your various input files: the files are concatenated in order, so a validation or test chunk cut from the end may come from a single source. You want to randomize the entire set; this is a big part of what train_test_split does. Also, you don’t need a DataFrame, you can just use arrays.
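As a rough illustration of the "just use arrays" idea, here is a minimal sketch that shuffles aligned sentence pairs together and then cuts them into three sets. It uses only the Python standard library instead of scikit-learn; the function name split_parallel, the 10% fractions and the seed are made up for the example.

```python
import random

def split_parallel(src_lines, tgt_lines, valid_frac=0.1, test_frac=0.1, seed=13):
    """Shuffle aligned sentence pairs together, then cut into train/valid/test."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    # Zip first so each source sentence stays attached to its translation.
    pairs = list(zip(src_lines, tgt_lines))
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(pairs)
    n_valid = int(len(pairs) * valid_frac)
    n_test = int(len(pairs) * test_frac)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test

# Toy corpus standing in for the combined datasets (hypothetical data).
src = [f"src {i}" for i in range(100)]
tgt = [f"tgt {i}" for i in range(100)]
train, valid, test = split_parallel(src, tgt)
print(len(train), len(valid), len(test))
```

Because the shuffle happens on the whole combined list before any cut, every input file has the same chance of contributing to each of the three sets.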

vince62s · November 7, 2019, 6:02am

You may want to have a look at this: https://github.com/OpenNMT/OpenNMT-py/pull/1413
We created the multiple dataset option for this purpose.

BaruchG · November 7, 2019, 4:41pm

@LanceNorskog
I agree completely that you can end up with an unmixed validation set if you do it that way. How are you reading in your data if not through dataframes/read_csv? Are you reading it manually with a file read function? I always seem to get errors with read_csv, and the only fix I have found is to make it skip the offending lines. That causes a ton of alignment issues, so I have just been using the cat/split solution.

LanceNorskog · November 11, 2019, 1:02am

Ah! I understand about dirty data. It is also possible to shuffle datasets with Unix command-line tools: on Linux there is a program called ‘shuf’.
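One caveat with shuffling parallel data is that the source and target files have to be shuffled with the same permutation, otherwise the sentence pairs no longer line up. The sketch below shows one way to do that with plain file reads in Python, which also sidesteps CSV parsing on dirty text. The helper names and the demo file names are hypothetical.

```python
import os
import random
import tempfile

def read_lines(path):
    # Plain line-by-line read: no CSV parsing, so quotes, commas or tabs
    # inside a sentence cannot cause rows to be skipped or merged.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

def shuffle_parallel_files(src_paths, tgt_paths, out_src, out_tgt, seed=13):
    """Combine several aligned file pairs and write them out in one shared random order."""
    pairs = []
    for sp, tp in zip(src_paths, tgt_paths):
        src, tgt = read_lines(sp), read_lines(tp)
        if len(src) != len(tgt):
            raise ValueError(f"{sp} and {tp} differ in line count")
        pairs.extend(zip(src, tgt))
    # Shuffling the pairs (not the two files separately) keeps the alignment.
    random.Random(seed).shuffle(pairs)
    with open(out_src, "w", encoding="utf-8") as fs, \
         open(out_tgt, "w", encoding="utf-8") as ft:
        for s, t in pairs:
            fs.write(s + "\n")
            ft.write(t + "\n")
    return len(pairs)

# Demo on two tiny throwaway corpora (file names are invented for illustration).
with tempfile.TemporaryDirectory() as tmp:
    corpora = {
        "a.en": ["hello", 'good, "quoted" text'],
        "a.de": ["hallo", "guter Text"],
        "b.en": ["cat"],
        "b.de": ["Katze"],
    }
    for name, lines in corpora.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("\n".join(lines) + "\n")

    def p(name):
        return os.path.join(tmp, name)

    n_pairs = shuffle_parallel_files(
        [p("a.en"), p("b.en")], [p("a.de"), p("b.de")], p("all.en"), p("all.de")
    )
    shuffled = list(zip(read_lines(p("all.en")), read_lines(p("all.de"))))

print(n_pairs, shuffled)
```

The line-count check is there to catch misaligned inputs early, before they are mixed into the combined corpus.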

Bachstelze · November 13, 2019, 7:59pm

In the paper “Data Ordering Patterns for Neural Machine Translation: An Empirical Study”, perplexity ordering gains +1.8 BLEU on the IWSLT2015 English-Vietnamese dataset. The pretrained XLM-R could be a good fit for preprocessing the training data for large-scale NMT. And there is already a function in evaluator.py that calculates the perplexity:

def evaluate_clm(self, scores, data_set, lang1, lang2):
    """
    Evaluate perplexity and next word prediction accuracy.
    """