What is the best practice for combining disparate datasets and splitting them into training, validation, and test sets for NMT? What I have been doing is concatenating multiple datasets with the cat
command and then splitting the result into the three files I need with the split command.
I suspect it would be better to load everything into a Python dataframe and then apply sklearn.model_selection.train_test_split
twice to divide it up properly and randomly. How do people on here go about combining and splitting their datasets? Is there a tool or script that makes this more straightforward?
The only problem with "cat | split" is that the data might be imbalanced among your various input files. You want to randomize the entire set; that shuffling is a big part of what train_test_split does for you. Also, you don't need a dataframe: plain arrays (or lists) work just as well.
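For concreteness, a minimal sketch of that approach, assuming two line-aligned parallel files produced by cat (the filenames combined.src/combined.tgt are hypothetical). Passing both lists to train_test_split in one call shuffles them together, so the sentence pairs stay aligned:

from sklearn.model_selection import train_test_split

# Read the concatenated parallel corpus as plain lists; no dataframe needed.
with open("combined.src", encoding="utf-8") as f:
    src = f.read().splitlines()
with open("combined.tgt", encoding="utf-8") as f:
    tgt = f.read().splitlines()
assert len(src) == len(tgt), "source and target must stay line-aligned"

# First split: hold out 10% of the pairs for test (shuffling is the default).
src_rest, src_test, tgt_rest, tgt_test = train_test_split(
    src, tgt, test_size=0.1, random_state=42)

# Second split: carve a validation set out of what remains.
src_train, src_val, tgt_train, tgt_val = train_test_split(
    src_rest, tgt_rest, test_size=0.1, random_state=42)

# Write each side back out, e.g. train.src, valid.src, test.src.
for name, lines in [("train", src_train), ("valid", src_val), ("test", src_test)]:
    with open(name + ".src", "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")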
You may have a look at this: https://github.com/OpenNMT/OpenNMT-py/pull/1413
We created the multiple-dataset option for exactly this purpose.
@LanceNorskog
I agree completely that you can end up with a poorly mixed validation set if you do it that way. How are you reading your data in, if not through dataframes/read_csv? Are you reading the files manually with plain file reads? I always seem to get errors with read_csv that I can only fix by making it skip lines, which causes a ton of alignment issues, so I have just been sticking with the cat/split solution.
Ah! I understand about dirty data. It is possible to shuffle datasets with Unix command-line tools; on Linux there is a program called shuf.
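One caveat: running shuf (or any shuffle) on the source and target files independently destroys the sentence alignment, so the pairs have to be shuffled together. A minimal Python sketch under that assumption (filenames hypothetical; plain file reads also sidestep the read_csv parsing errors mentioned above):

import random

# Hypothetical filenames for the concatenated parallel corpus.
with open("combined.src", encoding="utf-8") as f:
    src = f.read().splitlines()
with open("combined.tgt", encoding="utf-8") as f:
    tgt = f.read().splitlines()

# Zip the two sides and shuffle the *pairs*, preserving alignment.
pairs = list(zip(src, tgt))
random.seed(1234)  # fixed seed for a reproducible shuffle
random.shuffle(pairs)
src, tgt = (list(side) for side in zip(*pairs))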
In the paper Data Ordering Patterns for Neural Machine Translation: An Empirical Study, ordering the training data by perplexity gains +1.8 BLEU on the IWSLT2015 English-Vietnamese dataset. A pretrained XLM-R could be perfect for preprocessing the training data of a large-scale NMT system, and there is already a function in evaluator.py
that calculates the perplexity:
def evaluate_clm(self, scores, data_set, lang1, lang2):
    """
    Evaluate perplexity and next word prediction accuracy.
    """