Breaking up Europarl Data into Training, Validation, and Testing

jchavezberkeley · April 23, 2021, 9:46pm

Hello there. I recently downloaded the Europarl v7 dataset for Spanish-English. It is very large, and I want to know how to split it into a training set, a validation set, and a test set. The files have the extensions .en and .es, so is there a method or tool I can use to split the data? Thanks

ymoslem · April 25, 2021, 8:07pm

Dear Jose,

I use this script: train_dev_test_split.py. You can find my other MT preparation scripts in the same repository.

Kind regards,
Yasmin

jchavezberkeley · April 29, 2021, 3:01pm

Thank you! I used your split tool to split my data.
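
For readers who land on this thread, here is a minimal sketch of the general idea behind splitting a parallel corpus. It is not the train_dev_test_split.py script mentioned above, whose contents are not shown in this thread. The function names, the default dev/test sizes of 2000 sentence pairs, the seed, and the file names in the usage comments are all assumptions chosen for illustration. The key point is that the .es and .en lines are zipped into pairs before shuffling, so each Spanish sentence stays aligned with its English translation.

```python
import random


def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def split_parallel(src_lines, tgt_lines, dev_size=2000, test_size=2000, seed=1):
    """Shuffle aligned sentence pairs and split them into train/dev/test.

    dev_size, test_size, and seed are illustrative defaults, not values
    taken from any particular tool.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if dev_size + test_size >= len(src_lines):
        raise ValueError("dev_size + test_size must be smaller than the corpus")

    # Pair the lines first so shuffling never breaks the alignment.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)

    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test


def write_split(pairs, prefix, src_ext="es", tgt_ext="en"):
    """Write sentence pairs to <prefix>.<src_ext> and <prefix>.<tgt_ext>."""
    with open(f"{prefix}.{src_ext}", "w", encoding="utf-8") as f_src, \
         open(f"{prefix}.{tgt_ext}", "w", encoding="utf-8") as f_tgt:
        for src, tgt in pairs:
            f_src.write(src + "\n")
            f_tgt.write(tgt + "\n")


# Example usage (file names are assumptions; adjust to your download):
# src = read_lines("europarl-v7.es-en.es")
# tgt = read_lines("europarl-v7.es-en.en")
# train, dev, test = split_parallel(src, tgt)
# write_split(train, "train")
# write_split(dev, "dev")
# write_split(test, "test")
```

This sketch loads both files fully into memory, which is usually workable for a corpus of this size but may need a streaming approach for much larger data. Fixing the seed makes the split reproducible across runs.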