OpenNMT Forum

Breaking up Europarl Data into Training, Validation, and Testing

Hello there. I recently downloaded the Europarl v7 dataset for Spanish-English. The data is really big, and I want to know how to split it into a training set, a validation set, and a test set. The files have the extensions .en and .es, so is there a method or tool I can use to split the data? Thanks.

Dear Jose,

I use this script train_dev_test_split.py. You can find my other MT preparation scripts in the same repository.
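In case it helps to see the general idea, here is a minimal sketch of such a split. It is not the code of train_dev_test_split.py; the function names, the default sizes of 2000 segments, and the fixed seed are all assumptions made for illustration. The key point is that source and target lines are shuffled together as pairs, so the .es and .en sides stay aligned.

```python
import random


def split_parallel(src_lines, tgt_lines, dev_size=2000, test_size=2000, seed=42):
    """Shuffle aligned sentence pairs and cut them into train/dev/test.

    Hypothetical helper for illustration; sizes and seed are arbitrary defaults.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    if dev_size + test_size >= len(src_lines):
        raise ValueError("dev_size + test_size must be smaller than the corpus")

    # Pair the lines first so one shuffle keeps both languages aligned.
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)

    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return {"train": train, "dev": dev, "test": test}


def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def write_pairs(pairs, prefix, src_ext="es", tgt_ext="en"):
    """Write pairs to <prefix>.<src_ext> and <prefix>.<tgt_ext>, one segment per line."""
    with open(f"{prefix}.{src_ext}", "w", encoding="utf-8") as fs, \
         open(f"{prefix}.{tgt_ext}", "w", encoding="utf-8") as ft:
        for src, tgt in pairs:
            fs.write(src + "\n")
            ft.write(tgt + "\n")
```

To use it, you would read both files with read_lines (for example corpus.es and corpus.en, placeholder names for your own paths), pass the two lists to split_parallel, and call write_pairs once for each of "train", "dev", and "test". This sketch loads the whole corpus into memory, which should be manageable for Europarl on most machines.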

Kind regards,
Yasmin

Thank you! I used your split tool to split my data.
