Best practices for multiple corpora

JptoEn · July 9, 2021, 12:20pm

Hi, I am attempting to build a Japanese<->English NMT system in the Colab environment. I managed to get a transformer proof of concept working, but it just outputs UNK on the test data (presumably because I didn’t do any pre-processing).

I have 4 significant corpora. Some of them were pre-split into train/validation/test, so those are fine. The others I need to split manually, so I will try this script: MT-Preparation/train_dev_test_split.py at main · ymoslem/MT-Preparation · GitHub
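
For the manual split, this is roughly what I have in mind: a minimal sketch using only the Python standard library, not the linked script. The function name `split_parallel`, the sizes, and the seed are my own placeholders.

```python
import random


def split_parallel(src_lines, tgt_lines, dev_size, test_size, seed=1234):
    """Shuffle aligned sentence pairs, then cut them into train/dev/test.

    Hypothetical helper for illustration; it is not part of any toolkit.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    # Keep each source line attached to its translation while shuffling.
    pairs = list(zip(src_lines, tgt_lines))
    if dev_size + test_size >= len(pairs):
        raise ValueError("dev_size + test_size must be smaller than the corpus")
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(pairs)
    dev = pairs[:dev_size]
    test = pairs[dev_size:dev_size + test_size]
    train = pairs[dev_size + test_size:]
    return train, dev, test
```

Shuffling the pairs together, rather than each file separately, keeps the source and target sides aligned.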

But should I concatenate all my validation data into two files (source and target) of 8000 lines, or take 500 lines from each corpus? And should the corpora be weighted equally or in proportion to their size? I ask because the config seems to support multiple training corpora but only one validation set.
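
To make the two options concrete, here is a rough sketch of how I would build the single validation set either way. It assumes each corpus already has its own aligned dev pairs in memory; `build_validation` and `write_parallel` are hypothetical helpers of mine, not functions from any toolkit.

```python
import random


def build_validation(dev_sets, per_corpus=None, seed=1234):
    """Merge per-corpus dev sets into one list of (source, target) pairs.

    dev_sets maps a corpus name to its aligned dev pairs. With
    per_corpus=None everything is concatenated; otherwise at most
    per_corpus pairs are sampled from each corpus.
    """
    rng = random.Random(seed)
    merged = []
    for name in sorted(dev_sets):  # sorted for a reproducible order
        pairs = list(dev_sets[name])
        if per_corpus is not None and len(pairs) > per_corpus:
            pairs = rng.sample(pairs, per_corpus)
        merged.extend(pairs)
    return merged


def write_parallel(pairs, src_path, tgt_path):
    """Write the pairs to two line-aligned files, one per language."""
    with open(src_path, "w", encoding="utf-8") as src_file, \
         open(tgt_path, "w", encoding="utf-8") as tgt_file:
        for src, tgt in pairs:
            src_file.write(src + "\n")
            tgt_file.write(tgt + "\n")
```

With four corpora, `build_validation(dev_sets, per_corpus=500)` would give 2000 pairs with every corpus equally represented, while `build_validation(dev_sets)` keeps all of the data, so the larger dev sets dominate the validation score.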

Thanks,
Matt