Change OPUS format to be similar to toy dataset

ThesisHopeful · November 1, 2022, 7:56am

Hi guys! Newbie here.

I checked the OpenSubtitles datasets, but I’m confused about why their format differs from the toy dataset used in the quickstart example.

For example, why does the English-to-German toy dataset have src-train, tgt-train, src-val, and tgt-val text files while the OPUS datasets don’t?

This probably sounds like asking to be spoon-fed, but all I really need is to be pointed in the right direction. What articles or documentation should I read? Is there a certain process I should follow before proceeding to the actual translation?

Thank you.