Building resources

(Terence Lewis) #1

I am building a training corpus for a language pair where the largest resource is the OpenSubtitles corpus (approx. 3M sentences). I have gathered other, smaller sets of bilingual texts in domains such as science, law and economics, and am creating a small hand-crafted set of “sentence patterns”. I am curious about other people’s experiences with such mixed resources. Would I fare better simply “mixing in” the smaller corpora with the subtitles, or showing the network multiple copies of the smaller data sets (i.e. oversampling them)? I want to avoid a result where every translation sounds like a character in an American movie :)
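
To make the two options concrete, here is a rough Python sketch of what I mean. The function `mix_corpora`, the `copies` argument and the toy sentence pairs are all made up for illustration; this is not any toolkit's API, just plain list handling. With no `copies` given, the corpora are simply concatenated and shuffled ("mixing in"); with a repeat count for a small corpus, the network would see that corpus several times per pass.

```python
import random


def mix_corpora(corpora, copies=None, seed=0):
    """Concatenate parallel corpora and shuffle the result.

    corpora: dict mapping a corpus name to a list of (source, target) pairs.
    copies:  optional dict mapping a corpus name to how many times it is
             repeated (hypothetical knob; corpora not listed appear once).
    """
    copies = copies or {}
    mixed = []
    for name, pairs in corpora.items():
        # Repeating the list is the "numerous copies" option.
        mixed.extend(pairs * copies.get(name, 1))
    # Shuffle so the small corpora are spread through the training data.
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy stand-ins for the real data (6 subtitle pairs, 2 legal pairs).
subtitles = [(f"sub src {i}", f"sub tgt {i}") for i in range(6)]
legal = [("law src 0", "law tgt 0"), ("law src 1", "law tgt 1")]
corpora = {"subtitles": subtitles, "legal": legal}

plain = mix_corpora(corpora)                             # simple mixing
oversampled = mix_corpora(corpora, copies={"legal": 3})  # legal seen 3x

print(len(plain), len(oversampled))
```

In the toy case the legal pairs go from a quarter of the data to half of it once they are repeated three times. What ratio is sensible for a real 3M-sentence subtitle corpus is exactly what I am unsure about.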