I’m trying to train a model on an English-Russian parallel corpus. The corpus is rather big (more than 50M rows). I’ve also added an English-Russian dictionary of about 50k pairs to this corpus.
However, many examples are translated incorrectly. For example, the English word BUTTERCUP (in upper case) should be translated as ЛЮТИК, which is a flower. But I always get БАБОЧКА, which means butterfly. I’ve looked into my data, and the word butterfly occurs 10 times more often than the word buttercup. So I tried the following approach: I generated about 1000 synthetic rows containing BUTTERCUP and added them to the training data. It worked fine.
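To make the synthetic-row idea concrete, here is a minimal sketch of how such rows could be generated. The templates, the casing variants, and the function name `make_synthetic_rows` are illustrative assumptions, not the exact procedure I used; the templates keep the word in its dictionary form so that no Russian inflection is needed.

```python
import random

# Hypothetical template pairs (source, target). They keep the word in its
# citation form, so the Russian side never needs to be inflected.
TEMPLATES = [
    ("{src}", "{tgt}"),
    ("{src}.", "{tgt}."),
    ('"{src}"', "«{tgt}»"),
    ("Word: {src}", "Слово: {tgt}"),
]

# Casing variants applied identically to both sides of a pair.
CASINGS = [str.upper, str.lower, str.capitalize]


def make_synthetic_rows(src_word, tgt_word, n_rows, seed=0):
    """Return n_rows (source, target) pairs built from random templates and casings."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    rows = []
    for _ in range(n_rows):
        src_tpl, tgt_tpl = rng.choice(TEMPLATES)
        case = rng.choice(CASINGS)
        rows.append((
            src_tpl.format(src=case(src_word)),
            tgt_tpl.format(tgt=case(tgt_word)),
        ))
    return rows


rows = make_synthetic_rows("buttercup", "лютик", 1000)
```

The resulting pairs would then be appended to the training data. Repeating this for every dictionary entry that the model gets wrong is possible, but it does not scale well, which is why I am asking about alternatives.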
I’ve also tried removing single words from the validation dataset, but it didn’t help.
My question is: are there better, more effective approaches to make a model translate single words exactly as they should be?
Many thanks in advance!