Hello, community!
I’m trying to train a model on an English–Russian parallel corpus. The corpus is rather large (more than 50M rows). I’ve also added an English–Russian dictionary containing about 50k pairs to it.
However, many examples are translated incorrectly. For example, the English word BUTTERCUP (in upper case) should be translated as ЛЮТИК, which is a flower, but I always get БАБОЧКА, which means butterfly. I’ve looked into my data, and the word butterfly occurs about 10 times more often than the word buttercup. So I tried the following approach: generating about 1000 synthetic rows containing BUTTERCUP and adding them to the training data. It worked fine.
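To make the synthetic-row idea concrete, here is a rough Python sketch. The templates and the function name `make_synthetic_rows` are hypothetical placeholders for illustration, not my exact script, and the default of 1000 rows is just the approximate figure I used.

```python
import random

# Hypothetical carrier templates as (source, target) pairs.
# "{}" marks the slot where the term is inserted.
TEMPLATES = [
    ("{}", "{}"),
    ("{}.", "{}."),
    ("Word: {}", "Слово: {}"),
]

def make_synthetic_rows(src_term, tgt_term, n_rows=1000, seed=0):
    """Return n_rows (source, target) pairs that all contain the given term pair."""
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    rows = []
    for _ in range(n_rows):
        src_tpl, tgt_tpl = rng.choice(TEMPLATES)
        rows.append((src_tpl.format(src_term), tgt_tpl.format(tgt_term)))
    return rows

rows = make_synthetic_rows("BUTTERCUP", "ЛЮТИК")
```

The resulting pairs would then be appended to the training files in whatever format the toolkit expects. The drawback is that this has to be repeated for every rare word that gets confused with a more frequent one.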
I’ve also tried removing single words from the validation dataset, but it didn’t help.
My question is: are there better and more effective approaches to make a model translate single words exactly as they should be?
Many thanks in advance!