Single words incorrect translation

Dmitry · July 21, 2023, 10:19am

Hello, community!

I’m trying to train a model on an English-Russian parallel corpus. The corpus is rather big (more than 50M rows). I’ve also added an English-Russian dictionary of about 50k pairs to it.

However, many examples are translated incorrectly. For example, the English word BUTTERCUP (in upper case) should be translated as ЛЮТИК, which is a flower, but I always get БАБОЧКА, which means butterfly. I looked into my data, and the word butterfly occurs 10 times more often than the word buttercup. So I tried the following approach: I generated about 1000 synthetic rows containing BUTTERCUP and added them to the training data. It worked fine.
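A minimal sketch of that kind of oversampling, assuming the synthetic rows are just case and punctuation variants of the bare word pair (the function name `make_synthetic_rows` and the choice of variants are illustrative, not a fixed recipe):

```python
import random


def make_synthetic_rows(src_word, tgt_word, n_rows=1000, seed=0):
    """Return n_rows (source, target) pairs built from one dictionary entry.

    Assumption: varying only casing and trailing punctuation is enough;
    the same variation is applied to both sides so the pair stays parallel.
    """
    rng = random.Random(seed)
    casings = [str.upper, str.lower, str.capitalize]
    endings = ["", ".", "!", "?"]
    rows = []
    for _ in range(n_rows):
        case = rng.choice(casings)
        end = rng.choice(endings)
        rows.append((case(src_word) + end, case(tgt_word) + end))
    return rows


rows = make_synthetic_rows("buttercup", "лютик", n_rows=1000)
```

The resulting rows would then be appended to the source and target training files, one pair per line.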

I’ve also tried removing single words from the validation dataset, but it didn’t help.

My question: are there better and more effective approaches to make a model translate single words exactly as they should be?

Many thanks in advance!

SamuelLacombe · July 26, 2023, 11:03am

Hello,

Have you considered using multiple outputs when translating (with the beam size)?
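As a rough sketch of how several outputs could be used, assuming the toolkit returns an n-best list ordered best first: a single-word query is checked against the dictionary, and the first hypothesis that matches the dictionary entry is preferred. The helper `pick_hypothesis` and the n-best list below are made up for illustration.

```python
def pick_hypothesis(source, hypotheses, dictionary):
    """Pick one translation from an n-best list (best hypothesis first).

    If the source is a dictionary headword and one of the hypotheses equals
    the dictionary translation (ignoring case), return that hypothesis.
    Otherwise fall back to the model's top hypothesis.
    """
    expected = dictionary.get(source.strip().lower())
    if expected is not None:
        for hyp in hypotheses:
            if hyp.strip().lower() == expected.lower():
                return hyp
    return hypotheses[0]


dictionary = {"buttercup": "лютик"}
nbest = ["БАБОЧКА", "ЛЮТИК", "МАСЛЁНКА"]  # illustrative n-best list, not real model output
best = pick_hypothesis("BUTTERCUP", nbest, dictionary)
```

This only helps when the correct word actually appears somewhere in the beam, so it is a complement to fixing the training data, not a replacement.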

One thing you can do is use n-grams to enrich your data; see “How to generate ngrams”.

Dmitry · July 27, 2023, 10:35am

Samuel, many thanks for answering! I’ve read the description of your approach several times, but I couldn’t grasp what exactly I need to do to reproduce it. Maybe I’ll get it later.

SamuelLacombe · July 28, 2023, 11:18am

Hello,

No need to do exactly as I did. Having just the 1-grams should be sufficient in your case.

You can try running the aligner, see what the output looks like, and then work out how to filter the results so that you keep the data you want.
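A minimal sketch of the 1-gram case, assuming the aligner writes Pharaoh-style links (`i-j` meaning source token `i` aligns to target token `j`, both zero-based) and that sentences are whitespace-tokenized. The function names, the `min_count` filter, and the sample alignments are illustrative assumptions.

```python
from collections import Counter


def extract_word_pairs(src_sentence, tgt_sentence, alignment):
    """Return (source word, target word) pairs for one-to-one alignment links."""
    src = src_sentence.split()
    tgt = tgt_sentence.split()
    links = [tuple(map(int, link.split("-"))) for link in alignment.split()]
    src_counts = Counter(i for i, _ in links)
    tgt_counts = Counter(j for _, j in links)
    # Skip words that take part in one-to-many or many-to-one links.
    return [
        (src[i], tgt[j])
        for i, j in links
        if src_counts[i] == 1 and tgt_counts[j] == 1
    ]


def build_unigram_rows(corpus, min_count=2):
    """Keep word pairs seen at least min_count times across the corpus."""
    counts = Counter()
    for src, tgt, align in corpus:
        counts.update(extract_word_pairs(src, tgt, align))
    return [pair for pair, count in counts.items() if count >= min_count]


# Hand-written sample rows with made-up alignments, for illustration only.
corpus = [
    ("the buttercup is yellow", "лютик жёлтый", "1-0 3-1"),
    ("a buttercup grows here", "здесь растёт лютик", "1-2 2-1 3-0"),
]
unigram_rows = build_unigram_rows(corpus, min_count=2)
```

The frequency threshold is one simple way to filter noisy links; the surviving pairs can be added to the training data as extra single-word rows.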

Best regards,
Samuel