OpenNMT Forum

Are you interested in training on a Russian-Abkhazian parallel corpus?

Hello everyone,
Is anyone interested in training on a Russian-Abkhazian parallel corpus?
There are currently 20,600 lines of text; I will soon add more, and the corpus should reach 30,000 Russian-Abkhazian pairs.

I can’t share it publicly because of copyright, but I can share it privately. If someone is interested in training on it, all I ask is that you share the best model you come up with back with me.

The reason I can’t do it myself is my limited computing resources.

Looking forward to hearing from you.
Nart.
(P.S. If I’m provided with remote access to compute resources, I can do the training myself.)

Hi Nart, from my experience you probably won’t have enough sentence pairs to train a useful model unless it’s for a very restricted domain. Perhaps you should read up on building a synthetic corpus by back-translation, or get some volunteers to translate more sentences for you. I recently trained a few models with around 150,000 sentence pairs and the results were very disappointing. Good luck, it sounds like a great project!
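
For readers new to the idea, here is a minimal sketch of how back-translation builds a synthetic corpus: monolingual target-side text is paired with machine translations of itself into the source language. The translate_to_source function and the file names are hypothetical placeholders, not a real OpenNMT API.

```python
# Minimal back-translation sketch: pair genuine monolingual target-side
# sentences with machine translations of them into the source language.

def translate_to_source(sentence: str) -> str:
    """Hypothetical stand-in for a trained target->source model
    (e.g. Abkhazian->Russian when the goal is Russian->Abkhazian)."""
    raise NotImplementedError("plug in your reverse-direction model")

def build_synthetic_corpus(mono_tgt_path, out_src_path, out_tgt_path):
    """Write a synthetic parallel corpus from monolingual target text."""
    with open(mono_tgt_path, encoding="utf-8") as mono, \
         open(out_src_path, "w", encoding="utf-8") as src_out, \
         open(out_tgt_path, "w", encoding="utf-8") as tgt_out:
        for line in mono:
            target = line.strip()
            if not target:
                continue
            src_out.write(translate_to_source(target) + "\n")  # synthetic
            tgt_out.write(target + "\n")                       # genuine

# Placeholder file names:
# build_synthetic_corpus("mono.abk.txt", "synthetic.rus.txt", "synthetic.abk.txt")
```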

Hello Lewis,
I appreciate your feedback.
This is an ongoing effort; it shouldn’t stop at 30k.
The corpus I’ve made mostly has paragraphs as pairs, so there is a good chunk of text in each line.
It would be interesting to check its effectiveness in such a setting.

With such a small corpus you should be able to train fairly quickly on a CPU. The quickstarts for OpenNMT-py (PyTorch) and OpenNMT-tf (TensorFlow) are very easy to follow.
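
One bit of preparation both quickstarts expect: the corpus has to be split into aligned training and validation files. A minimal sketch of one way to do that (the file names are placeholders):

```python
import random

def split_corpus(src_path, tgt_path, n_valid=1000, seed=13):
    """Carve a held-out validation set from an aligned corpus."""
    with open(src_path, encoding="utf-8") as f:
        src = f.read().splitlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt = f.read().splitlines()
    assert len(src) == len(tgt), "source and target must stay aligned"

    pairs = list(zip(src, tgt))
    random.Random(seed).shuffle(pairs)  # reproducible shuffle
    valid, train = pairs[:n_valid], pairs[n_valid:]

    for name, subset in (("train", train), ("valid", valid)):
        with open(f"{name}.rus.txt", "w", encoding="utf-8") as s_out, \
             open(f"{name}.abk.txt", "w", encoding="utf-8") as t_out:
            for s, t in subset:
                s_out.write(s + "\n")
                t_out.write(t + "\n")

split_corpus("corpus.rus.txt", "corpus.abk.txt")
```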

Do you have a large amount of monolingual text? This could be helpful for pretraining. Otherwise, I would consider this a low-resource setting for SMT.

I actually did train with the 20k, but it isn’t getting higher than 10% accuracy.
I am thinking of using the Transformer model instead, and 30k would get a better result, but that needs a better computer.
Thank you for the information you shared about pseudo-parallel corpora and back-translation; I didn’t know about them.

I only found out today about using monolingual text in training. It won’t be hard to get the Russian monolingual text.
As for Abkhazian, I don’t currently have a ready-to-use monolingual corpus; I will have to build one, so it has been added to my to-do list.
What are your thoughts on this?

Wikipedia is always a good source for natural language text. Perhaps you can get the corpus from BaltoSlav.

You don’t need a home server; for research projects the free tiers of cloud compute services can cope.

Try using the monolingual text as copied training data (the same text on both the source and target side) before you use filtered back-translation.
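
A minimal sketch of that copied-monolingual setup, assuming plain text files whose names are placeholders: the same target-side line is written to both sides of the training data.

```python
# Copied-monolingual sketch: each target-side line also serves as its own
# source. File names are placeholders for your monolingual Abkhazian text.
with open("mono.abk.txt", encoding="utf-8") as mono, \
     open("copied.src.txt", "w", encoding="utf-8") as src_out, \
     open("copied.tgt.txt", "w", encoding="utf-8") as tgt_out:
    for line in mono:
        src_out.write(line)  # identical line on the source side...
        tgt_out.write(line)  # ...and on the target side
```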


Have a look at wikiextractor to get the plain text out of the XML dumps. Do you have a tool to split the sentences?
To prevent the model from learning a simple copy mechanism, consider masking or permuting the source (if you have a few times more monolingual than bilingual text and the initial back-translation quality isn’t high enough).
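
For concreteness, a small sketch of the kind of source-side noise meant here: word masking plus a local shuffle, in the spirit of the noise functions used in unsupervised NMT work. The mask token and the rates are arbitrary assumptions, not OpenNMT settings.

```python
import random

MASK = "<mask>"  # arbitrary placeholder token; add it to your vocabulary

def noise_source(tokens, p_mask=0.1, max_shuffle_dist=3, rng=random):
    """Mask some tokens, then locally permute the sequence."""
    # 1) Randomly replace tokens with the mask symbol.
    noised = [MASK if rng.random() < p_mask else tok for tok in tokens]
    # 2) Local shuffle: sort positions by index plus bounded random jitter,
    #    so no token moves more than max_shuffle_dist places.
    keys = [i + rng.uniform(0, max_shuffle_dist) for i in range(len(noised))]
    return [tok for _, tok in sorted(zip(keys, noised), key=lambda p: p[0])]

print(" ".join(noise_source("я читаю книгу на абхазском языке".split())))
```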

The extraction was done yesterday with wikiextractor. For splitting the sentences, Atom with regex is the current tool in hand.
There will be much more monolingual than bilingual text, so I will probably look at masking at some point.
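
For anyone following along, a rough sketch of that pipeline in Python rather than in an editor: assuming wikiextractor was run with its --json option, each output line is a JSON document with a text field, and a naive regex split on sentence-final punctuation gives a first pass at sentence splitting (it will mis-split around abbreviations, so expect to refine it).

```python
import json
import re
from pathlib import Path

# Naive first-pass splitter: break after sentence-final punctuation
# followed by whitespace. It will mis-split abbreviations and initials.
SENT_BOUNDARY = re.compile(r"(?<=[.!?…])\s+")

def sentences_from_wikiextractor(dump_dir):
    """Yield sentences from wikiextractor --json output under dump_dir."""
    for path in sorted(Path(dump_dir).rglob("wiki_*")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)  # one article per line
                for sent in SENT_BOUNDARY.split(doc["text"]):
                    sent = " ".join(sent.split())  # collapse whitespace
                    if sent:
                        yield sent

# Placeholder directory name:
# for s in sentences_from_wikiextractor("extracted/"):
#     print(s)
```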