Are you interested in training on a Russian-Abkhazian parallel corpus?

Hello everyone,
Is anyone interested in training on a Russian-Abkhazian parallel corpus?
There are 20,600 lines of text so far, and I will add more soon; the Russian-Abkhazian parallel corpus should reach 30,000 lines.

I can’t share it publicly because of copyright, but I can share it privately. If you are interested in training, all I ask is that you share with me the best model you come up with.

I can’t do it myself because of limited resources.

Looking forward to hearing from you.
(P.S. If I’m given remote access to resources, I can do the training myself.)

Hi Nart, in my experience you probably won’t have enough sentence pairs to train a useful model unless it’s for a very restricted domain. Perhaps you should read up on building a synthetic corpus by back-translation, or get some volunteers to translate more sentences for you. I recently trained a few models with around 150,000 sentence pairs and the result was very disappointing. Good luck, it sounds like a great project!

Hello Lewis,
I appreciate your feedback.
This is an ongoing effort; it won’t stop at 30k.
Most pairs in the corpus I’ve made are paragraphs, so each line holds a good chunk of text.
It would be interesting to check its effectiveness in such a setting.

With such a small corpus you should be able to train fairly quickly on a CPU. The quick starts for PyTorch and TensorFlow are very easy to follow.

Do you have a large amount of monolingual text? This could be helpful for pretraining. Otherwise, I would consider this a low-resource setting for SMT.

I actually did train with the 20k, but it’s not getting higher than 10% accuracy.
I am thinking that using the Transformer model instead, with 30k, would give a better result, but that needs a better computer.
Thank you for the information you shared about pseudo-parallel corpora and back-translation; I didn’t know about them.

I found out about the use of monolingual text in training just today. It won’t be hard to get Russian monolingual text.
As for Abkhazian, I don’t currently have a ready-to-use monolingual corpus; I will have to build it, and it has been added to my to-do list.
What are your thoughts on this?

Wikipedia is always a good source for natural language text. Perhaps you can get the corpus from BaltoSlav.

You don’t need a home server; for research projects, the free tiers of hosted compute can cope.

Try copying the monolingual text (using each sentence as its own source) before you use filtered back-translation.
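
One way to read the copying suggestion: every target-language monolingual sentence is added to the training data as a pair whose source and target are identical, mixed in with the real parallel pairs. A minimal sketch under that assumption; the function name and the sample strings are made up for illustration.

```python
def add_copied_pairs(parallel_pairs, target_monolingual):
    """Return the parallel pairs plus (sentence, sentence) copies of monolingual text."""
    # Skip empty lines so no empty training examples are created.
    copied = [(s, s) for s in target_monolingual if s.strip()]
    return list(parallel_pairs) + copied


# Placeholder strings stand in for real Russian / Abkhazian sentences.
pairs = [("ru sentence", "ab sentence")]
mono = ["ab only 1", "", "ab only 2"]
mixed = add_copied_pairs(pairs, mono)
print(len(mixed), mixed[1])
```

The copied pairs would normally be shuffled together with the real ones before training.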


Have a look at wikiextractor to get the plain text from the XML dumps. Do you have a tool to split the sentences?
To keep the model from learning a simple copy mechanism, consider masking or permuting the source (if you have a few times more monolingual than bilingual text and the initial back-translation quality isn’t high enough).
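
To make the masking and permutation idea concrete, here is a rough sketch of a source-side noising function. The mask token, the probabilities and the function name are assumptions chosen for illustration, not settings from any particular toolkit.

```python
import random


def noise_source(tokens, mask_prob=0.1, max_shift=3, mask_token="<mask>", rng=None):
    """Mask some tokens and shuffle the rest locally so the source is not an exact copy."""
    rng = rng or random.Random()
    # Replace each token by the mask symbol with probability mask_prob.
    masked = [mask_token if rng.random() < mask_prob else t for t in tokens]
    # Local permutation: sort by original index plus a random offset in [0, max_shift].
    keys = [i + rng.uniform(0, max_shift) for i in range(len(masked))]
    return [tok for _, tok in sorted(zip(keys, masked), key=lambda pair: pair[0])]


sentence = "this is a small example sentence".split()
print(noise_source(sentence, rng=random.Random(1)))
```

With `mask_prob=0` and `max_shift=0` the function returns the input unchanged, which is a quick sanity check before applying it to the copied monolingual side.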

The extraction was done yesterday with wikiextractor. For splitting sentences, the tool currently at hand is Atom with regex.
There will be much more monolingual than bilingual text, so masking should probably be looked at at some point.
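
For reference, the regex-based splitting can also be scripted instead of done in an editor. This is a naive sketch, not a proper sentence segmenter: it splits after sentence-final punctuation and merges a piece back when it does not start with an uppercase letter. It will still split wrongly after abbreviations that are followed by a capitalised word, and sentences that start with a digit get merged into the previous one.

```python
import re


def split_sentences(text):
    """Naively split text into sentences on ., !, ? or an ellipsis character."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    sentences = []
    for part in parts:
        if not part:
            continue
        # Look past opening quotes/brackets to find the first real character.
        first = part.lstrip("«\"'([")[:1]
        if sentences and not first.isupper():
            # Probably an abbreviation or a number: glue it to the previous piece.
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    return sentences


print(split_sentences("Привет. Как дела? См. рис. 1. Далее текст."))
```

Using `str.isupper()` rather than a fixed character class means the extra Cyrillic capitals used in Abkhazian are handled as well.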

@Bachstelze What are your thoughts on using pretrained word embeddings first?
I can get the Russian ones from Facebook’s pretrained fastText embeddings and build Abkhazian word embeddings with fastText using the Abkhaz monolingual corpus, which has 10M words.
Afterwards, I could apply forward and back-translation with the methods you’ve mentioned above.

As far as I understand, to use pretrained word embeddings for a Russian (src)-Abkhaz (tgt) model, I need to run the following command:

onmt_train -data data/demo20k -save_model enru20k-model -world_size 1 -gpu_ranks 0 \
-pre_word_vecs_enc /data/ \
-pre_word_vecs_dec /data/
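
Before vectors can be passed to a training command, they have to be lined up with the model vocabulary. The sketch below only illustrates that idea in plain Python: it parses the fastText `.vec` text format (a header line with count and dimension, then one word per line followed by its values) and builds one row per vocabulary entry, with small random values for words that have no pretrained vector. The function names and the toy data are assumptions; the toolkit's own conversion step is not shown here.

```python
import io
import random


def load_vec(handle):
    """Parse fastText .vec text: header 'count dim', then 'word v1 ... vdim' per line."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


def build_matrix(vocab, dim, vectors, rng=None):
    """Return one embedding row per vocabulary word and the number of pretrained hits."""
    rng = rng or random.Random(0)
    matrix, hits = [], 0
    for word in vocab:
        if word in vectors:
            matrix.append(vectors[word])
            hits += 1
        else:
            # No pretrained vector: fall back to a small random initialisation.
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, hits


toy_vec = io.StringIO("2 3\nмир 0.1 0.2 0.3\nдом 0.4 0.5 0.6\n")
dim, vectors = load_vec(toy_vec)
matrix, hits = build_matrix(["<unk>", "мир", "дом"], dim, vectors)
print(f"{hits} of {len(matrix)} vocabulary words have pretrained vectors")
```

The hit count is worth checking on the real vocabulary: a low coverage would suggest a tokenisation mismatch between the embeddings and the training data.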

Did you get a copy of the whole corpus? We should ask the authors for a complete download, otherwise we have to query every single document.
How is the quality of the Wikipedia extraction?

Pretrained word embeddings are good for low resource NMT settings. And aligned embeddings could improve the results further. But it is not clear how good they are if the same monolingual corpus is used again for back-translation (or other incorporation strategies).
It makes sense to use pretrained word embeddings if you plan to use those embeddings again (e.g. for other language pairs). For a single system it seems like unnecessary overhead, which limits you to a fixed vocabulary or to generating out-of-vocabulary words with another program. That problem is well solved with sentencepiece.

Yes, I did. Paul gave me the corpus in text format; the markup needs to be removed, but it requires minimal cleaning. The Wiki, on the other hand, is messier.

This research is in its early stage now, but I am hoping that at some point a clear strategy will emerge to tackle this task.

Aye, very nice that you received such a good corpus!
Can you share your monolingual and parallel corpus with us? Then we could fine-tune the universal language space of mBART to Russian-Abkhazian. If we also have a model for Abkhazian-Russian, then we could use back-translation. Though the fairseq toolkit is a different framework than OpenNMT. What do you think?

This sounds great! I added you as a contributor to both repos on GitHub.
I can’t share the corpus publicly with everyone because of copyright, but it can be shared privately.
From my end, I’ll work on cleaning and aligning the material we have for the parallel corpus.


@Bachstelze I can see that you are now a collaborator on the parallel corpus.
I reinvited you to the monolingual corpus.
Give me a few days to get the corpora ready for you before you start any serious training.


@Bachstelze It will take more than a few days to clean the monolingual corpus.
You have not accepted the invitation to that repo, is everything OK?

Yes, everything is perfect. I have accepted the first invitation to the repo with your recent commit, and I am already preparing the training setup. The test setup will be a Russian-Abkhazian and Abkhazian-Russian translation model based on the pretrained 12-layer Transformer. It will differ from the en-ro example training in that the translation is bidirectional and uses monolingual text. After a successful test setup we can train it on the clean data and then perform back-translation with this two-way translation model.

It would be worth having a scrum meeting/update at least once a week to see where we stand; it will also guide us forward.
From my end, these are the tasks I am working on:

  1. a. I have trained an Abkhazian-Russian model with the default python model (Small MT)
    b. The data consisted of the ab-ru corpus: 18k lines for training, 2k for validation and 600 for testing
    c. Results are the following:

    [2020-03-16 12:31:48,985 INFO] number of examples: 2000
    [2020-03-16 12:31:55,093 INFO] Validation perplexity: 64066.9
    [2020-03-16 12:31:55,093 INFO] Validation accuracy: 25.5176
    [2020-03-16 12:31:55,495 INFO] Saving checkpoint
    Using test data: BLEU = 9.13, 22.5/10.7/8.4/7.3 (BP=0.830, ratio=0.843, hyp_len=8149, ref_len=9667)

  2. Testing Bitextor in the hope of automating parallel corpus alignment and cleaning.

  3. Manually cleaning and adding sentences to our current parallel corpus. (In the draft folder, the files have around 9k lines (250k words) to be aligned and cleaned.)
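
As a quick sanity check on the BLEU line in item 1, the score can be recomputed from its parts: the brevity penalty follows from the hypothesis and reference lengths, and BLEU is that penalty times the geometric mean of the four n-gram precisions. The helper below is a small sketch written for this check; its name is made up, and it works from the rounded precisions printed in the log, so it only approximates the reported value.

```python
import math


def bleu_from_parts(precisions_pct, hyp_len, ref_len):
    """Recompute corpus BLEU from n-gram precisions (in percent) and lengths."""
    # Brevity penalty: 1 if the hypothesis is at least as long as the reference.
    bp = 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)
    # Geometric mean of the precisions, computed in log space.
    log_mean = sum(math.log(p / 100.0) for p in precisions_pct) / len(precisions_pct)
    return 100.0 * bp * math.exp(log_mean), bp


bleu, bp = bleu_from_parts([22.5, 10.7, 8.4, 7.3], hyp_len=8149, ref_len=9667)
print(f"BP = {bp:.3f}, BLEU = {bleu:.2f}")
```

The brevity penalty matches the logged 0.830, and the recomputed BLEU lands close to the reported 9.13; the small gap comes from the precisions being rounded to one decimal in the log. The penalty also shows that the hypotheses are noticeably shorter than the references.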

Wow, the results are looking promising! I didn’t expect any BLEU results with only the small parallel corpus.
Yes, one meeting at the beginning of the week is a good habit for constructive work or further meetings. In which form do you want to do the meeting?
For now I will work on fine-tuning the pretrained mBART into a two-way model. Though the part for denoising the monolingual data is completely missing. Is it alright if I link the private repo in Colab?
Do you know if the Russian versions of the constitution and parliament texts have a different encoding? With the sentencepiece model from mBART, 2/3 of the tokens in those files are unknown. The rest of the files are processed without objection.
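
One way to check the encoding question is to look at the raw bytes of the affected files. The sketch below is a heuristic, assuming the only candidates are UTF-8, Windows-1251 and KOI8-R: it first tries strict UTF-8, and otherwise picks the legacy encoding under which the text comes out mostly lowercase, since Russian text decoded with the wrong one of those two tends to appear in inverted case. The function name is made up for illustration.

```python
def guess_encoding(raw):
    """Guess whether raw bytes are UTF-8, cp1251 or KOI8-R Russian text (heuristic)."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    def score(encoding):
        text = raw.decode(encoding, errors="replace")
        lower = sum(1 for c in text if "а" <= c <= "я")
        upper = sum(1 for c in text if "А" <= c <= "Я")
        # Natural text has far more lowercase than uppercase letters.
        return lower - upper

    return max(("cp1251", "koi8_r"), key=score)


sample = "Конституция Республики Абхазия".encode("cp1251")
print(guess_encoding(sample))
```

If the files turn out to be valid UTF-8, the unknown tokens have some other cause, for example characters that are simply missing from the sentencepiece vocabulary.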

To 1. a) What is the default python model (Small MT)? Probably the small model used in the OpenNMT tutorial with PyTorch.
The amount of crawlable parallel text will keep growing. So it could be okay to process the complete text at a later stage and include the data in incremental training (with back-translation).