Are you interested in training on a Russian-Abkhazian parallel corpus?

Yes, with interesting results! I used the small model from the OpenNMT tutorial with PyTorch. Abkhazian is a fully ergative language, highly ambiguous and highly morphological; those factors could play some role.
You can share the data privately and use it in any way you see fit.
I couldn’t say whether those two files have different encodings; I’m not sure if you meant the ASCII code by that.
If you could point out to me how to denoise the monolingual data, I could work on that part.
We could set up a 15-minute Skype meeting on Monday. I have a flexible schedule; let me know what time works best for you.

Abkhazian is a fully ergative language, highly ambiguous and highly morphological; those factors could play some role.

Abkhazian sounds complex and very interesting. I hope we can integrate it into a multilingual system. For now, the bilingual translation system should benefit from byte pair encoding to handle the rich morphology.
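
Since the exact subword setup isn’t shown in this thread, here is a minimal sketch of training such a BPE model with sentencepiece; the input file, model prefix and vocabulary size are placeholder assumptions, not the values used here.

```python
# Minimal BPE training sketch with sentencepiece (file names and vocab size are placeholders).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.ab",            # hypothetical raw Abkhazian text, one sentence per line
    model_prefix="bpe.ab",       # writes bpe.ab.model and bpe.ab.vocab
    vocab_size=8000,             # tune to the corpus size
    model_type="bpe",
    character_coverage=1.0,      # keep the full Abkhazian alphabet
)

sp = spm.SentencePieceProcessor(model_file="bpe.ab.model")
with open("train.ab", encoding="utf-8") as f:
    print(sp.encode(f.readline().strip(), out_type=str))  # subword pieces of the first sentence
```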

I’m not sure if you meant the ASCII code by that.

Yes, the Russian text is okay after saving it with the system editor; that set all files to UTF-8. But roughly one in ten tokens in the Abkhazian files is unknown:

2020-03-21 23:03:53 | INFO | fairseq_cli.preprocess | [ab] Dictionary: 250000 types
2020-03-21 23:04:01 | INFO | fairseq_cli.preprocess | [ab] train.bpe.ab-ru.ab: 21845 sents, 1513056 tokens, 9.97% replaced by <unk>
2020-03-21 23:04:01 | INFO | fairseq_cli.preprocess | [ab] Dictionary: 250000 types
2020-03-21 23:04:01 | INFO | fairseq_cli.preprocess | [ab] valid.bpe.ab-ru.ab: 186 sents, 13200 tokens, 10.0% replaced by <unk>
2020-03-21 23:04:01 | INFO | fairseq_cli.preprocess | [ab] Dictionary: 250000 types
2020-03-21 23:04:02 | INFO | fairseq_cli.preprocess | [ab] test.bpe.ab-ru.ab: 31 sents, 2266 tokens, 8.16% replaced by <unk>
2020-03-21 23:04:02 | INFO | fairseq_cli.preprocess | [ru] Dictionary: 250000 types
2020-03-21 23:04:06 | INFO | fairseq_cli.preprocess | [ru] train.bpe.ab-ru.ru: 21845 sents, 674018 tokens, 0.00712% replaced by <unk>
2020-03-21 23:04:06 | INFO | fairseq_cli.preprocess | [ru] Dictionary: 250000 types
2020-03-21 23:04:07 | INFO | fairseq_cli.preprocess | [ru] valid.bpe.ab-ru.ru: 186 sents, 5557 tokens, 0.0% replaced by <unk>
2020-03-21 23:04:07 | INFO | fairseq_cli.preprocess | [ru] Dictionary: 250000 types
2020-03-21 23:04:07 | INFO | fairseq_cli.preprocess | [ru] test.bpe.ab-ru.ru: 31 sents, 1048 tokens, 0.0% replaced by <unk>
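
As a rough cross-check of those numbers (not how fairseq computes them internally, just the same idea), one can count how many BPE tokens of the Abkhazian file are missing from the pretrained dictionary; the paths are placeholders.

```python
# Rough unk-rate check: share of BPE tokens missing from a fairseq-style dictionary
# (one "token count" pair per line). Paths are placeholders.
def unk_rate(bpe_path, dict_path):
    with open(dict_path, encoding="utf-8") as f:
        vocab = {line.split()[0] for line in f if line.strip()}
    total = unknown = 0
    with open(bpe_path, encoding="utf-8") as f:
        for line in f:
            for token in line.split():
                total += 1
                unknown += token not in vocab
    return unknown / max(total, 1)

print(f"{unk_rate('train.bpe.ab-ru.ab', 'dict.txt'):.2%} of tokens would be replaced by <unk>")
```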

It seems that the vocabulary is too different from that of the pretrained mBART languages. We could add the missing, untrained tokens, but I think it would probably be better to start masked denoising from a dedicated vocabulary and a new two-way model, like the supervised example. Predicting a future N-gram might be an even better training objective.
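
To make the masked-denoising idea concrete, here is a toy sketch of building one MASS-style training pair: a contiguous span of the input is masked and becomes the target the decoder has to reconstruct. It only prepares the data, not the model, and the masking ratio is an assumption.

```python
# Toy MASS-style span masking: mask a contiguous fragment, predict the fragment.
import random

def mask_span(tokens, mask_token="<mask>", ratio=0.5, rng=random):
    span_len = max(1, int(len(tokens) * ratio))
    start = rng.randrange(0, len(tokens) - span_len + 1)
    source = tokens[:start] + [mask_token] * span_len + tokens[start + span_len:]
    target = tokens[start:start + span_len]   # the fragment the decoder should reconstruct
    return source, target

src, tgt = mask_span("это простой пример для маскирования".split())
print(src)
print(tgt)
```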

Do you know of a Russian-Abkhazian dictionary we could use to enlarge the parallel corpus?

Let me know what time works best for you.

How about midday UTC+3?

A system that covers different language families, with different features, might give a balanced universal multilingual system. Right now mBART consists of 25 languages that are tied to one language, which might result in biased hyperparameters for the presumed parent universal model. As far as I know, their next step is to build an mBART that consists of 100 languages; I hope they take that factor into consideration.

If you mean an actual dictionary, here’s a link: https://drive.google.com/file/d/1PbJhM2XPH2pFOeVdXknNAEDBQo7y8dfb/view?usp=drivesdk

That sounds good. Let’s go for 13:00 UTC+3.

With XLM-RoBERTa there is a pretrained encoder that is scaled to a hundred languages, including Armenian, Azerbaijani, Georgian, Greek and Russian. Nonetheless, nearly 10% of the Abkhazian tokens are unknown with this sentencepiece model. The results are consistent, considering that the Pontic language family is completely missing from the vocabulary calculation:

2020-03-22 12:48:37 | INFO | fairseq_cli.preprocess | [ab] Dictionary: 250000 types
2020-03-22 12:48:45 | INFO | fairseq_cli.preprocess | [ab] train.bpe.ab-ru.ab: 22255 sents, 1648058 tokens, 9.97% replaced by <unk>
2020-03-22 12:48:45 | INFO | fairseq_cli.preprocess | [ru] Dictionary: 250000 types
2020-03-22 12:48:48 | INFO | fairseq_cli.preprocess | [ru] train.bpe.ab-ru.ru: 22255 sents, 733308 tokens, 0.00777% replaced by <unk>
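
The same observation can be cross-checked directly with the XLM-RoBERTa tokenizer from Hugging Face; the corpus path is a placeholder, and the counts will not match the fairseq binarization exactly.

```python
# Count how many XLM-R sentencepiece ids fall on the <unk> token for Abkhazian text.
from transformers import XLMRobertaTokenizer

tok = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
total = unknown = 0
with open("train.ab", encoding="utf-8") as f:       # placeholder path to raw Abkhazian text
    for line in f:
        ids = tok.encode(line.strip(), add_special_tokens=False)
        total += len(ids)
        unknown += sum(i == tok.unk_token_id for i in ids)
print(f"unk rate: {unknown / max(total, 1):.2%}")
```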

If you mean an actual dictionary, here’s a link: https://drive.google.com/file/d/1PbJhM2XPH2pFOeVdXknNAEDBQo7y8dfb/view?usp=drivesdk

Yes, I mean an actual dictionary that can be used by a program or processed as parallel text. Could you please elaborate on the structure of the linked dictionary? The first part is tagged with special terms like beekeeping, proverbs or adjectives. After page 400 it seems to change and restart from the letter H?

The dictionary consists of three volumes (1, 2 & 3). They were three separate PDF files that I joined together, but they are a continuation of each other; you will also find that the second volume ends at 1091 and the third volume starts there.

This is unclear to me; could you provide an example?

The beekeeping terms are tagged with “пчел.”, short for the Russian пчеловодческий термин (“beekeeping term”); that is how I understand the abbreviation.
I will try to parse this big dictionary with pdfminer. To begin with, we should delete the special terms, although they are very useful for domain differentiation.
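
Here is a first-pass sketch of that pdfminer step; the file name, the entry handling and any abbreviations beyond “пчел.” are assumptions about the dictionary layout and will need adjusting once the structure is confirmed.

```python
# Extract the dictionary text with pdfminer.six and drop lines carrying domain tags.
from pdfminer.high_level import extract_text

DOMAIN_TAGS = ("пчел.",)                      # beekeeping; extend with the other special-term tags

text = extract_text("ru_ab_dictionary.pdf")   # placeholder file name
entries = []
for line in text.splitlines():
    line = line.strip()
    if not line or line.isdigit():            # skip empty lines and bare page numbers
        continue
    if any(tag in line for tag in DOMAIN_TAGS):
        continue                              # drop domain-tagged entries for the first version
    entries.append(line)

with open("dictionary_entries.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(entries))
```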

After cleaning we have 1352553 monolingual Abkhazian sentences, which are used for pretraining and validation.
The Russian pretraining is based on 10 million sentences from http://data.statmt.org/news-crawl/
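
For reference, a minimal sketch of the kind of cleaning meant here: whitespace normalization, length filtering, deduplication and a small held-out validation split. The thresholds, split size and file names are assumptions.

```python
# Clean monolingual text for pretraining: normalize, filter by length, deduplicate, split.
seen, kept = set(), []
with open("mono.ab", encoding="utf-8") as f:          # placeholder: raw monolingual Abkhazian
    for line in f:
        line = " ".join(line.split())                 # normalize whitespace
        if not (3 <= len(line.split()) <= 200):       # drop fragments and extreme outliers
            continue
        if line in seen:                              # drop exact duplicates
            continue
        seen.add(line)
        kept.append(line)

valid, train = kept[:2000], kept[2000:]               # small held-out validation split
with open("pretrain.valid.ab", "w", encoding="utf-8") as f:
    f.write("\n".join(valid) + "\n")
with open("pretrain.train.ab", "w", encoding="utf-8") as f:
    f.write("\n".join(train) + "\n")
```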

Pretraining the xtransformer for 2 epochs yields a BLEU score of around 15: https://gitlab.com/Bachstelze/alp

The current bilingual training data amounts to 324959 examples, which split into:
126928 parallel sentences
198031 simple parallel paraphrases

I think we can’t simply use single-word translations from the dictionary as training data, because the model then tends to summarize with a short output. We could instead join multiple words into a parallel list; that is grammatically trivial, but it should be valid (see the sketch below).
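
A sketch of that idea, assuming the dictionary has already been reduced to (Russian word, Abkhazian word) pairs; the list length, separator and number of generated examples are arbitrary choices.

```python
# Build "parallel list" training examples by joining several random dictionary entries.
import random

def make_list_examples(pairs, words_per_example=5, n_examples=100000, seed=0):
    """pairs: list of (russian_word, abkhazian_word) tuples from the dictionary."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        sample = rng.sample(pairs, words_per_example)
        ru = ", ".join(src for src, _ in sample)      # comma-separated Russian word list
        ab = ", ".join(tgt for _, tgt in sample)      # the corresponding Abkhazian list
        examples.append((ru, ab))
    return examples
```

Longer lists push the model away from single-word outputs, but the order of entries carries no grammatical signal, so these examples mainly teach lexical coverage.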


@Bachstelze Could you push the simple parallel paraphrases (198k) to GitHub?
I came across some issues in the parallel sentences; I have filtered them down to 90k.

I have a question regarding which model is better to use for the next iteration:

Update:
The experiment results are faulty; many test texts were part of the training data.
Model A:

Preprocess: sentencepiece 32k + lowercase + allow long sentences

Data: 104k dictionary + 90k Parallel + 71k paraphrase

Model Type: SMT

Result:

Tokenized BLEU = 21.34, 50.4/27.8/19.4/14.6 (BP=0.849, ratio=0.860, hyp_len=13099, ref_len=15240)

Detokenized BLEU = 18.34, 42.8/23.3/15.9/12.0 (BP=0.877, ratio=0.884, hyp_len=10169, ref_len=11501)

Model B:

Preprocess: Moses + sentencepiece 32k + allow long sentences

Data: 104k dictionary + 90k Parallel + 71k paraphrase

Model Type: SMT

Result:

Tokenized BLEU = 27.70, 51.9/34.3/27.3/22.8 (BP=0.853, ratio=0.863, hyp_len=16261, ref_len=18839)

Detokenized BLEU = 15.75, 38.9/21.4/14.8/10.8 (BP=0.826, ratio=0.839, hyp_len=9955, ref_len=11860)

Model C:

Preprocess: Moses + sentencepiece 8k + allow long sentences

Data: 104k dictionary + 90k Parallel + 71k paraphrase

Model Type: SMT

Result:

Tokenized BLEU = 27.50, 53.2/35.3/28.0/23.0 (BP=0.830, ratio=0.843, hyp_len=19147, ref_len=22721)

Detokenized BLEU = 14.90, 39.7/21.5/14.2/9.9 (BP=0.801, ratio=0.818, hyp_len=9602, ref_len=11734)

(You could replicate model B using data.zip in the scripts folder.)
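
For anyone reproducing the numbers, a sketch of scoring with sacrebleu: the “tokenized” figure scores tokenized hypotheses against tokenized references with sacrebleu’s own tokenization switched off, the “detokenized” figure scores the final plain text. File names are placeholders.

```python
# Tokenized vs. detokenized BLEU with sacrebleu (paths are placeholders).
import sacrebleu

def bleu(hyp_path, ref_path, tokenize="13a"):
    with open(hyp_path, encoding="utf-8") as h, open(ref_path, encoding="utf-8") as r:
        hyps, refs = h.read().splitlines(), r.read().splitlines()
    return sacrebleu.corpus_bleu(hyps, [refs], tokenize=tokenize)

print("detokenized:", bleu("hyp.detok.ru", "ref.detok.ru"))               # default tokenizer
print("tokenized:  ", bleu("hyp.tok.ru", "ref.tok.ru", tokenize="none"))  # already tokenized text
```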


Very interesting that we can use BPE with SMT and Moses.

Could you push the simple parallel paraphrases (198k) to GitHub?

They are generated from the 126k parallel sentences and are potentially erroneous (considering that nearly 40k were filtered out). I am going to update them based on a clean corpus.

I have a question regarding which model is better to use for the next iteration:

It depends on the next step. If the quality is good enough to generate pivot training data, then we could use the model to initialize a (multilingual) model.
A language model will make the statistical Moses model more fluent, and both models should be tested with it; a rescoring sketch follows below.
There are a lot of other possible next iterations.
The MASS pretraining is still running and will be updated in the ALP repo after a few more epochs.
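
On the language-model point, a small sketch of how a KenLM model (for example one trained on the Russian news-crawl data with the KenLM command-line tools) could rescore candidate outputs for fluency; the model path and the candidate sentences are placeholders.

```python
# Pick the most fluent candidate according to a KenLM language model.
import kenlm

lm = kenlm.Model("ru_newscrawl.binary")    # placeholder: binary LM built with the KenLM tools

def most_fluent(candidates):
    # length-normalized log10 probability, so longer sentences are not unfairly penalized
    return max(candidates, key=lambda s: lm.score(s, bos=True, eos=True) / max(len(s.split()), 1))

print(most_fluent(["кандидат перевода один", "кандидат перевода два"]))
```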