Transformers on low-resource corpora

Hi… I was wondering if any of the researchers here have experience training the opennmt-tf Transformer models on very small parallel corpora (~30k sentence pairs). Until now I have only worked with seq2seq models, and I have been getting moderate performance on the same dataset.

I tried training an opennmt Transformer model with the default settings and the Adam optimizer. The model doesn't converge: the loss stays more or less the same even after 5000 training steps, and the output clearly shows that the model hasn't learned anything in my case.

I would appreciate any tips or tweaks from the developers and researchers on how to use the model properly in low-resource cases. Thanks

Hey San!
The Transformer is quite sensitive to hyperparameter settings and is usually trained on large datasets.
What is your language pair? You could fine-tune mBART if your languages are among its pretrained ones.
Did you try generating back-translations or other data augmentations?
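If you have target-side monolingual data, the back-translation idea is roughly the following. This is only a minimal sketch of the data-augmentation step; `translate_reverse` is a placeholder for whatever target-to-source model you have, not a specific OpenNMT call.

```python
# Minimal back-translation sketch: synthetic source sentences are produced by a
# reverse (target -> source) model and paired with the real target sentences.
# `translate_reverse` is a placeholder for your own reverse model, not an
# OpenNMT API.

def translate_reverse(target_sentence: str) -> str:
    """Placeholder: run a trained target -> source model here."""
    raise NotImplementedError

def back_translate(mono_target_path: str, out_src_path: str, out_tgt_path: str) -> None:
    with open(mono_target_path, encoding="utf-8") as mono, \
         open(out_src_path, "w", encoding="utf-8") as src_out, \
         open(out_tgt_path, "w", encoding="utf-8") as tgt_out:
        for line in mono:
            tgt = line.strip()
            if not tgt:
                continue
            synthetic_src = translate_reverse(tgt)  # synthetic source side
            src_out.write(synthetic_src + "\n")     # goes into the source file
            tgt_out.write(tgt + "\n")               # real target stays the target

# The resulting synthetic pairs are simply concatenated with the human-translated
# corpus before training the forward model.
```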
Greetings

Hello!

Thanks for responding. No, I haven't used any back-translated data here; all the translations are human-generated sentences. My language pair is Kannada–Telugu, from the Dravidian family. Does mBART fine-tuning work for any language? If you have other ideas on which OpenNMT models can be used in a low-resource setting (other than Transformers), please do share. Thanks!

The mBART authors write that the model can also be fine-tuned on languages that were not in the pretraining, but I ran into a lot of out-of-vocabulary tokens with a language that wasn't covered. (By the way, are you interested in training on a Russian–Abkhazian parallel corpus?)
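Before committing to mBART, you can get a rough idea of how well its subword vocabulary covers Kannada and Telugu by counting unknown tokens with the Hugging Face tokenizer. A small sketch; the checkpoint name and the sample sentence are only examples, swap in your own text:

```python
# Rough check of how well the mBART subword vocabulary covers a new language:
# tokenize a sample and count how many tokens come out as <unk>.
from transformers import MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")

sample = "ಕನ್ನಡ ವಾಕ್ಯದ ಉದಾಹರಣೆ"  # example Kannada sentence; use your own data here
ids = tokenizer(sample).input_ids
unk = sum(1 for i in ids if i == tokenizer.unk_token_id)
print(f"{unk}/{len(ids)} tokens map to <unk>")
```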

You could try to initialise the Transformer with fastText word embeddings, as in the tutorial "OpenNMT Pytorch - Using FastText Pretrained Embedding Tutorial for beginner".
There are pretrained embeddings for Kannada and Telugu, so you don’t have to train them yourself.
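If the tutorial steps don't map cleanly onto your setup, the core of it is just aligning the fastText vectors with your OpenNMT vocabulary. A minimal sketch, assuming a `.vec` text file (fastText publishes these, e.g. for Kannada and Telugu) and a vocabulary file with one token per line; file names are placeholders:

```python
# Minimal sketch: build an embedding matrix for an existing OpenNMT vocabulary
# from a fastText .vec text file. The vocab file format (one token per line)
# is an assumption; adapt it to however your vocabulary is stored.
import numpy as np
import torch

def load_fasttext_vec(path):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        next(f)  # first line of a .vec file is "<num_words> <dim>"
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding_matrix(vocab_path, vec_path, dim=300):
    vectors = load_fasttext_vec(vec_path)
    tokens = [l.strip() for l in open(vocab_path, encoding="utf-8") if l.strip()]
    # tokens missing from fastText keep a small random initialisation
    matrix = np.random.normal(scale=0.1, size=(len(tokens), dim)).astype(np.float32)
    hits = 0
    for i, tok in enumerate(tokens):
        if tok in vectors:
            matrix[i] = vectors[tok]
            hits += 1
    print(f"covered {hits}/{len(tokens)} vocabulary tokens")
    return torch.from_numpy(matrix)

# emb = build_embedding_matrix("src_vocab.txt", "cc.kn.300.vec")
# torch.save(emb, "src_embeddings.pt")  # then point the training config at this file
```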

To enlarge your dataset, have a look at https://glosbe.com/kn/te
Maybe they will give you access to 200k sentence pairs if you ask nicely.

Thank you very much for all the links. I haven't tried initializing with fastText embeddings; I will do that now. :slight_smile:

Hey San Chi,
how was your Labour Day?
Were you able to make use of the links?
The AI4Bharat-IndicNLP Corpus could be interesting for you.

Greetings from the translation space
https://bachstelze.gitlab.io/multisource/