Using OpenNMT with XLM-R Embeddings

nlpdude · March 17, 2020, 11:41am

I am going to build a Catalan-to-Catalan translation system using OpenNMT. I wonder if there is a way to use embeddings from XLM-R. Please help me with this issue.

Bachstelze · March 17, 2020, 6:29pm

Hey Johnas Solomon,
as far as I know, you can only use word embeddings with OpenNMT, not a pretrained sequence encoder. XLM, however, was initially used as a pretrained encoder for low-resource languages. XLM-R (Unsupervised Cross-lingual Representation Learning at Scale) has nearly the same transformer architecture, which should allow you to use the training script from XLM with adapted settings:

Depending on the version:
Base (L = 12, H = 768, A = 12, 270M params) and Large (L = 24, H = 1024, A = 16, 550M params)

I am going to build a Catalan-to-Catalan translation system

What is your aim? It sounds like an automatic grammar correction system.

Greetings from the translation space

nlpdude · March 18, 2020, 6:20am

Thank you for your reply, @Bachstelze. My aim is to translate from Catalan to Catalan Sign Language. The grammar of the two languages is different (e.g. Input -> He sells food. Output (sign language sentence) -> Food he sells).

Could you please elaborate on your answer? I’m very new to the field of deep learning.

Bachstelze · March 18, 2020, 8:46am

My answer is that you can use different versions of XLM for translation with the PyTorch or fairseq framework. Moreover, there are pretrained multilingual seq2seq models. For OpenNMT you can use word embeddings like fastText, which has pretrained embeddings for Catalan. With word embeddings you are going to have a fixed vocabulary, but I have seen a translation implementation that uses fastText's ability to construct embeddings as the sum of character n-gram vectors. The current OpenNMT implementation should be good enough if you are only interested in translating standard sentences with no out-of-vocabulary words.
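To illustrate the character n-gram idea mentioned above, here is a minimal toy sketch. It is not the fastText implementation: the table of n-gram vectors is random rather than trained, and `DIM`, `BUCKETS`, `char_ngrams`, `fnv1a` and `embed` are names made up for this example. It only shows how a vector can be built for a word that is not in a fixed vocabulary.

```python
import random

DIM = 4         # toy embedding size (assumption; trained models are much larger)
BUCKETS = 1000  # toy number of hash buckets for n-grams (assumption)

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of the word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]

def fnv1a(text):
    """32-bit FNV-1a hash, stable across runs (unlike Python's built-in hash())."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h

# Stand-in for a trained n-gram table: one random vector per bucket.
rng = random.Random(0)
ngram_table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(BUCKETS)]

def embed(word):
    """Build a vector for any word by summing the vectors of its character n-grams."""
    vec = [0.0] * DIM
    for gram in char_ngrams(word):
        row = ngram_table[fnv1a(gram) % BUCKETS]
        vec = [v + r for v, r in zip(vec, row)]
    return vec

if __name__ == "__main__":
    print(char_ngrams("he"))
    print(embed("menjar"))
```

Because every word is decomposed into n-grams, `embed` returns a vector even for words never seen during training, which is what removes the fixed-vocabulary limitation.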

Depending on your dataset (amount of monolingual and parallel text) and model, you could use copied monolingual data or back-translation to improve your accuracy and quality.
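As a rough sketch of the two augmentation ideas: copied monolingual data pairs each target-side sentence with itself, while back-translation pairs it with a synthetic source produced by a target-to-source model. The function name `augment` and the `reverse_translate` callable are hypothetical placeholders, and no real translation model is involved here.

```python
def augment(parallel_pairs, target_monolingual, reverse_translate=None):
    """Extend (source, target) pairs with synthetic pairs from target-side text.

    reverse_translate is a hypothetical callable mapping a target sentence to a
    source sentence. If it is None, the target sentence is copied to the source side.
    """
    augmented = list(parallel_pairs)
    for sentence in target_monolingual:
        if reverse_translate is None:
            # Copied monolingual data: the sentence is its own "translation".
            augmented.append((sentence, sentence))
        else:
            # Back-translation: synthetic source, real target.
            augmented.append((reverse_translate(sentence), sentence))
    return augmented

if __name__ == "__main__":
    pairs = [("He sells food.", "Food he sells")]
    mono = ["Water she drinks"]
    print(augment(pairs, mono))
```

In both cases the target side stays real text, so the decoder still learns from genuine target sentences.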

Have a look at the use-cases of Embeddings and the illustrated transformer for a general overview. Let me know if you have a concrete question.

nlpdude · March 18, 2020, 8:58am

Thank you @Bachstelze. It makes sense. Thank you for your helpful answer. I may seek your advice in the future.

nlpdude · April 9, 2020, 8:01am

Hello @Bachstelze. As you suggested, I’m trying to build a translation model using XLM with fairseq. However, the whole system is a little complicated and lacks documentation on how to use a custom dataset. I have two more questions (sorry if they are silly, but I’m a little confused and I need your valuable guidance):

Would you please kindly help/share a practical example?
There is an example of how to build a sentiment analyzer using transformers RoBERTa, so can I build a custom NMT model in a similar way?

Thank you.

Bachstelze · April 9, 2020, 9:46am

In most use cases the raw translation data is stored in a separate file for each language. Sentences are separated by a newline (“\n”), and the files have the same number of rows, so every sentence sits on the same row as its translation.
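A small sketch of that layout, assuming hypothetical file names `train.src` and `train.tgt` and using the example sentence from earlier in the thread. The helper names are made up for illustration; the read function refuses to continue when the row counts differ.

```python
import tempfile
from pathlib import Path

def write_parallel(pairs, src_path, tgt_path):
    """Write (source, target) pairs to two line-aligned plain-text files."""
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for source, target in pairs:
            src.write(source.strip() + "\n")
            tgt.write(target.strip() + "\n")

def read_parallel(src_path, tgt_path):
    """Read both files back and check that they have the same number of rows."""
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"row count mismatch: {len(src_lines)} source vs {len(tgt_lines)} target"
        )
    return list(zip(src_lines, tgt_lines))

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        src_file = Path(folder) / "train.src"  # hypothetical file name
        tgt_file = Path(folder) / "train.tgt"  # hypothetical file name
        write_parallel([("He sells food.", "Food he sells")], src_file, tgt_file)
        print(read_parallel(src_file, tgt_file))
```

Running a check like this before preprocessing catches misaligned files early, since a single missing line shifts every later sentence pair.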

XLM is very experimental and will stay that way, considering the rise of fully pretrained architectures. If you want practical results, I would suggest using the fastText word embeddings in OpenNMT.

Would you please kindly help/share a practical example?

For Catalan word embeddings, download the linked fastText embedding and start with Step 1: Preprocess the data:

This could help for XLM:

There is an example of how to build a sentiment analyzer using transformers RoBERTa, so can I build a custom NMT model in a similar way?

You could try to initialize the encoder with XLM-RoBERTa like:

Greetings from the translation space