OpenNMT Forum

What linguistic features can be added in order to improve translation results?

Hello,

I have been able to train a Transformer model to perform bilingual translation successfully, achieving a BLEU score over 71. To improve the results further, I want to explore how adding linguistic features for the source words affects translation quality and the BLEU score. Going through the OpenNMT-tf documentation and this forum, I have seen that I can add features such as POS tags.

My questions are the following:

  1. What kind of features does OpenNMT-tf support? I am planning to use spaCy to get the features, since I have trained spaCy models for both the source and target languages, so I can easily obtain:
  • POS tagging: what is the format expected by OpenNMT?
  • Morphology: for example, for the word “I” spaCy returns: Case=Nom|Number=Sing|Person=1|PronType=Prs. Does OpenNMT support this? What is the format expected?
  • Lemmatization: what is the format expected by OpenNMT? Lemmas replacing the source words?
  • Named Entity Recognition: does OpenNMT support this? And, if so, what is the format expected? Some kind of “entity type + BILUO scheme” annotation such as /I-PER /L-PER?

Is there any kind of documentation explaining that?

  2. As I have seen here, these features are added to the model by providing multiple parallel source files for training and validation, and by providing them as inputs during inference as well. If I set up my configuration file that way, can I train the Transformer model with the command onmt-main --model_type Transformer --config ./config/myconfig.yml train --with_eval or do I have to make some additional changes?

  3. Btw, can I expect to improve my results by adding linguistic features?

Thank you so much

This is a very high BLEU score. Do you really need to further improve the results?

This should be the first question. It’s possible it can slightly improve the results, but in my opinion it is not worth it and adds complexity. It would be better to add more data for example with back translation or data augmentation.

OpenNMT-tf has generic support for additional word features. There is no integration specifically for POS tags, morphology, etc. As long as you can represent each feature as a label, with exactly one label per input token, you can provide one additional file per feature, as used in this model for example:

You can also read more about multiple input files here: Data — OpenNMT-tf 2.17.1 documentation
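
A minimal sketch of the "one label per input token, one file per feature" layout described above. It assumes the annotations are already available as (token, label) pairs, for example from a tagger such as spaCy; the function names and file paths are hypothetical and not part of OpenNMT-tf. A morphology string like Case=Nom|Number=Sing|Person=1|PronType=Prs contains no whitespace, so under this assumption it can be used as a single label as-is.

```python
def to_parallel_lines(annotated_sentence):
    """Turn [(token, label), ...] into two space-separated, aligned lines."""
    tokens = [token for token, _ in annotated_sentence]
    labels = [label for _, label in annotated_sentence]
    for item in tokens + labels:
        # A token or label containing whitespace would break the
        # one-label-per-token alignment between the two files.
        if not item or any(ch.isspace() for ch in item):
            raise ValueError(f"empty or whitespace-containing item: {item!r}")
    return " ".join(tokens), " ".join(labels)


def write_feature_files(annotated_sentences, tokens_path, labels_path):
    """Write one sentence per line to two parallel files (hypothetical paths)."""
    with open(tokens_path, "w", encoding="utf-8") as tokens_file, \
         open(labels_path, "w", encoding="utf-8") as labels_file:
        for sentence in annotated_sentences:
            token_line, label_line = to_parallel_lines(sentence)
            tokens_file.write(token_line + "\n")
            labels_file.write(label_line + "\n")


# Illustrative data only: POS labels as a tagger might produce them.
example = [("I", "PRON"), ("visited", "VERB"), ("Barcelona", "PROPN")]
token_line, label_line = to_parallel_lines(example)
print(token_line)
print(label_line)
```

The same pairing would have to be produced for the training, validation, and inference inputs, so that every line in a feature file has as many labels as the matching line in the token file has tokens.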


Yes, it is a high BLEU. I am training the Transformer Base model between two Romance languages (Catalan->Spanish), and they are quite similar. What’s more, training with guided alignment has also improved the results. It’s the first time I am using OpenNMT and doing NMT, so I am trying to experiment with multiple options. My dataset is about 3.5 pair of sentences, so I’ll start by including more open datasets that I have for these two languages, and then maybe I’ll experiment with back-translation and data augmentation as you suggest. However, it’s good to know how I can add input features as well.

Thank you so much for your help.

@guillaumekln

This should be the first question. It’s possible it can slightly improve the results, but in my opinion it is not worth it and adds complexity. It would be better to add more data for example with back translation or data augmentation.

What do you mean by data augmentation?

I mean producing multiple input examples from the same sentence, for example by using subword regularization (search for BPE dropout, SentencePiece sampling) or random noise (change case, remove punctuation, etc.).
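
The random-noise variant can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the probabilities, and the choice to add noise to the source side while keeping the target unchanged are all hypothetical, and punctuation removal only covers ASCII punctuation. Subword regularization is not shown here because it depends on a trained subword model.

```python
import random
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def add_noise(sentence, rng, p_case=0.5, p_punct=0.5):
    """Return a noisy copy of `sentence` (lowercasing and/or dropping punctuation)."""
    variant = sentence
    if rng.random() < p_case:
        variant = variant.lower()
    if rng.random() < p_punct:
        variant = variant.translate(_PUNCT_TABLE)
    # Collapse any whitespace left behind by removed punctuation.
    return " ".join(variant.split())


def augment(source, target, n_variants, seed=0):
    """Produce the original pair plus `n_variants` noisy-source pairs."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    pairs = [(source, target)]
    for _ in range(n_variants):
        pairs.append((add_noise(source, rng), target))
    return pairs


for src, tgt in augment("Bon dia, món!", "Buenos días, mundo!", n_variants=3):
    print(src, "|||", tgt)
```

Some variants may come out identical to the original sentence; deduplicating the pairs before training is one option.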
