What linguistic features can be added in order to improve translation results?

villalbacanteroa · March 24, 2021, 5:52pm

Hello,

I have been able to train a Transformer model to perform bilingual translation successfully, achieving a BLEU score over 71. To improve the results further, I want to explore how adding linguistic features related to the source words affects translation quality and BLEU score. I have been diving into the OpenNMT-tf documentation and this forum, and I have seen that I can add features such as POS tags.

My questions are the following:

What kind of features does OpenNMT-tf support? I am planning to use spaCy to get the features, since I have trained language models for both the source and target languages, so I can easily obtain:

POS tagging: what is the format expected by OpenNMT?
Morphology: for example, for the word “I”, spaCy returns: Case=Nom|Number=Sing|Person=1|PronType=Prs. Does OpenNMT support this? What is the format expected?
Lemmatization: what is the format expected by OpenNMT? Should lemmas replace the source words?
Named Entity Recognition: does OpenNMT support this? And, if so, what is the format expected? Some kind of “Entity type + BILUO scheme” annotation such as /I-PER /L-PER?

Is there any kind of documentation explaining that?

As I have seen here, these features are added to the model by providing multiple parallel source files for training and validation, and by providing them as inputs during inference as well. If I set up my configuration file that way, can I train the Transformer model with the command onmt-main --model_type Transformer --config ./config/myconfig.yml train --with_eval, or do I have to make some additional changes?
By the way, can I expect to improve my results by adding linguistic features?

Thank you so much

guillaumekln · March 25, 2021, 1:49pm

This is a very high BLEU score. Do you really need to further improve the results?

This should be the first question. It’s possible it can slightly improve the results, but in my opinion it is not worth it and adds complexity. It would be better to add more data for example with back translation or data augmentation.

OpenNMT-tf has generic support for additional word features. There is no integration specifically for POS tags, morphology, etc. As long as you can represent your features with a label and have one label per input token, you can provide one additional file per feature, as used in this model for example:

github.com

OpenNMT/OpenNMT-tf/blob/master/config/models/multi_features_transformer.py

"""Defines a Transformer model with multiple input features. For example, these
could be words, parts of speech, and lemmas that are embedded in parallel and
concatenated into a single input embedding.

The features are separate data files with separate vocabularies. The YAML
configuration file should look like this:

data:
  train_features_file:
    - features_1.txt
    - features_2.txt
    - features_3.txt
  train_labels_file: target.txt
  source_1_vocabulary: feature_1_vocab.txt
  source_2_vocabulary: feature_2_vocab.txt
  source_3_vocabulary: feature_3_vocab.txt
  target_vocabulary: target_vocab.txt
"""

import tensorflow as tf


You can also read more about multiple input files here: Data — OpenNMT-tf 2.17.1 documentation
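
As a rough illustration of the "one label per input token" requirement described above, here is a minimal sketch that writes one file per feature from pre-annotated sentences and checks that every line has the same number of tokens in all files. The annotations are hard-coded here; in practice they could come from a tagger such as spaCy. The function names, the example tags, and the choice of words/POS/lemmas as the three features are assumptions for illustration and are not part of OpenNMT-tf.

```python
import os
import tempfile

# Each sentence is a list of (word, pos, lemma) tuples. Hard-coded for
# illustration; a tagger would normally produce these annotations.
SENTENCES = [
    [("I", "PRON", "I"), ("saw", "VERB", "see"), ("cats", "NOUN", "cat")],
    [("She", "PRON", "she"), ("sleeps", "VERB", "sleep")],
]


def write_feature_files(sentences, paths):
    """Write one space-separated file per feature, one sentence per line."""
    handles = [open(path, "w", encoding="utf-8") for path in paths]
    try:
        for sentence in sentences:
            # zip(*sentence) regroups the tuples into one sequence per feature.
            for handle, labels in zip(handles, zip(*sentence)):
                handle.write(" ".join(labels) + "\n")
    finally:
        for handle in handles:
            handle.close()


def check_alignment(paths):
    """Return True if all files have the same token count on every line."""
    columns = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            columns.append([len(line.split()) for line in f])
    return all(counts == columns[0] for counts in columns)


# Hypothetical output location; the file names mirror the YAML example above.
out_dir = tempfile.mkdtemp()
feature_paths = [
    os.path.join(out_dir, name)
    for name in ("features_1.txt", "features_2.txt", "features_3.txt")
]
write_feature_files(SENTENCES, feature_paths)
print(check_alignment(feature_paths))
```

Under this scheme, a morphology string such as Case=Nom|Number=Sing|Person=1|PronType=Prs would presumably work as a single label, since it contains no whitespace. Each feature file would also need its own vocabulary file, as in the configuration shown above.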

villalbacanteroa · March 25, 2021, 4:13pm

Yes, it is a high BLEU. I am training the Transformer Base between two Romance languages (Catalan->Spanish) and they are pretty similar. What’s more, training with guided alignments has also improved the results. It’s the first time I am using OpenNMT and doing NMT, so I am trying to experiment with multiple options. My dataset is about 3.5 pair of sentences, so I’ll start by including more open datasets that I have for these two languages, and then maybe I’ll experiment with back-translation and data augmentation as you suggest. However, it’s good to know how I can add input features as well.

Thank you so much for your help.

Nart · March 29, 2021, 8:54pm

@guillaumekln

This should be the first question. It’s possible it can slightly improve the results, but in my opinion it is not worth it and adds complexity. It would be better to add more data for example with back translation or data augmentation.

What do you mean by data augmentation?

guillaumekln · March 30, 2021, 7:25am

I mean producing multiple input examples from the same sentence. For example using subword regularization (search for BPE dropout, SentencePiece sampling), or random noise (change case, remove punctuation, etc.).
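
A minimal sketch of the random-noise variant mentioned above, assuming the training data is available as (source, target) string pairs. The function names and probabilities are hypothetical, and this is not an OpenNMT-tf API. Subword regularization is not covered here, since it depends on the subword tokenizer in use.

```python
import random
import string

# Translation table that deletes ASCII punctuation only. Language-specific
# marks such as "¿" or "·" are left untouched by this table.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def add_noise(sentence, rng, p_lower=0.3, p_strip_punct=0.3):
    """Return a noisy copy of `sentence` (it may come back unchanged)."""
    if rng.random() < p_lower:
        sentence = sentence.lower()
    if rng.random() < p_strip_punct:
        sentence = sentence.translate(_STRIP_PUNCT)
        # Collapse any double spaces left behind by removed punctuation.
        sentence = " ".join(sentence.split())
    return sentence


def augment(pairs, copies=2, seed=0):
    """Yield each original pair followed by `copies` noisy-source variants."""
    rng = random.Random(seed)
    for source, target in pairs:
        yield source, target
        for _ in range(copies):
            # Only the source side is perturbed; the target stays clean.
            yield add_noise(source, rng), target


if __name__ == "__main__":
    for src, tgt in augment([("Hello, World!", "Hola, món!")]):
        print(src, "->", tgt)
```

Keeping the target unchanged means the model sees several input variants that map to the same clean output. Note that stripping ASCII punctuation also removes apostrophes, which may not be desirable for every language pair.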

gaussmao · February 1, 2024, 6:47am

Hi, great answer above. I am also wondering whether there is a way in OpenNMT to make the trained model support bidirectional translation (i.e., without needing to train another model for the reverse direction)?