Incorporating Linguistic Features in Training Data?

yash-srivastava19 · June 12, 2023, 12:07pm

Hi everyone!

I need some guidance on an NMT system I am building, for which I need to incorporate additional linguistic features (such as POS tags, dependency relations, etc.). I was able to find similar issues here in the forum, like this. I already have CoNLL-U format data for the source sentences and just want to give them to the encoder as features, but I am unable to do so. I tried saving each feature into a separate file and feeding the files to the ParallelInputter, but I still get a size mismatch error.

In the old version of OpenNMT, this could be done, as given here. Is it still possible to do that? I want to do something similar to this. I could also go the ParallelInputter route, but I can’t figure out what exactly I need to do to resolve the error.

Any help on this matter would be highly appreciated (as I need this for a project). I’m happy to clarify further if necessary.

Thanks in advance

guillaumekln · June 13, 2023, 8:11am

Hi,

Have you checked the example configuration file?

github.com

OpenNMT/OpenNMT-tf/blob/master/config/models/multi_features_transformer.py

"""Defines a Transformer model with multiple input features. For example, these
could be words, parts of speech, and lemmas that are embedded in parallel and
concatenated into a single input embedding.

The features are separate data files with separate vocabularies. The YAML
configuration file should look like this:

data:
  train_features_file:
    - features_1.txt
    - features_2.txt
    - features_3.txt
  train_labels_file: target.txt
  source_1_vocabulary: feature_1_vocab.txt
  source_2_vocabulary: feature_2_vocab.txt
  source_3_vocabulary: feature_3_vocab.txt
  target_vocabulary: target_vocab.txt
"""

import opennmt

This file has been truncated.

You would need to save the input features in different training files and build the corresponding vocabularies.
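
For CoNLL-U input like the data mentioned earlier in this thread, the split into separate feature files might look like the sketch below. It is an illustration only: the function name `conllu_to_feature_lines` is hypothetical, the choice of the UPOS and DEPREL columns is an assumption, and it skips multiword-token ranges and empty nodes so that every output line keeps exactly one value per word.

```python
def conllu_to_feature_lines(conllu_text):
    """Split CoNLL-U text into parallel lists of lines: words, POS tags, deprels.

    Hypothetical helper for illustration. Each returned list holds one
    space-joined line per sentence, with the same token count across lists.
    """
    words, pos_tags, deprels = [], [], []
    sentence = []  # rows (lists of 10 columns) of the sentence being read

    def flush():
        # Emit the buffered sentence as one line per feature stream.
        if sentence:
            # Spaces inside a word form would break alignment, so replace them.
            words.append(" ".join(c[1].replace(" ", "_") for c in sentence))
            pos_tags.append(" ".join(c[3] for c in sentence))
            deprels.append(" ".join(c[7] for c in sentence))
            sentence.clear()

    for line in conllu_text.splitlines():
        if not line.strip():
            flush()  # a blank line ends a sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 tab-separated columns: {line!r}")
        # Skip multiword-token ranges (ID like "1-2") and empty nodes ("1.1").
        if "-" in cols[0] or "." in cols[0]:
            continue
        sentence.append(cols)

    flush()  # handle a final sentence without a trailing blank line
    return words, pos_tags, deprels
```

Each returned list can then be written to its own file, one sentence per line, to serve as the `features_1.txt`, `features_2.txt`, and `features_3.txt` entries of the configuration above, with one vocabulary built per file.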

yash-srivastava19 · June 13, 2023, 9:44am

Hi Guillaume!

Yeah, I went through this file and tried to do exactly that. The problem, as I mentioned, was the size mismatch. I don’t understand it, since for each token there is only a single feature (such as a POS tag). That’s why I asked whether it is possible to concatenate inputs and features in the following way (as given here):

word|feat1|feat2|...

If this works, that would be great. Thanks!

guillaumekln · June 13, 2023, 10:02am

This input format is not supported in OpenNMT-tf.

yash-srivastava19 · June 13, 2023, 10:52am

I will try the ParallelInputter approach again. Otherwise, since this is available in OpenNMT-py, switching to that wouldn’t be a hassle either. Thanks for the help!
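
Before retrying, it may help to check that the feature files are aligned token by token. The thread does not identify the cause of the size mismatch, so this is only one possible explanation: the sketch assumes the parallel inputs need the same number of space-separated tokens on every line. The helper name `find_misaligned_lines` is hypothetical.

```python
from itertools import zip_longest


def find_misaligned_lines(*feature_streams):
    """Return [(line_number, token_counts), ...] for lines whose token counts differ.

    Hypothetical helper for illustration. Each argument is an iterable of
    lines (a list of strings or an open text file), one per feature file.
    """
    mismatches = []
    # zip_longest pads shorter streams with "", so a missing line counts as
    # zero tokens and is reported when the other files still have tokens.
    rows = zip_longest(*feature_streams, fillvalue="")
    for line_number, lines in enumerate(rows, start=1):
        counts = [len(line.split()) for line in lines]
        if len(set(counts)) > 1:
            mismatches.append((line_number, counts))
    return mismatches
```

Passing the open word, POS, and dependency files to this function and getting back an empty list would suggest the files are at least consistent in length; any reported line numbers point at sentences worth inspecting by hand.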