How to use source features in OpenNMT-py correctly?

Hi, I would like to train a Transformer model and tag source subwords with additional information. For my first research experiment I have added POS, Named Entity and domain information (generic/in-domain).

My training data is tokenized.

Here is an example of a source sentence:
“Austria – works at hospital in Spittal / Drau , Carinthia”

And now in subwords:
['▁Austria', '▁–', '▁works', '▁at', '▁hospital', '▁in', '▁Sp', 'ittal', '▁/', '▁Dra', 'u', '▁,', '▁Carinthia']

My input features are:

  • Feature 1: POS:
    PROPN PUNCT VERB ADP NOUN ADP PROPN PROPN SYM PROPN PROPN PUNCT PROPN
  • Feature 2: NE:
    L O O O O O R R R R R O P
    (L- Location, R-Organization, P-Person, O-Other)
  • Feature 3: domain:
    0 0 0 0 0 0 0 0 0 0 0 0 0
    (0-generic domain, 1-in-domain segment)

Each input feature line has the same number of tokens as the source segment in subwords: there are 13 subwords in total, and each feature line has 13 labels.
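To make sure the data really is aligned before building the vocabulary, a quick sanity check can be run outside of OpenNMT-py (the strings below are just the example segment from this post; in practice you would iterate over the parallel files):

```python
# Sanity check: every feature line must have exactly as many labels
# as the source line has subwords.
src = "▁Austria ▁– ▁works ▁at ▁hospital ▁in ▁Sp ittal ▁/ ▁Dra u ▁, ▁Carinthia"
pos = "PROPN PUNCT VERB ADP NOUN ADP PROPN PROPN SYM PROPN PROPN PUNCT PROPN"
ne = "L O O O O O R R R R R O P"
dom = "0 0 0 0 0 0 0 0 0 0 0 0 0"

n_subwords = len(src.split())
for name, feat in [("POS", pos), ("NE", ne), ("domain", dom)]:
    n_labels = len(feat.split())
    assert n_labels == n_subwords, f"{name}: {n_labels} labels vs {n_subwords} subwords"
print(f"All features aligned: {n_subwords} tokens each")
```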

First, I would like to learn the vocabulary and use the onmt_build_vocab command. For this post, I learned the vocabulary with the single segment from the example, no other data has been used.

I use the following transforms: [sentencepiece, filtertoolong, filterfeats, inferfeats].
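For reference, here is roughly how my build_vocab config declares the features (paths, corpus name, and feature names are placeholders; this follows the v2.x `src_feats` config layout as I understand it from the source-features documentation, so treat it as a sketch rather than the exact file):

```yaml
# build_vocab config sketch -- paths and names are placeholders
save_data: run/example
src_vocab: run/example.vocab.src
tgt_vocab: run/example.vocab.tgt
data:
    corpus_1:
        path_src: data/train.src
        path_tgt: data/train.tgt
        src_feats:
            pos: data/train.pos
            ne: data/train.ne
            domain: data/train.domain
        transforms: [sentencepiece, filtertoolong, filterfeats, inferfeats]
```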

The onmt_build_vocab runs through without any other errors.

This is the output from vocab file for feature 2:

O 4
R 4
(null) 3
L 1
P 1

  1. Why am I getting a (null) feature if every subword is properly tagged?
  2. There are 5 occurrences of the Organization tag (“R”) in the training data. Why does the vocab file say there are only 4? There are similar inconsistencies in the counts for the other features as well, e.g. for POS.
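For comparison, the counts one would expect for feature 2 can be recomputed directly from the raw tags (a quick check independent of OpenNMT-py):

```python
from collections import Counter

# Recompute the expected NE (feature 2) vocabulary counts for the
# single example segment, to compare against onmt_build_vocab's output.
ne_tags = "L O O O O O R R R R R O P".split()
counts = Counter(ne_tags)
for tag, n in counts.most_common():
    print(tag, n)
# Expected from the raw tags: O 6, R 5, L 1, P 1 -- and no "(null)" entry
# at all, so the reported counts (O 4, R 4, (null) 3) do not match.
```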

@anderleich do you have time to debug this with @maxiek0071 ?
Maybe making sure source features work fine with the new format is a better starting point.

Hi @maxiek0071 ,

Source features support is broken for the newest version (v3.0) of OpenNMT-py. I’m currently adapting the code to make source features available again (see [WIP] Support target features by anderleich · Pull Request #2289 · OpenNMT/OpenNMT-py · GitHub).

I don’t know whether you are using v3.0 or an older version. In either case, I’ll try to debug the newest version I’m working on to check that it is indeed generating a correct vocabulary file for the features. One thing that comes to mind is that you are using SentencePiece instead of BPE to tokenize your data. I have not seen these issues with BPE-tokenized data; however, it is possible there is a bug with data tokenized with SentencePiece.

Hi @anderleich,

Thank you for the update. I am using OpenNMT-py version 2.3.0, as we have 100+ models trained with this version.

If you need any more information, please let me know.

Thank you.