OpenNMT Forum

How do word features work?

Hi everyone, I’ve read the OpenNMT word features documentation, and it says that when multiple features are used, their embeddings are concatenated together. Does this mean that if the POS tag NNP has feature vector [1,2,3] and the named entity PERSON has feature vector [4,5,6], they are concatenated to become [1,2,3,4,5,6]? Also, the last sentence of the word features section in the documentation is:

Finally, the resulting merged embedding is concatenated to the word embedding.

I still don’t understand how the merged feature embedding is concatenated with the word embedding. Can you give me a better explanation?

Hi

Can you specify which OpenNMT version you are using?

EDIT: Hiding my answer since the question is about the opennmt-py version

Answer for opennmt-tf

I can offer some help with the TensorFlow version.

If you define your model with a ParallelInputter, you can concatenate your 3 features (POS, NE, Word) as input for your model, see:

https://github.com/OpenNMT/OpenNMT-tf/blob/master/config/models/multi_features_nmt.py

You need to specify your feature files in your YAML configuration:

data:
  train_features_file:
    - train_file_words
    - train_file_POS
    - train_file_NE

Finally, you need to make sure that your files are aligned: each word must have a corresponding POS tag and NE tag.

Doing that will lead to the concatenation you described: word_emb|POS_emb|NE_emb
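
A quick way to check the alignment requirement before training is to compare token counts line by line. This is only a minimal sketch: the function name find_misaligned and the sample sentences are made up for illustration, and it assumes whitespace-tokenized text with one sentence per line. It does not use any OpenNMT code.

```python
def find_misaligned(word_lines, pos_lines, ne_lines):
    """Return 1-based line numbers where the token counts differ.

    Hypothetical helper: each argument is a list of lines read from
    one of the three feature files (words, POS tags, NE tags).
    """
    if not (len(word_lines) == len(pos_lines) == len(ne_lines)):
        raise ValueError("the files have different numbers of lines")

    bad = []
    for i, (w, p, n) in enumerate(zip(word_lines, pos_lines, ne_lines), start=1):
        # Every word needs exactly one POS tag and one NE tag.
        if not (len(w.split()) == len(p.split()) == len(n.split())):
            bad.append(i)
    return bad


# Made-up sample data; the second NE line is one tag short.
words = ["John lives in Paris", "He works"]
pos = ["NNP VBZ IN NNP", "PRP VBZ"]
ne = ["PERSON O O GPE", "O"]

print(find_misaligned(words, pos, ne))  # [2]
```

With real data you would read the three files with open(...) and pass the lists of lines to the function.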

Hope this helps

I’m using the latest version of opennmt-py. I just want to understand how the feature embeddings are concatenated together and combined with the word embedding.

Yes.

They are merged the same way you did for the POS tag and named entity vectors.

Thanks for your reply. For example, if I have a pretrained word embedding and a pretrained feature embedding (i.e. a POS tag embedding), then for the token “John|NNP” the embedding vector of the word “John” is concatenated with the embedding vector of the feature “NNP”, and the result becomes the embedding vector for the token “John|NNP”.

Yes, that’s correct.
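
To make the concatenation concrete, here is a small sketch in plain Python. It is not the opennmt-py implementation: the function embed_token, the lookup tables, and the word vector [0.1, 0.2] are assumptions made up for illustration. Only the feature vectors [1,2,3] and [4,5,6] come from the question above.

```python
def embed_token(token, word_table, feature_tables, sep="|"):
    """Concatenate the word vector with one vector per feature.

    Hypothetical helper: word_table maps words to vectors, and
    feature_tables holds one lookup table per feature position.
    """
    word, *features = token.split(sep)
    if len(features) != len(feature_tables):
        raise ValueError("unexpected number of features in " + repr(token))

    # First merge the feature embeddings by concatenation...
    merged_features = []
    for value, table in zip(features, feature_tables):
        merged_features.extend(table[value])

    # ...then concatenate the merged result to the word embedding.
    return list(word_table[word]) + merged_features


word_table = {"John": [0.1, 0.2]}   # made-up word embedding
pos_table = {"NNP": [1, 2, 3]}      # POS vector from the question
ne_table = {"PERSON": [4, 5, 6]}    # NE vector from the question

vector = embed_token("John|NNP|PERSON", word_table, [pos_table, ne_table])
print(vector)  # [0.1, 0.2, 1, 2, 3, 4, 5, 6]
```

The resulting vector has length 2 + 3 + 3 = 8, so the layer that consumes it has to expect the sum of the word and feature embedding sizes.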

Hey @lengockyquang @guillaumekln
Do word features also work with the subword model implemented by sentencepiece?
Even after using u"\uFFE8" as the separator between words and features, the preprocessing script does not recognize it as a feature, i.e. the number of features is still 0.

Can you help me out with adding features to words?
Thanks
The code looks something like this:
token.text + u"\uFFE8" + token.pos_
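
For reference, a self-contained sketch of building such annotated lines and counting the features per token is shown below. The helper names annotate and count_features are hypothetical, and the sketch only inspects the text format; it does not call the preprocessing script and makes no claim about how that script handles sentencepiece output.

```python
SEP = "\uFFE8"  # the separator character used in the snippet above


def annotate(tokens_with_tags):
    """Join each (text, tag, ...) tuple with SEP and the tokens with spaces."""
    return " ".join(SEP.join((text, *tags)) for text, *tags in tokens_with_tags)


def count_features(line):
    """Return the number of features per token, or raise if it is inconsistent."""
    counts = {token.count(SEP) for token in line.split()}
    if len(counts) != 1:
        raise ValueError("tokens carry different numbers of features (or line is empty)")
    return counts.pop()


line = annotate([("John", "NNP"), ("runs", "VBZ")])
print(count_features(line))         # 1
print(count_features("John runs"))  # 0
```

Running count_features over the file that is actually passed to preprocessing can show whether the separator survived earlier steps such as subword segmentation.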