How do word features work?

lengockyquang · May 16, 2019, 4:58am

Hi everyone, I’ve read the OpenNMT word features documentation, and it says that when we use multiple features, they are concatenated together. Does that mean that if we have the POS tag NNP with feature vector [1,2,3] and the named entity PERSON with feature vector [4,5,6], they are concatenated and become [1,2,3,4,5,6]? The last sentence of the word features section in the documentation is:

Finally, the resulting merged embedding is concatenated to the word embedding.

I still don’t understand how the merged feature embedding is concatenated with the word embedding. Can you give me a better explanation?

valentinmace · May 16, 2019, 8:42am

Hi

Can you specify which OpenNMT version you are using?

EDIT: Hiding my answer since the question is for opennmt-py version

Answer for opennmt-tf

I can provide you some help with the tensorflow version

If you define your model with a ParallelInputter you could concatenate your 3 features (POS, NE, Word) as input for your model, see:

https://github.com/OpenNMT/OpenNMT-tf/blob/master/config/models/multi_features_nmt.py

You need to specify your features files in your yml configuration:

data:
  train_features_file:
    - train_file_words
    - train_file_POS
    - train_file_NE

Finally, you need to make sure that your files are aligned: each word must have a corresponding POS tag and NE tag.

Doing that will lead to the concatenation you described: word_emb|POS_emb|NE_emb
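To illustrate the alignment requirement, here is a minimal sketch that checks whether three parallel lines have the same number of tokens. The helper name `check_aligned` and the sample lines are hypothetical and only assume whitespace-separated tokens; this is not part of OpenNMT-tf.

```python
def check_aligned(words_line, pos_line, ne_line):
    """Return True if the three parallel lines have the same number of tokens."""
    n_words = len(words_line.split())
    return n_words == len(pos_line.split()) == len(ne_line.split())


# Hypothetical sample lines, one per features file.
words = "John lives in Paris"
pos = "NNP VBZ IN NNP"
ne = "PERSON O O LOCATION"

aligned = check_aligned(words, pos, ne)               # one POS and one NE per word
misaligned = check_aligned(words, "NNP VBZ IN", ne)   # one POS tag is missing
print(aligned, misaligned)
```

Running a check like this over every line of the three files before training would catch misaligned examples early.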

Hope this helps

lengockyquang · May 16, 2019, 9:00am

I’m using the latest version of opennmt-py. I just want to understand how the feature embeddings are concatenated together and how the features are combined with their word.

guillaumekln · May 16, 2019, 9:37am

Yes.

They are merged the same way you did for the POS tag and named entity vectors.

lengockyquang · May 16, 2019, 10:01am

Thanks for your reply. For example, if I have a pretrained word embedding and a pretrained feature embedding (i.e. a POS tag embedding), then for the token “John|NNP” I concatenate the embedding vector of the word “John” with the embedding vector of the feature “NNP”, and the result becomes the embedding vector for the token “John|NNP”.

guillaumekln · May 16, 2019, 10:04am

Yes, that’s correct.
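The two steps confirmed in this thread can be sketched in plain Python: first the feature embeddings are concatenated into a merged feature embedding, then that result is concatenated to the word embedding. The lookup tables, their values, and the helper `embed_token` are made up for illustration; in a real model these vectors are learned or loaded, and this is not the opennmt-py implementation.

```python
# Toy lookup tables; the values are made up for illustration only.
word_emb = {"John": [0.1, 0.2]}
pos_emb = {"NNP": [1, 2, 3]}
ne_emb = {"PERSON": [4, 5, 6]}


def embed_token(word, pos, ne):
    """Build the input vector for one token from its word and feature embeddings."""
    # Step 1: merge the feature embeddings by concatenation.
    merged_features = pos_emb[pos] + ne_emb[ne]
    # Step 2: concatenate the merged feature embedding to the word embedding.
    return word_emb[word] + merged_features


vector = embed_token("John", "NNP", "PERSON")
print(vector)  # [0.1, 0.2, 1, 2, 3, 4, 5, 6]
```

The resulting vector has as many dimensions as the word embedding plus all feature embeddings combined (here 2 + 3 + 3 = 8).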

Dhanasekar-S · June 30, 2019, 11:47pm

Hey @lengockyquang @guillaumekln
Do the word features also work with the subword model implemented by SentencePiece?
Even after using u"\uFFE8" as a separator between words and features, the preprocessing script does not recognize it as a feature, i.e. the number of features is still 0.

Can you help me out with adding features to words?
Thanks
The code looks something like this…
token.text + u"\uFFE8" + token.pos_
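
The line above joins a single token with a single feature. As a minimal sketch of building a whole annotated line the same way, the hypothetical helper `annotate` below takes plain lists of tokens and tags (standing in for `token.text` and `token.pos_`) and joins them with the same separator. It only shows the string construction; it does not verify whether the preprocessing script detects the features. With subwords, presumably every piece would need its own feature appended, but that is an assumption, not something confirmed in this thread.

```python
SEP = u"\uFFE8"  # separator character quoted in the post above


def annotate(tokens, tags):
    """Join each token with its feature and return one space-separated line."""
    if len(tokens) != len(tags):
        raise ValueError("each token needs exactly one feature")
    return " ".join(tok + SEP + tag for tok, tag in zip(tokens, tags))


# Hypothetical tokens and POS tags for one sentence.
line = annotate(["John", "lives", "here"], ["NNP", "VBZ", "RB"])
print(line)
```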