Incorporating Linguistic Features in Training Data?

Hi everyone!

I need some guidance on an NMT system I am building, for which I need to incorporate additional linguistic features (such as POS tags, dependency relations, etc.). I found similar issues here in the forum, like this. I already have CoNLL-U format data for the source sentences and just want to give them to the encoder as features, but I am unable to do so. I tried saving each feature into a separate file and feeding the files to the ParallelInputter, but I still got a size mismatch error.

In the old version of OpenNMT this could be done, as given here. Is it still possible? I want to do something similar to this. I could also go the ParallelInputter route, but I can’t figure out what exactly I need to do to resolve the error.

Any help on this matter would be highly appreciated (I need this for a project). I’m happy to clarify further if necessary.

Thanks in advance

Hi,

Have you checked the example configuration file?

You would need to save the input features in different training files and build the corresponding vocabularies.
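As a rough illustration of that step, here is a minimal Python sketch that turns CoNLL-U text into one line per sentence for each stream (words, POS tags, dependency relations), so that every stream has the same number of tokens on every line. The function names and the choice of the UPOS and DEPREL columns are assumptions made for this example; they are not part of OpenNMT-tf. The sketch skips multiword token ranges (IDs like `2-3`) and empty nodes (IDs like `1.1`), since those rows carry no tags of their own. Whether that is what caused the size mismatch in this thread is not confirmed, but it is one thing worth checking.

```python
def read_conllu(text):
    """Yield each sentence as a list of (form, upos, deprel) tuples."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        token_id = cols[0]
        # Skip multiword token ranges ("2-3") and empty nodes ("1.1").
        if "-" in token_id or "." in token_id:
            continue
        # CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        sentence.append((cols[1], cols[3], cols[7]))
    if sentence:
        yield sentence


def to_parallel_lines(text):
    """Return three lists of lines (words, POS, deprel), aligned token by token."""
    words, pos, deprel = [], [], []
    for sent in read_conllu(text):
        # Replace spaces inside a form so whitespace tokenization stays aligned.
        words.append(" ".join(form.replace(" ", "_") for form, _, _ in sent))
        pos.append(" ".join(upos for _, upos, _ in sent))
        deprel.append(" ".join(rel for _, _, rel in sent))
    return words, pos, deprel


SAMPLE = (
    "# text = I don't know\n"
    "1\tI\tI\tPRON\t_\t_\t4\tnsubj\t_\t_\n"
    "2-3\tdon't\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "2\tdo\tdo\tAUX\t_\t_\t4\taux\t_\t_\n"
    "3\tn't\tnot\tPART\t_\t_\t4\tadvmod\t_\t_\n"
    "4\tknow\tknow\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

words, pos, deprel = to_parallel_lines(SAMPLE)
# Every stream must have the same token count on every line.
for w, p, d in zip(words, pos, deprel):
    assert len(w.split()) == len(p.split()) == len(d.split())
```

Each list can then be written to its own file (one sentence per line), and a vocabulary built from each file separately.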

Hi Guillaume!

Yes, I went through this file and tried to do exactly that. The problem, as I mentioned, was the size mismatch. I don’t understand it, since each token has a single value per feature (such as a POS tag). That’s why I asked whether the following way of concatenating inputs and features (as given here) is supported:

word|feat1|feat2|...

If this works, that would be great. Thanks!

This input format is not supported in OpenNMT-tf.
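If the data already exists in that concatenated form, it could be split into separate aligned streams with a small script. The following is only a sketch: `split_factored_line` is a hypothetical helper written for this example, and the `|` separator is taken from the format shown above and should be adjusted if the data uses a different one.

```python
def split_factored_line(line, sep="|"):
    """Split 'word|feat1|feat2 ...' into one space-joined string per stream."""
    tokens = line.split()
    if not tokens:
        return []
    columns = [tok.split(sep) for tok in tokens]
    n_streams = len(columns[0])
    for tok, cols in zip(tokens, columns):
        # Every token must carry the same number of fields, otherwise the
        # resulting streams would no longer be aligned.
        if len(cols) != n_streams:
            raise ValueError(f"token {tok!r} has {len(cols)} fields, expected {n_streams}")
    return [" ".join(cols[i] for cols in columns) for i in range(n_streams)]


streams = split_factored_line("the|DET|det cat|NOUN|nsubj")
print(streams)  # ['the cat', 'DET NOUN', 'det nsubj']
```

The first returned string is the word stream and the remaining ones are the feature streams, each of which would go into its own file.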

I will try the ParallelInputter approach again. Otherwise, since this is available in OpenNMT-py, switching to that wouldn’t be a hassle either. Thanks for the help!
