POS tagging/word features and BPE encoding tutorial needed for OpenNMT-py

pulkitjoshi · June 23, 2021, 9:30am

Hi, I am new to NMT. Can someone point me to a tutorial on how to use OpenNMT-py with POS tagging/word features and BPE to train a single model? A pointer to a complete OpenNMT-py tutorial would also be really helpful. Even a set of steps as a guide would help.
Thanks in advance.

SamuelLacombe · June 24, 2021, 1:24pm

Here’s an old tutorial:

POS tag

I haven’t tried it myself, so I’m not sure whether it’s outdated.

miguelknals · June 24, 2021, 9:57pm

Hi

I hope @panosk is around and can help us. I once tried to use features for case control in a newer OpenNMT-py version, and it was not working, so I would first check whether features work at all.

Be aware that you are diving into a hard problem, as tagging your corpus is not an easy task. You need a reliable tagger for both source and target.

Also, as with many things in NMT, improvements are sometimes difficult to measure, and I suspect this is one of them: the score will not change very much and your tagger won’t be perfect, so you won’t be sure whether it makes a difference or not.

But my main point here is about BPE: I am not sure you can tag a word that has been split by BPE. I mean, you can tag a word with a POS, but how will you handle the tag once BPE has split the word into subwords?

Have a nice day!
Miguel

panosk · June 25, 2021, 7:59am

Hi,

The POS tagging tutorial is indeed outdated. It was written for OpenNMT-lua and RNN models, which allowed the factors to be embedded directly in the words of the training corpus. It also assumed that no BPE was used.
For this to work in the current OpenNMT-tf/py versions, the factors (POS tags in this case) need to be in a separate file. That part is straightforward: the tagger writes the POS tags to a separate file anyway, so you just need to omit the merging step (combine factors) in the tutorial. The problem comes when BPE is applied, as this requires another step that repeats the POS tag of each word for every subword created from it.
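As a rough illustration of that extra step, here is a minimal Python sketch. It assumes the BPE output marks non-final subwords with a trailing `@@` (the subword-nmt convention) and that the tag file has one whitespace-separated tag per original word on each line. The function name `expand_tags` is hypothetical and not part of OpenNMT; adapt the joiner to whatever your tokenizer emits.

```python
def expand_tags(bpe_line, tag_line, joiner="@@"):
    """Repeat each word-level tag for every subword of that word.

    Assumes a subword that ends with `joiner` continues into the next
    subword, and that `tag_line` holds one tag per original word.
    """
    subwords = bpe_line.split()
    tags = tag_line.split()
    expanded = []
    word_idx = 0  # index of the original word the current subword belongs to
    for subword in subwords:
        if word_idx >= len(tags):
            raise ValueError("more words than tags on this line")
        expanded.append(tags[word_idx])
        # A subword without the joiner closes the current word.
        if not subword.endswith(joiner):
            word_idx += 1
    if word_idx != len(tags):
        raise ValueError("word count and tag count differ on this line")
    return " ".join(expanded)


if __name__ == "__main__":
    bpe = "the trans@@ form@@ er works"
    pos = "DET NOUN VERB"
    print(expand_tags(bpe, pos))
```

Running this over the BPE-encoded corpus and the tag file line by line would give a feature file with exactly one tag per subword, which is the alignment described above. The length checks are there because a single mismatched line silently shifts every following tag.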
IMO, it’s too much trouble, and transformers perform extremely well on word agreement, so I’m not sure POS tagging will give any noticeable improvement.