Tok_(src|tgt)_case_feature + DynData training: <unk> is output instead of L or N

Etienne38 · October 6, 2017, 7:37am

When training with automatic case features, the translations contain the <unk> feature everywhere an L or N feature would be expected. This happens both at training time, when translating the validation data, and with translate.lua.
The C and U case features seem to be set correctly in the right places.
Of course, since <unk>, L, and N all leave the final token unchanged, this has no real impact on the final translated sentences.
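
To illustrate why the final sentences are unaffected, here is a minimal Python sketch of case restoration. It is not the OpenNMT code: the helper name `restore_case` and the exact handling of each label are assumptions made for illustration. Only C and U modify the token, so L, N, and <unk> all produce the same output.

```python
def restore_case(token: str, feature: str) -> str:
    """Apply a predicted case feature to a lowercased token.

    Hypothetical helper, not part of OpenNMT.
    """
    if feature == "C":
        # Capitalize the first character only.
        return token[:1].upper() + token[1:]
    if feature == "U":
        # Uppercase the whole token.
        return token.upper()
    # L, N and <unk> all fall through: the token is left unchanged.
    return token
```

Under these assumptions, `restore_case("hello", "L")`, `restore_case("hello", "N")`, and `restore_case("hello", "<unk>")` return the same string, which is why the problem is only visible in the feature output.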

guillaumekln · October 6, 2017, 10:30am

Can you check the feature vocabulary you provided to the training script?

I suspect it contains l and n instead of L and N.

Etienne38 · October 6, 2017, 11:23am

During training, I didn’t provide any feature vocabulary, since it’s supposed to be generated automatically with the tok_(src|tgt)_case_feature options.
See here: https://github.com/OpenNMT/OpenNMT/issues/384#issuecomment-333027478

In this training run, I already saw the problem when the validation set was translated.

During translation, the input sentences are correctly annotated with the uppercase C, L, N, and U features.

guillaumekln · October 6, 2017, 2:19pm

Thank you for testing these features.

There was an issue when generating the case vocabulary. Fixed by:

Etienne38 · October 6, 2017, 2:34pm

Did this have an impact on the training quality? Or was it just a problem with the translation output?

guillaumekln · October 6, 2017, 2:39pm

It impacted training as well, since all “L” and “N” features were mapped to a single “<unk>” token.
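
The effect described above can be sketched in Python. This is only an illustration, assuming a dictionary-based feature vocabulary that falls back to `<unk>` for missing labels. The names `build_lookup` and `encode` are hypothetical, and the content of the "broken" vocabulary is made up for the example, since the thread does not say what the faulty vocabulary actually contained.

```python
UNK = "<unk>"

def build_lookup(labels):
    """Map each feature label to an integer id, reserving id 0 for <unk>."""
    vocab = {UNK: 0}
    for label in labels:
        vocab.setdefault(label, len(vocab))
    return vocab

def encode(features, vocab):
    """Convert feature labels to ids; labels missing from vocab become <unk>."""
    return [vocab.get(f, vocab[UNK]) for f in features]

features = ["C", "L", "N", "U", "L"]

# Assumed faulty vocabulary: L and N are missing, so both collapse to <unk>.
broken_ids = encode(features, build_lookup(["C", "U"]))

# Complete vocabulary: every label keeps its own id.
fixed_ids = encode(features, build_lookup(["C", "L", "N", "U"]))
```

With the incomplete vocabulary, L and N share id 0, so the model is trained on targets where the two labels are indistinguishable and can only ever predict `<unk>` in their place.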