Tok_(src|tgt)_case_feature + DynData training: translation produces <unk> rather than L or N

When training with automatic case features, the obtained translations produce the <unk> feature everywhere an L or an N feature should appear instead. This happens both at training time, when translating the validation data, and with translate.lua.
The C and U case features seem to be properly set in the right places.
Of course, since <unk>, L, and N all mark tokens whose surface form is left unchanged, this has no real impact on the final translated sentences.
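For readers unfamiliar with the convention, here is a minimal Python sketch (an illustration, not OpenNMT's actual Lua code) of how a per-token case feature is typically derived; the C/U/L/N letters and the ￨ separator follow the OpenNMT tokenizer's convention, and the M (mixed) fallback is my assumption for completeness:

```python
def case_feature(token: str) -> str:
    """Return the case feature of a token:
    C = capitalized, U = all-uppercase, L = all-lowercase,
    N = no letters, M = mixed (assumed fallback)."""
    letters = [c for c in token if c.isalpha()]
    if not letters:
        return "N"  # no letters: digits, punctuation, ...
    if all(c.islower() for c in letters):
        return "L"  # all lowercase
    if all(c.isupper() for c in letters):
        return "U"  # all uppercase
    if token[0].isupper() and all(c.islower() for c in letters[1:]):
        return "C"  # capitalized: first letter upper, rest lower
    return "M"      # mixed case

tokens = ["Hello", "NASA", "world", "42", "iPhone"]
print(" ".join(f"{t}\uffe8{case_feature(t)}" for t in tokens))
# → Hello￨C NASA￨U world￨L 42￨N iPhone￨M
```

The bug described above would show up in the decoder output as `world￨<unk>` and `42￨<unk>` instead of `world￨L` and `42￨N`.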

Can you check the feature vocabulary you provided to the training script?

I suspect it contains l and n instead of L and N.

During training, I didn’t provide any feature vocabulary, since it’s supposed to be automatic with the tok_(src|tgt)_case_feature options.
See here: https://github.com/OpenNMT/OpenNMT/issues/384#issuecomment-333027478

In this training, I already got this problem with the validation set translation.

During translation, the provided sentences are correctly enriched with the uppercase C/L/N/U feature characters.

Thank you for testing these features.

There was an issue when generating the case vocabulary. Fixed by:

Did this have an impact on the training quality? Or was it just a problem in the translation output?

It impacted the training as well, since all “L” and “N” features were mapped to a single “<unk>” token.
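To illustrate why a wrongly-generated feature vocabulary collapses L and N this way, here is a toy lookup (a hypothetical dictionary, not OpenNMT's actual code): any feature absent from the vocabulary falls back to the shared <unk> index, so the feature embedding can no longer distinguish L from N.

```python
# Hypothetical feature vocabulary: suppose the generation bug left
# "L" and "N" out (e.g. lowercase entries were emitted instead).
feat_vocab = {"<unk>": 0, "C": 1, "U": 2}

def lookup(feature: str) -> int:
    # Any feature missing from the vocabulary falls back to <unk>.
    return feat_vocab.get(feature, feat_vocab["<unk>"])

print([lookup(f) for f in ["C", "L", "N", "U"]])
# → [1, 0, 0, 2]  (L and N collapse to the same index)
```

Since both features share one embedding vector, the model loses the L/N distinction during training, not just at output time.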
