Tok_(src|tgt)_case_feature + DynData training, translate <unk> rather than L or N

(Etienne Monneret) #1

When training with automatic case features, the translations contain the <unk> feature everywhere an L or N feature would be expected. This happens both at training time, when translating the validation data, and with translate.lua.
The C and U case features seem to be set correctly in the right places.
Of course, since <unk>, L and N all leave the final token unchanged, this has no real impact on the final translated sentences.
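
To illustrate why the output text is unaffected, here is a minimal Python sketch of how a case feature could be applied to a lowercased token. The restore_case helper is hypothetical, not the actual OpenNMT code; it only assumes that C capitalizes, U uppercases, and any other label leaves the token as-is.

```python
def restore_case(token: str, feature: str) -> str:
    """Hypothetical helper: apply a predicted case feature to a lowercased token."""
    if feature == "C":
        # Capitalize only the first character.
        return token[:1].upper() + token[1:]
    if feature == "U":
        # Uppercase the whole token.
        return token.upper()
    # L, N and <unk> all fall through here: the token is left unchanged.
    return token


tokens = ["hello", "nasa", "world", "!"]
expected_features = ["C", "U", "L", "N"]
observed_features = ["C", "U", "<unk>", "<unk>"]

expected_text = " ".join(restore_case(t, f) for t, f in zip(tokens, expected_features))
observed_text = " ".join(restore_case(t, f) for t, f in zip(tokens, observed_features))

# Both feature sequences restore the same surface text.
print(expected_text)
print(observed_text)
```

Under that assumption, both sequences give the same sentence, so the problem is only visible in the feature stream.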

(Guillaume Klein) #2

Can you check the feature vocabulary you provided to the training script?

I suspect it contains l and n instead of L and N.

(Etienne Monneret) #3

During training, I didn’t provide any feature vocabulary, since it’s supposed to be automatic with the tok_(src|tgt)_case_feature options.
See here:

In this training run, I already saw the problem when translating the validation set.

During translation, the input sentences are correctly annotated with the uppercase C, L, N and U features.

(Guillaume Klein) #4

Thank you for testing these features.

There was an issue when generating the case vocabulary. Fixed by:

(Etienne Monneret) #5

Did this have an impact on the training quality? Or was it just a problem in the translation output?

(Guillaume Klein) #6

It impacted the training as well, since all “L” and “N” features were mapped to a single “<unk>” token.
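
As a rough illustration of that effect, here is a minimal Python sketch of a feature vocabulary lookup. It is not the actual OpenNMT code: build_vocab and encode are hypothetical helpers, and the broken vocabulary simply assumes lowercase "l" and "n" entries, as suspected in #2.

```python
UNK = "<unk>"


def build_vocab(labels):
    """Hypothetical helper: map each label to an id, reserving id 0 for <unk>."""
    return {label: idx for idx, label in enumerate([UNK] + list(labels))}


def encode(features, vocab):
    """Look up each feature; labels missing from the vocabulary fall back to <unk>."""
    return [vocab.get(f, vocab[UNK]) for f in features]


# Assumption for illustration: the generated vocabulary has no "L" or "N" entry.
broken_vocab = build_vocab(["C", "l", "n", "U"])
fixed_vocab = build_vocab(["C", "L", "N", "U"])

features = ["C", "L", "N", "U", "L"]

broken_ids = encode(features, broken_vocab)
fixed_ids = encode(features, fixed_vocab)

# With the broken vocabulary, L and N share the <unk> id (0),
# so the model never sees them as distinct training targets.
print(broken_ids)
print(fixed_ids)
```

With a vocabulary like the broken one, the model is trained to predict <unk> wherever the data says L or N, which matches the output reported in #1.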