Has anyone tried digging into the average prediction score (average = per word) to build a confidence score?
This does not seem to be as straightforward as it should be in theory…
At times I get very small scores (close to zero) for bad translations and very negative scores for translations that are not so bad…
Any insights?
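For reference, here is a minimal sketch of the per-word average being discussed. It assumes the prediction score is the sum of natural-log token probabilities of the hypothesis; the function names are made up for illustration and are not part of any toolkit.

```python
import math

def per_word_score(pred_score: float, num_target_tokens: int) -> float:
    """Average log-probability per target token (always <= 0).

    Assumption: pred_score is the summed natural-log probability of the hypothesis.
    """
    if num_target_tokens <= 0:
        raise ValueError("num_target_tokens must be positive")
    return pred_score / num_target_tokens

def naive_confidence(pred_score: float, num_target_tokens: int) -> float:
    """Geometric mean of the token probabilities, a value in (0, 1]."""
    return math.exp(per_word_score(pred_score, num_target_tokens))

# A score near zero maps to a value near 1, a very negative score to a value near 0.
print(naive_confidence(-0.5, 10))
print(naive_confidence(-30.0, 10))
```

This only rescales the score; it says nothing yet about whether a value near 1 actually corresponds to a good translation, which is the problem described above.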
We have been doing some testing on that, and the first thing to do is normalize the score against an expected score for the source length/target length of the pair. In our experiments, it does not seem possible to reliably separate sentences into good and bad translations, but it makes sense to use this information as a sampling weight.
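One possible reading of this, as a rough sketch only: estimate the expected score for pairs of similar source/target length from a held-out set, subtract it, and squash the result into a soft weight. The bucketing by length, the bucket width, the logistic squashing and all names below are assumptions made for illustration, not the setup used in the experiments mentioned above.

```python
import math
from collections import defaultdict
from statistics import mean

def length_bucket(src_len, tgt_len, width=5):
    """Coarse bucket so that pairs of similar length share one expectation."""
    return (src_len // width, tgt_len // width)

def expected_scores(samples, width=5):
    """samples: iterable of (src_len, tgt_len, pred_score). Returns bucket -> mean score."""
    buckets = defaultdict(list)
    for src_len, tgt_len, score in samples:
        buckets[length_bucket(src_len, tgt_len, width)].append(score)
    return {bucket: mean(scores) for bucket, scores in buckets.items()}

def normalized_score(src_len, tgt_len, score, expectation, width=5):
    """Per-token difference between the score and the expected score for this length."""
    expected = expectation.get(length_bucket(src_len, tgt_len, width))
    if expected is None:
        return None  # no held-out pairs of similar length to compare against
    return (score - expected) / max(tgt_len, 1)

def sampling_weight(norm_score, temperature=1.0):
    """Soft weight in (0, 1); 0.5 means 'exactly as expected for this length'."""
    x = max(-50.0, min(50.0, norm_score / temperature))  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-x))

# Toy held-out data, invented for illustration.
held_out = [(10, 12, -6.0), (11, 13, -8.0), (30, 33, -25.0), (31, 34, -35.0)]
expectation = expected_scores(held_out)
z = normalized_score(10, 12, -5.0, expectation)
print(z, sampling_weight(z))
```

The weight is deliberately soft: a pair is never thrown away outright, it is just sampled more or less often, which fits the observation that the score does not separate good from bad cleanly.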
I am quite unclear on this.
If we say the PPL is somehow a good indicator of the model's convergence, I don’t see why the pred_score of a given translation is not a good classifier for good/bad translations.
It should be relevant, no?
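To make the comparison concrete, here is a small sketch of how the two quantities relate, assuming pred_score is the summed natural-log probability of a sentence. PPL is an average over all tokens of a corpus, so two sets of sentences can have the same PPL while their per-sentence scores are spread very differently. The numbers are toy values invented for illustration.

```python
import math

def corpus_perplexity(sentences):
    """sentences: list of (pred_score, num_tokens) pairs."""
    total_logprob = sum(score for score, _ in sentences)
    total_tokens = sum(n for _, n in sentences)
    return math.exp(-total_logprob / total_tokens)

def sentence_perplexities(sentences):
    """Per-sentence perplexity, i.e. exp of minus the per-word score."""
    return [math.exp(-score / n) for score, n in sentences]

uniform = [(-10.0, 10), (-10.0, 10), (-10.0, 10)]
spread = [(-1.0, 10), (-4.0, 10), (-25.0, 10)]

# Same corpus-level PPL, very different sentence-level values.
print(corpus_perplexity(uniform), sentence_perplexities(uniform))
print(corpus_perplexity(spread), sentence_perplexities(spread))
```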
IMHO: it is relevant, but not discriminative enough. We are talking about a translation task where a sentence can have hundreds of good translations, thousands of not-so-good ones, a large number of translations that are somewhat related or context-dependent, and an infinite number of misaligned translations. I don’t think we can hope that a single score will be able to classify all of these (knowing also that the boundary between these categories is not clear even to humans). Also, if our system were complete, it would have to know what a good and a bad translation is, but we are training systems with 40M+ sentences that are still learning…