We have been doing some testing on that. The first step is to normalize the score against an expectation corresponding to the source/target length of the pair. From our experiments, it does not seem possible to reliably categorize sentences as good or bad translations with this score alone, but it makes sense to use this information as a sampling weight (a rough sketch of what we mean is below).
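To make that concrete, here is a minimal sketch, assuming the sentence-level pred_score is normalized against the average score of pairs in the same source/target length bucket and then softmaxed into sampling weights. The bucketing, the subtraction, and the temperature are assumptions for illustration, not the exact recipe we use:

```python
import math
from collections import defaultdict

def length_bucket(src_len, tgt_len, size=5):
    # Group pairs into coarse (source, target) length buckets.
    return (src_len // size, tgt_len // size)

def normalized_scores(pairs):
    # pairs: list of dicts with 'src_len', 'tgt_len', 'pred_score'.
    buckets = defaultdict(list)
    for p in pairs:
        buckets[length_bucket(p['src_len'], p['tgt_len'])].append(p['pred_score'])
    expected = {b: sum(s) / len(s) for b, s in buckets.items()}
    # Positive = better than expected for this length, negative = worse.
    return [p['pred_score'] - expected[length_bucket(p['src_len'], p['tgt_len'])]
            for p in pairs]

def sampling_weights(norm_scores, temperature=1.0):
    # Softmax over normalized scores: better-than-expected pairs are
    # sampled more often, but nothing is hard-filtered out.
    exps = [math.exp(s / temperature) for s in norm_scores]
    total = sum(exps)
    return [e / total for e in exps]
```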
If we accept that PPL is a good indicator of model convergence, I don't see why the pred_score of a given translation would not also be a good classifier for good/bad translations.
It should be relevant, no?
IMHO: it is relevant, but not discriminant enough. We are talking about a translation task where a given source can have hundreds of good translations, thousands of not-so-good ones, a large number of translations that are only loosely related or context-dependent, and an infinite number of misaligned translations. I don't think we can hope that a single score will classify all of these, especially since the boundaries between these categories are not even clear to humans. Also, a system that could reliably tell good from bad translations would essentially already have solved the task, whereas we are training systems on 40M+ sentences that are still learning…
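As a purely hypothetical illustration of "weight, not classifier": with weights like the ones sketched above, training pairs would be drawn in proportion to their score rather than kept or dropped by a threshold (the batch size and the simple categorical draw here are assumptions, not a description of anyone's actual pipeline):

```python
import random

def draw_minibatch(pairs, weights, batch_size=64):
    # Sample training pairs in proportion to their weight
    # instead of applying a hard good/bad cutoff.
    return random.choices(pairs, weights=weights, k=batch_size)
```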