Compare BLEU score between models


I would like to compare some models which all have a different output language. Is there any consensus on the best approach to do this?

BLEU score is not a good metric here, since the average number of words per sentence differs between languages. But since I have to handle many languages, I thought about categorizing them by their average number of words per sentence, so that the BLEU scores become somewhat more comparable.
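A minimal sketch of that bucketing idea, assuming whitespace tokenization and an arbitrary bucket width (both `avg_words_per_sentence` and `bucket_languages` are invented names for illustration):

```python
from statistics import mean

def avg_words_per_sentence(sentences):
    """Average whitespace-token count per sentence."""
    return mean(len(s.split()) for s in sentences)

def bucket_languages(corpora, bucket_width=5.0):
    """Group languages whose average sentence length falls into the
    same bucket; BLEU scores would then only be compared within a
    bucket. `corpora` maps language code -> list of sentences."""
    buckets = {}
    for lang, sents in corpora.items():
        key = int(avg_words_per_sentence(sents) // bucket_width)
        buckets.setdefault(key, []).append(lang)
    return buckets

# Toy example with made-up sentences, not real evaluation data.
corpora = {
    "en": ["the cat sat on the mat", "it rained all day long"],
    "fr": ["le chat était assis sur le tapis", "il a plu toute la journée"],
    "de": ["die Katze saß auf der Matte", "es regnete den ganzen Tag"],
}
print(bucket_languages(corpora))
```

Note that whitespace splitting is only a rough proxy; for languages without whitespace word boundaries you would need a proper tokenizer.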

Would appreciate if anyone has a better idea!

Best regards,



Why do you need to compare models with different output languages?

They can’t be compared directly with a scoring metric, since the reference outputs are not the same.


That’s a legitimate question. The goal is to evaluate the level of translation quality for each of the languages that the non-profit organization I’m working with supports.

I’m not looking for something perfect, just something close enough to be comparable.

So far, I’m planning to compute the average number of words translated per language relative to a specific source language. The languages that are close on that measure will be considered comparable.
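That comparison could be sketched as a simple length ratio against the shared source corpus. The function names, the tolerance value, and the whitespace tokenization are all assumptions for illustration, not part of the original plan:

```python
def length_ratio(source_sentences, target_sentences):
    """Total target words divided by total source words over a
    parallel corpus (whitespace tokenization as a rough proxy)."""
    src_words = sum(len(s.split()) for s in source_sentences)
    tgt_words = sum(len(s.split()) for s in target_sentences)
    return tgt_words / src_words

def comparable(ratio_a, ratio_b, tolerance=0.1):
    """Treat two target languages as roughly comparable when their
    length ratios against the same source differ by at most
    `tolerance` (an arbitrary threshold chosen here)."""
    return abs(ratio_a - ratio_b) <= tolerance
```

For example, two target languages with ratios of 0.95 and 1.0 against the same English source would fall inside the default tolerance and be grouped together.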

This is meant to let us evaluate the pre-filled translations and carefully decide which languages to put more effort into.
