I would like to compare several models, each of which produces output in a different language. Is there any consensus on the best way to do this?
BLEU is not a good metric for this, since the average number of words per sentence differs between languages. Because I have to handle many languages, I thought about grouping them by their average number of words per sentence, so that BLEU scores become somewhat more comparable within each group.
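To make the grouping idea concrete, here is a rough sketch of what I have in mind. It assumes whitespace tokenization (which would not work for languages written without spaces, such as Chinese or Japanese), and the function names, the `bucket_width` parameter, and the toy sentences are all made up for illustration:

```python
from collections import defaultdict


def avg_words_per_sentence(sentences):
    """Mean whitespace-token count over a list of sentences."""
    counts = [len(s.split()) for s in sentences]
    return sum(counts) / len(counts) if counts else 0.0


def bucket_languages(corpora, bucket_width=5):
    """Group language codes by average sentence length.

    corpora: dict mapping a language code to a list of reference sentences.
    bucket_width: hypothetical bin size in words; would need tuning.
    """
    buckets = defaultdict(list)
    for lang, sentences in corpora.items():
        avg = avg_words_per_sentence(sentences)
        # Lower edge of the bin that contains this average.
        lower = int(avg // bucket_width) * bucket_width
        buckets[(lower, lower + bucket_width)].append(lang)
    return {k: sorted(v) for k, v in buckets.items()}


# Toy data, only to show the shape of the input and output.
corpora = {
    "en": ["the cat sat on the mat", "it is raining today"],
    "de": ["die Katze saß auf der Matte", "es regnet heute"],
    "fi": ["kissa istui matolla", "tänään sataa"],
}
groups = bucket_languages(corpora, bucket_width=3)
print(groups)
```

BLEU scores would then only be compared between models whose output languages land in the same bucket.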
I would appreciate it if anyone has a better idea!