Corpus-level TER averaging


While working on some scoring issues, I adapted your tercom implementation in Python. Looking at your code, I am under the impression that, at the corpus level, you average directly over the number of segments. I would rather have done a weighted average based on the number of tokens in each segment. (Please find the code in question below.)

Indeed, wouldn't such a non-weighted average misrepresent the score at the corpus level? For example, a short segment scored well and a long segment scored poorly would carry the same weight in the final result.

What was the rationale behind this?
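
To make the concern concrete, here is a minimal sketch in plain Lua (the segments, edit counts, and reference lengths are made up for illustration): averaging per-segment TER gives a short, perfect segment the same weight as a long, poor one, while weighting by reference tokens does not.

local segments = {
  { edits = 0,  ref_len = 3  },  -- short segment, TER = 0.00
  { edits = 20, ref_len = 40 },  -- long segment,  TER = 0.50
}

local sum_ter, total_edits, total_len = 0, 0, 0
for _, seg in ipairs(segments) do
  sum_ter = sum_ter + seg.edits / seg.ref_len
  total_edits = total_edits + seg.edits
  total_len = total_len + seg.ref_len
end

print(string.format("per-segment average: %.3f", sum_ter / #segments))    -- 0.250
print(string.format("token-weighted TER:  %.3f", total_edits / total_len)) -- 0.465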

local function calculate_ter(cand, refs)
  local score = 0
  local nb = 0
  local score_breakdown = torch.Tensor(5):zero()  -- Ins/Del/Sub/Shft/WdSh accumulators
  for k, _ in ipairs(cand) do
    local best_score, best_path, _, _, best_allshift =
              score_sent(k, cand, refs)

    -- accumulate the per-sentence TER score and the edit breakdown
    score = score + best_score
    score_breakdown:add(get_score_breakdown(best_path, best_allshift))

    nb = nb + 1
  end
  -- corpus score: plain average of per-sentence scores over the segment count
  local score_detail =
      string.format("TER = %.2f (Ins %.1f, Del %.1f, Sub %.1f, Shft %.1f, WdSh %.1f)",
          score*100/nb, score_breakdown[1], score_breakdown[2], score_breakdown[3],
          score_breakdown[4], score_breakdown[5])
  return score/nb, score_detail
end

Hi François, you are right! In Snover's implementation, the document-level score is the total number of edits divided by the total number of words. During the implementation, I made sure that at the sentence level we had the expected score, but I overlooked this. Can you please file an issue, and I will fix it.
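
In that spirit, a minimal sketch of the Snover-style aggregation (an illustration, not the actual patch; corpus_ter and its inputs are hypothetical): accumulate raw edit counts and reference lengths across sentences, then divide once at the end.

-- Hypothetical sketch: document-level TER as total edits / total reference words.
-- edit_counts and ref_lengths are parallel arrays of per-sentence values.
local function corpus_ter(edit_counts, ref_lengths)
  local total_edits, total_words = 0, 0
  for i = 1, #edit_counts do
    total_edits = total_edits + edit_counts[i]
    total_words = total_words + ref_lengths[i]
  end
  return total_edits / total_words
end

print(corpus_ter({0, 20}, {3, 40}))  -- 0.465..., vs. 0.250 from segment averaging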

fixed in:
