While working on some scoring issues, I adapted your tercom implementation to Python. Looking at your code, I am under the impression that, at the corpus level, you average the segment scores directly by the number of segments. I would rather have used a weighted average based on the number of tokens in each segment. (The code in question is quoted below.)
Indeed, wouldn’t such a ‘non-weighted’ average misrepresent quality at the corpus level? E.g. a short, well-scored segment and a long, poorly scored segment would carry the same weight in the final result.
What was the rationale behind this?
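To make the concern concrete, here is a minimal Python sketch of the two ways of aggregating. It is not taken from your code or from my port: the function names, the edit counts and the reference lengths are all made up for illustration, and I am assuming the weight for each segment is its reference length in tokens.

```python
def segment_average_ter(edits, ref_lengths):
    """Unweighted mean of per-segment TER: every segment counts equally."""
    scores = [e / n for e, n in zip(edits, ref_lengths)]
    return sum(scores) / len(scores)


def length_weighted_ter(edits, ref_lengths):
    """Per-segment TER weighted by reference length.

    Weighting each score e/n by n and dividing by the total weight
    reduces to total edits / total reference tokens.
    """
    return sum(edits) / sum(ref_lengths)


# Hypothetical data: one short, perfect segment and one long, poor segment.
edits = [0, 20]        # edit counts per segment (assumed values)
ref_lengths = [4, 40]  # reference token counts per segment (assumed values)

print(segment_average_ter(edits, ref_lengths))  # 0.25
print(length_weighted_ter(edits, ref_lengths))  # about 0.45
```

With these numbers the short segment pulls the unweighted average down to 0.25, while the length-weighted figure stays close to the score of the long segment, which holds most of the tokens.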
    local function calculate_ter(cand, refs)
      local score = 0
      local nb = 0
      local score_breakdown = torch.Tensor(5)
      for k,_ in ipairs(cand) do
        local best_score, best_path, _, _, best_allshift = score_sent(k, cand, refs)
        score = score + best_score
        score_breakdown:add(get_score_breakdown(best_path, best_allshift))
        nb = nb + 1
      end
      score_breakdown:div(nb)
      local score_detail = string.format("TER = %.2f (Ins %.1f, Del %.1f, Sub %.1f, Shft %.1f, WdSh %.1f)",
        score*100/nb, score_breakdown[1], score_breakdown[2], score_breakdown[3], score_breakdown[4], score_breakdown[5])
      return score/nb, score_detail
    end