How to compare two testsets

alokaf · March 16, 2021, 7:06pm

I have two test sets that give BLEU scores of 20.93 and 26.40 when used to evaluate an NMT model.

Both test sets were created from the same domain data, and they contain 1468 and 1432 sentences, respectively. The statistics are as follows.

test_A.src
Maximum sentence Length : 40
Minimum sentence Length : 8
Mean sentence Length : 19.78
Median sentence Length : 19.0
Mode sentence Length : 8

test_B.src
Maximum sentence Length : 49
Minimum sentence Length : 4
Mean sentence Length : 18.29
Median sentence Length : 17.0
Mode sentence Length : 5

I need to find a way to explain the difference in the scores. What kind of analysis would help to justify the difference?

francoishernandez · March 16, 2021, 8:32pm

Maybe you could visualize some distributions (for instance length, sentence BLEU).
The lower mode (and lower mean/median) of test_B compared with test_A may indicate that test_B is ‘easier’, hence the difference.
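To look at the length distributions rather than only the summary statistics, something like the following rough sketch could help. It assumes whitespace-tokenized source sentences; `length_histogram`, `print_histogram`, and the toy sentences are made up for illustration and would be replaced by the lines read from test_A.src and test_B.src.

```python
from collections import Counter


def length_histogram(sentences, bucket_size=5):
    """Count sentences per token-length bucket (whitespace tokenization assumed)."""
    counts = Counter()
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Map the length to the start of its bucket, e.g. 7 -> 5 for bucket_size=5.
        bucket_start = (n_tokens // bucket_size) * bucket_size
        counts[bucket_start] += 1
    return dict(sorted(counts.items()))


def print_histogram(name, histogram, bucket_size=5):
    """Print one text bar per bucket, scaled by the share of sentences in it."""
    total = sum(histogram.values())
    print(name)
    for start, count in histogram.items():
        share = count / total
        bar = "#" * round(share * 40)
        print(f"  {start:>3}-{start + bucket_size - 1:<3} {count:>5} {share:6.1%} {bar}")


# Toy data standing in for the real files (hypothetical sentences).
toy_a = ["this is a short sentence", "a somewhat longer sentence with a few more tokens in it"]
toy_b = ["tiny one", "another short sentence here", "ok"]

print_histogram("test_A", length_histogram(toy_a))
print_histogram("test_B", length_histogram(toy_b))
```

If one set has a clearly larger share of very short sentences, scoring each length bucket separately would show whether the BLEU gap is concentrated there.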
Also, fewer OOVs in test_B might widen the gap here. By the way, that seems like a lot of OOVs compared to the size of your sets.
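For the OOV comparison, a minimal sketch of a token-level OOV rate is shown below. It assumes whitespace tokenization and that the training vocabulary is available as a set of words; `oov_rate` and the toy data are hypothetical names and values, not something from the thread.

```python
def oov_rate(test_sentences, train_vocab):
    """Fraction of test tokens that do not appear in the training vocabulary."""
    tokens = [tok for sentence in test_sentences for tok in sentence.split()]
    if not tokens:
        return 0.0  # Avoid division by zero on an empty test set.
    unknown = sum(1 for tok in tokens if tok not in train_vocab)
    return unknown / len(tokens)


# Toy data standing in for the real training vocabulary and test files.
train_vocab = {"the", "cat", "sat", "on", "mat"}
toy_a = ["the cat ran", "the dog sat"]
toy_b = ["the cat sat", "on the mat"]

print(f"test_A OOV rate: {oov_rate(toy_a, train_vocab):.1%}")
print(f"test_B OOV rate: {oov_rate(toy_b, train_vocab):.1%}")
```

Computing this rate the same way for both sets makes the numbers directly comparable, which the raw OOV counts are not when the sets differ in size.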