How to compare two test sets

I have two test sets that give BLEU scores of 20.93 and 26.40 when used to evaluate an NMT model.

Both test sets were created from the same domain data and contain 1468 and 1432 sentences, respectively. The sentence-length statistics are as follows.

test_A.src
Maximum sentence Length : 40
Minimum sentence Length : 8
Mean sentence Length : 19.78
Median sentence Length : 19.0
Mode sentence Length : 8

test_B.src
Maximum sentence Length : 49
Minimum sentence Length : 4
Mean sentence Length : 18.29
Median sentence Length : 17.0
Mode sentence Length : 5

The number of words each test set shares with the training vocabulary (overlap) and the number of out-of-vocabulary (OOV) words are as follows.
| | A | B |
|---|---|---|
| Train vocab / test overlap | 3623 | 3552 |
| Train vocab / test OOV | 1718 | 1505 |

I need to explain the difference between the two scores. What kind of analysis would help justify it?

Maybe you could visualize some distributions for each test set (for instance sentence length and sentence-level BLEU).
test_B has a lower mode, mean, and median sentence length than test_A, which may indicate that test_B is ‘easier’ and could explain part of the difference.
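To check whether length really explains the gap, one option is to group per-sentence scores by source length and compare the buckets across the two test sets. Below is a minimal sketch of that idea. The function name `bucket_scores`, the bucket width, and the toy data are all made up for illustration; it assumes whitespace tokenization and that you already have one score per sentence (for instance sentence-level BLEU from whatever tool you use).

```python
from collections import defaultdict
from statistics import mean


def bucket_scores(sentences, scores, width=10):
    """Group per-sentence scores by source-length bucket.

    sentences: source sentences (tokenized here by whitespace, an assumption)
    scores:    one score per sentence, in the same order
    Returns {bucket_start: (sentence_count, mean_score)}, sorted by bucket.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    buckets = defaultdict(list)
    for sent, score in zip(sentences, scores):
        length = len(sent.split())
        start = (length // width) * width  # e.g. lengths 10..19 -> bucket 10
        buckets[start].append(score)
    return {start: (len(vals), mean(vals)) for start, vals in sorted(buckets.items())}


# Toy data, purely illustrative (not taken from test_A or test_B).
toy_src = [
    "a b c",
    "a b c d e f g h i j k l",
    "x y",
    "one two three four five six seven eight nine ten",
]
toy_scores = [40.0, 20.0, 60.0, 30.0]

toy_result = bucket_scores(toy_src, toy_scores)
for start, (count, avg) in toy_result.items():
    print(f"length >= {start:>2}: n={count} mean score={avg:.2f}")
```

If both test sets score similarly within each bucket but test_B has more sentences in the short buckets, length is a plausible explanation. If test_B scores higher within the same bucket, something else is going on.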
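Also, the lower OOV count in test_B might widen the gap. By the way, that seems like a lot of OOV words for test sets of this size: if the counts above are distinct word types, roughly 32% of test_A's vocabulary (1718 of 5341) and 30% of test_B's (1505 of 5057) are unseen in training.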
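Counts of distinct OOV words can overstate the problem when most of them are rare, so it may help to also look at the OOV rate over running tokens. Here is a rough sketch, assuming whitespace tokenization and word-level vocabularies (if your model uses subword units, the picture will differ). The name `oov_stats` and the toy sentences are hypothetical.

```python
from collections import Counter


def oov_stats(train_sentences, test_sentences):
    """Compare a test set's vocabulary against the training vocabulary.

    Both arguments are lists of sentences, tokenized by whitespace
    (an assumption made for this sketch).
    """
    train_vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_counts = Counter(tok for sent in test_sentences for tok in sent.split())

    oov_types = {tok for tok in test_counts if tok not in train_vocab}
    total_tokens = sum(test_counts.values())
    oov_tokens = sum(test_counts[tok] for tok in oov_types)

    return {
        "overlap_types": len(test_counts) - len(oov_types),
        "oov_types": len(oov_types),
        # Share of distinct test words that never occur in training.
        "oov_type_rate": len(oov_types) / len(test_counts) if test_counts else 0.0,
        # Share of running test tokens that never occur in training.
        "oov_token_rate": oov_tokens / total_tokens if total_tokens else 0.0,
    }


# Toy data, purely illustrative.
toy_train = ["the cat sat", "the dog ran"]
toy_test = ["the cat ran fast", "a dog sat"]

toy_stats = oov_stats(toy_train, toy_test)
for name, value in toy_stats.items():
    print(name, value)
```

Running this on both test sets gives numbers that are comparable despite the different sizes, and the token-level rate shows how much of the actual text the model has never seen.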