I have two test sets that give BLEU scores of 20.93 and 26.40 when used to evaluate an NMT model.
Both test sets were created from the same domain data, and they contain 1468 and 1432 sentences. The sentence-length statistics are as follows.
test_A.src
Maximum sentence Length : 40
Minimum sentence Length : 8
Mean sentence Length : 19.78
Median sentence Length : 19.0
Mode sentence Length : 8
test_B.src
Maximum sentence Length : 49
Minimum sentence Length : 4
Mean sentence Length : 18.29
Median sentence Length : 17.0
Mode sentence Length : 5
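
Statistics like these can be reproduced along the following lines. This is only a sketch: it assumes sentence length means the number of whitespace-separated tokens, and the function name `length_stats` is hypothetical, not necessarily the script used for the numbers above.

```python
from statistics import mean, median, mode

def length_stats(sentences):
    """Return basic length statistics for a list of sentences.

    Assumption: length is the count of whitespace-separated tokens.
    """
    lengths = [len(s.split()) for s in sentences]
    return {
        "max": max(lengths),
        "min": min(lengths),
        "mean": round(mean(lengths), 2),
        "median": median(lengths),
        "mode": mode(lengths),
    }

# Tiny illustrative input; in practice, read the lines of test_A.src or test_B.src.
sample = ["a b c", "a b", "a b c d e", "a b"]
stats = length_stats(sample)
print(stats)
```

Running the same function on both test files makes it easy to compare the two length distributions on equal terms.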
The vocabulary overlap and out-of-vocabulary (OOV) word counts between the training set and each test set are as follows.
| | A | B |
|---|---|---|
| TrainVoc/Test Overlap | 3623 | 3552 |
| TrainVoc/Test OOV | 1718 | 1505 |
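
Counts of this kind can be obtained roughly as follows. This is a sketch under stated assumptions: the counts are taken over unique word types, tokens are split on whitespace, and the helper names `vocab` and `overlap_and_oov` are hypothetical.

```python
def vocab(sentences):
    """Set of unique whitespace-separated tokens (assumed tokenization)."""
    return {tok for s in sentences for tok in s.split()}

def overlap_and_oov(train_sentences, test_sentences):
    """Return (overlap, oov) counts of test word types relative to the training vocabulary."""
    train_vocab = vocab(train_sentences)
    test_vocab = vocab(test_sentences)
    overlap = test_vocab & train_vocab   # test types also seen in training
    oov = test_vocab - train_vocab       # test types never seen in training
    return len(overlap), len(oov)

# Tiny illustrative data; in practice, pass the training source and one test source.
train = ["the cat sat", "a dog ran"]
test = ["the dog barked", "a bird sang"]
n_overlap, n_oov = overlap_and_oov(train, test)
print(n_overlap, n_oov)
```

Because the two test sets differ in size, dividing the OOV count by the size of each test vocabulary gives a rate that is easier to compare than the raw counts.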
I need to find a way to explain the difference between the two scores. What kind of analysis would help justify the difference?