Large discrepancy in performance on validation set vs. test set, agglutinative languages only

I’m working on NMT for South African languages, using (at this point) the Marian NMT and OpenNMT-tf frameworks, so far always training models to translate from English into the other language. The training set sizes are fairly similar for most of the languages: small, but not tiny. I’m using the same model configuration and settings for all languages (so far).
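For concreteness, each language pair is trained with an identical configuration and only the data paths change. On the OpenNMT-tf side the run is driven by something roughly like the sketch below (the model choice, vocabulary/data paths and language codes are placeholders standing in for my setup, not exact values):

```python
import opennmt

# Same settings reused for every language pair; only the data paths differ.
# All file names below are placeholders, not my actual paths.
config = {
    "model_dir": "runs/en-zu",
    "data": {
        "source_vocabulary": "vocab/en.vocab",
        "target_vocabulary": "vocab/zu.vocab",
        "train_features_file": "data/train.en",
        "train_labels_file": "data/train.zu",
        "eval_features_file": "data/dev.en",
        "eval_labels_file": "data/dev.zu",
    },
}

model = opennmt.models.TransformerBase()              # same architecture for every language
runner = opennmt.Runner(model, config, auto_config=True)
runner.train(num_devices=1, with_eval=True)           # periodic evaluation on the dev set
```

The Marian runs follow the same pattern: one shared set of hyperparameters, with only the training/validation files swapped per language.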

In all experiments across both frameworks, there is a bizarrely large discrepancy between the performance of trained models on the validation set and on the test set, but only for the agglutinative languages. For Tswana, Sesotho, Afrikaans, etc., the difference between validation and test performance is small, and manual inspection confirms the translations are reasonable. For Zulu, Xhosa, Ndebele and Venda, however, test set performance is generally below 5 BLEU, sometimes as low as 1, and manual evaluation shows the output to be utter nonsense, while validation set performance is generally above 18 BLEU and fairly reasonable.
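The comparison itself is just corpus-level BLEU on the two held-out sets. A minimal sketch of what I mean, assuming sacreBLEU and hypothetical file names (Zulu shown as the example):

```python
import sacrebleu

def corpus_bleu(hyp_path, ref_path):
    """Corpus-level BLEU between a file of hypotheses and a file of references."""
    with open(hyp_path, encoding="utf-8") as f:
        hyps = [line.strip() for line in f]
    with open(ref_path, encoding="utf-8") as f:
        refs = [line.strip() for line in f]
    return sacrebleu.corpus_bleu(hyps, [refs]).score

# Placeholder paths: dev = validation set, test = held-out test set.
print("dev  BLEU:", corpus_bleu("out/dev.zu.hyp", "data/dev.zu"))
print("test BLEU:", corpus_bleu("out/test.zu.hyp", "data/test.zu"))
```

It is the gap between those two numbers, on the same model and the same scoring setup, that only shows up for the agglutinative languages.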

What could explain this discrepancy, given how specific it is to language type?