Is there a way to score the validation set against multiple references?

I have several valid translations for each of my source sentences, and am training the model by pairing each source sentence with each translation in the train_features_file and train_labels_file:

Features File — Labels File
source-sentence-1 — target-sentence-1-ref1
source-sentence-1 — target-sentence-1-ref2
source-sentence-2 — target-sentence-2-ref1
source-sentence-2 — target-sentence-2-ref2

The eval_features_file and eval_labels_file are set up the same way.

However, when generating a BLEU score during validation using this configuration, the scorer scores the predictions against each translation and retains all of these scores, rather than just using the best score per prediction.

Is there a better way to configure eval_features_file and eval_labels_file for multi-reference scoring?

Heyho Michael A. Martin,
You can score the generated translations against multiple references with NLTK. That is not a way to configure the evaluation itself, but we could write a validation script with it.
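A minimal sketch of what such a validation script could look like, assuming whitespace tokenization and evaluation files that repeat each source sentence once per reference, as in the question. The helper names group_references and multi_reference_bleu are made up for illustration and are not part of OpenNMT-tf or NLTK; only corpus_bleu is an actual NLTK function.

```python
def group_references(sources, references):
    """Collect every reference translation that shares the same source sentence.

    Returns a dict mapping each distinct source line to a list of tokenized
    references, in order of first appearance.
    """
    grouped = {}
    for src, ref in zip(sources, references):
        # Whitespace tokenization is an assumption; use your own tokenizer here.
        grouped.setdefault(src, []).append(ref.split())
    return grouped


def multi_reference_bleu(sources, references, predictions):
    """Corpus-level BLEU with each prediction scored against all its references.

    `predictions` is expected to be aligned line by line with `sources`, so a
    source that appears twice also has two (identical) predictions. Only the
    first prediction per distinct source is kept.
    """
    # Imported here so the grouping helper can be used without NLTK installed.
    from nltk.translate.bleu_score import corpus_bleu

    grouped = group_references(sources, references)

    hypotheses = {}
    for src, pred in zip(sources, predictions):
        hypotheses.setdefault(src, pred.split())

    list_of_references = [grouped[src] for src in grouped]
    hypothesis_list = [hypotheses[src] for src in grouped]
    # corpus_bleu takes one list of references per hypothesis.
    return corpus_bleu(list_of_references, hypothesis_list)
```

You would read the lines of eval_features_file, eval_labels_file and the prediction file into three lists and pass them to multi_reference_bleu. Note that multi-reference BLEU does not pick the best single-reference score per prediction. It clips n-gram counts against the maximum count found in any reference and uses the closest reference length for the brevity penalty. NLTK returns the score in the range 0 to 1.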

Greetings from the translation space

Multi-reference scoring is not supported in OpenNMT-tf.