I have several valid translations for each of my source sentences, and am training the model by pairing each source sentence with each translation in the train_features_file and train_labels_file:
Features File — Labels File
source-sentence-1 — target-sentence-1-ref1
source-sentence-1 — target-sentence-1-ref2
source-sentence-2 — target-sentence-2-ref1
source-sentence-2 — target-sentence-2-ref2
…
The eval_features_file and eval_labels_file are set up the same way.
However, when the BLEU score is computed during validation with this configuration, the scorer scores each prediction against each translation separately and keeps all of these scores, rather than using only the best score per prediction.
Is there a better way to configure eval_features_file and eval_labels_file for multi-reference scoring?
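To make the goal concrete, here is a minimal Python sketch of the layout I have in mind. It is not part of the training toolkit: group_references and to_reference_streams are hypothetical helper names, and the sketch assumes every source sentence has the same number of references. It collapses the duplicated rows into one entry per source and then transposes the references into parallel "streams" (all first references, all second references, and so on), which is the shape that some external multi-reference BLEU scorers accept. Whether the built-in validation scorer can consume such a layout is exactly what I am asking.

```python
def group_references(sources, targets):
    """Collapse duplicated source rows into one entry holding all its references."""
    grouped = {}  # dicts keep insertion order, so the source order is preserved
    for src, tgt in zip(sources, targets):
        grouped.setdefault(src, []).append(tgt)
    return grouped


def to_reference_streams(grouped):
    """Transpose per-source reference lists into parallel reference streams.

    Assumption: every source has the same number of references.
    """
    counts = {len(refs) for refs in grouped.values()}
    if len(counts) != 1:
        raise ValueError("sources have differing numbers of references")
    return [list(stream) for stream in zip(*grouped.values())]


# Rows as they currently appear in the features and labels files.
sources = [
    "source-sentence-1",
    "source-sentence-1",
    "source-sentence-2",
    "source-sentence-2",
]
targets = [
    "target-sentence-1-ref1",
    "target-sentence-1-ref2",
    "target-sentence-2-ref1",
    "target-sentence-2-ref2",
]

grouped = group_references(sources, targets)
unique_sources = list(grouped)           # one row per source sentence
streams = to_reference_streams(grouped)  # one list per reference "slot"
```

With this layout the model would translate each entry of unique_sources once, and each prediction would be scored a single time against all of its references instead of once per row.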