Evaluation on multiple datasets at once

When training with --with_eval, is it possible to run the evaluation on multiple datasets?

Something like:

data:
  [...]

  eval_features_file: 
      - feat.valid1.tok
      - feat.valid2.tok
  eval_labels_file: 
      - lab.valid1.tok
      - lab.valid2.tok

This is not currently implemented.

I’m not sure how to make this feature compatible with early stopping based on the evaluation score. Should we only consider the score of the first dataset? Take the average of all scores? Or disable early stopping entirely?

Any thoughts on that?

Early stopping should remain for consistency with other experiments that use it. Ideally, there could be an argument to select between the first and second options (first dataset only vs. average of all datasets). If only one option is implemented, then maybe the average of all datasets: when evaluating on multiple datasets, we presumably expect the system to work well on all of them.

Kind regards,
Yasmin

I think having multiple datasets could be used in different ways for early stopping. Another option could be to take the minimum score across all datasets, so that training stops as soon as any of the datasets degrades. Or the maximum, in which case training would optimise for at least one dataset…
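
For illustration, the strategies discussed so far could be expressed as a small aggregation helper. This is just a sketch; aggregate_eval_scores and the strategy names are hypothetical, not part of OpenNMT:

import statistics

def aggregate_eval_scores(scores, strategy="average"):
    # Combine per-dataset evaluation scores into a single value for
    # early stopping. Hypothetical helper; assumes higher is better.
    if strategy == "first":
        # Only the first validation dataset drives early stopping.
        return scores[0]
    if strategy == "average":
        # All datasets contribute equally.
        return statistics.mean(scores)
    if strategy == "min":
        # Track the worst dataset: stopping triggers as soon as any
        # dataset stops improving.
        return min(scores)
    if strategy == "max":
        # Track the best dataset: training continues as long as at
        # least one dataset keeps improving.
        return max(scores)
    raise ValueError("Unknown aggregation strategy: %s" % strategy)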

Hi all,

I guess having separate validation data sources would spare us the need to concatenate them on the file system, similar to what is already done with the training data sources. I don't know how the TensorFlow version is implemented, but the PyTorch version could probably be adapted easily…
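
For what it's worth, here is a minimal sketch of what this could look like on the PyTorch side, with one DataLoader per validation source; evaluate_on_datasets and metric_fn are hypothetical names, not existing OpenNMT-py functions:

import torch

def evaluate_on_datasets(model, eval_loaders, metric_fn):
    # Run evaluation on each validation dataset separately instead of
    # concatenating the files on disk; returns one score per dataset.
    model.eval()
    scores = []
    with torch.no_grad():
        for loader in eval_loaders:
            total, batches = 0.0, 0
            for batch in loader:
                total += metric_fn(model, batch)
                batches += 1
            scores.append(total / batches)
    return scores

The per-dataset scores could then be fed to whichever aggregation strategy is chosen for early stopping.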