I’m not sure how to make this feature compatible with early stopping based on the evaluation score. Should we only consider the score of the first dataset? Take the average of all scores? Or disable early stopping entirely?
Early stopping should remain for consistency with other experiments that use it. Ideally, there would be an argument to select between the first two options (first dataset only vs. average of all datasets). If only one option is implemented, then probably the average of all datasets: if we provide multiple datasets for evaluation, I suppose we expect the system to work well on all of them.
I think early stopping could be gamed when there are multiple datasets. Maybe another option could be to take the minimum score across all datasets, so training would stop as soon as any dataset degrades. Or the maximum, in which case training would continue as long as at least one dataset keeps improving… A rough sketch of these options is below.
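Something like this is what I have in mind for the aggregation, just a minimal sketch (the function name, the `mode` strings, and the argument layout are made up for illustration, not anything in the codebase):

```python
def aggregate_eval_scores(scores, mode="average"):
    """Reduce per-dataset evaluation scores to one early-stopping score.

    scores: list of floats, one per validation dataset.
    mode:   'first'   -> score of the first dataset only
            'average' -> mean over all datasets
            'min'     -> stop as soon as any single dataset degrades
            'max'     -> keep going while at least one dataset improves
    """
    if mode == "first":
        return scores[0]
    if mode == "average":
        return sum(scores) / len(scores)
    if mode == "min":
        return min(scores)
    if mode == "max":
        return max(scores)
    raise ValueError(f"unknown aggregation mode: {mode}")
```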
I guess having separate validation data sources would spare us the need to concatenate them in the file system, similar to how the training data sources are handled. I don’t know how the TensorFlow version is implemented, but the PyTorch version could be adapted fairly easily, I’d guess… Roughly along these lines, see below.
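A rough PyTorch-side sketch of what I mean, assuming a list of per-source loaders and a hypothetical `score_fn` callable (both placeholders, not the actual code):

```python
import torch

def evaluate_all(model, eval_loaders, score_fn):
    """Score each validation source separately instead of concatenating
    the files on disk; returns one score per source.

    eval_loaders: list of DataLoader objects, one per validation dataset
    score_fn:     callable(model, loader) -> float, e.g. average loss
    """
    model.eval()
    with torch.no_grad():
        return [score_fn(model, loader) for loader in eval_loaders]
```

The resulting list could then be fed into something like the `aggregate_eval_scores` sketch above to get the single score the early-stopping logic expects.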