In the software, there is an option to average the final models at the end of training. This gives me a slight improvement over my previous scores, which is nice. I also came across the ensemble branch that is now in development, and it gives me a similarly slight improvement. My question is: what exactly is the difference between these two functions?
As I understand it, average_models simply averages the parameters of the models you pass in, producing a single model, while ensembling keeps all the models and averages their predictions at decoding time. The latter therefore also needs more resources in terms of GPU memory and compute. But how different are the two in practice? What are the main differences?
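To make the contrast concrete, here is a minimal toy sketch (not the toolkit's actual code) using NumPy: two tiny linear "models" with a softmax output, where parameter averaging merges the weights first and runs one forward pass, while ensembling runs both forward passes and averages the resulting distributions. The model and variable names are made up for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector of logits.
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(params, x):
    # A toy "model": logits = W @ x, then softmax over the output vocabulary.
    return softmax(params["W"] @ x)

# Two hypothetical checkpoints of the same architecture.
model_a = {"W": np.array([[1.0, 0.0], [0.0, 1.0]])}
model_b = {"W": np.array([[0.5, 0.5], [0.5, 0.5]])}
x = np.array([1.0, 2.0])

# Checkpoint averaging: average the *parameters* once, then do ONE forward pass.
averaged = {k: (model_a[k] + model_b[k]) / 2 for k in model_a}
p_param_avg = predict(averaged, x)

# Ensembling: do EVERY model's forward pass, then average the *predictions*.
p_ensemble = (predict(model_a, x) + predict(model_b, x)) / 2

print(p_param_avg)
print(p_ensemble)
```

Because softmax is nonlinear, the two outputs generally differ: averaging parameters gives one cheap model, while the ensemble pays for a forward pass per member at every decoding step, which is where the extra GPU cost comes from.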
Also, if I use averaging, I'd like to cite some NMT systems that successfully applied this technique, but so far I only find papers on ensemble methods instead. If someone could point me in the right direction, that would be great. Any help is greatly appreciated.