Difference between averaging and ensembling models

In the software, there is an option to average the final models from training. This gives me a slight improvement over my previous scores, which is nice. I also came across the ensemble branch that is now in development, and it gives me the same slight improvement. My question is: what exactly is the difference between these two functions?

As I understand it, average_models simply averages the parameters of the models you give it, while ensemble methods average the predictions of the models. The latter therefore needs more resources in terms of GPUs, since every model has to be run at decoding time. But how different is that, really? What are the main differences?
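To make that distinction concrete, here is a minimal sketch in plain Python. It is not the toolkit's actual code: it assumes a toy "model" that is just a weight matrix followed by a softmax, and `predict`, `average_parameters`, and `ensemble_predict` are hypothetical names made up for illustration. The ensemble here averages probabilities; other combination rules (for example averaging log-probabilities) are possible.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, features):
    # Toy model (an assumption): one linear layer, then softmax.
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(logits)

def average_parameters(models):
    # Parameter averaging: build ONE new model whose weights are the
    # element-wise mean of the input models' weights.
    n = len(models)
    rows, cols = len(models[0]), len(models[0][0])
    return [[sum(m[i][j] for m in models) / n for j in range(cols)]
            for i in range(rows)]

def ensemble_predict(models, features):
    # Ensembling: run EVERY model, then average their output distributions.
    dists = [predict(m, features) for m in models]
    n = len(dists)
    return [sum(d[k] for d in dists) / n for k in range(len(dists[0]))]

# Two made-up weight matrices (2 classes, 2 input features).
model_a = [[3.0, 0.0], [0.0, 0.0]]
model_b = [[-1.0, 0.0], [0.0, 0.0]]
features = [1.0, 1.0]

averaged_model = average_parameters([model_a, model_b])
p_averaged = predict(averaged_model, features)              # one forward pass
p_ensemble = ensemble_predict([model_a, model_b], features)  # one pass per model

print("parameter averaging:", p_averaged)
print("ensembling:         ", p_ensemble)
```

The two results differ because the softmax is nonlinear: averaging weights and then predicting is not the same as predicting and then averaging. It also shows where the cost goes: the averaged model is a single model at decoding time, while the ensemble has to keep and run all of its members.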

Also, if I use averaging, I'd like to refer to some NMT systems that successfully applied this technique, but I only come across ensemble methods instead. If someone can point me in the right direction that would be great. Any help is greatly appreciated.

The averaging approach is introduced in "The AMU-UEDIN Submission to the WMT16 News Translation Task: Attention-based NMT Models as Feature Functions in Phrase-based SMT". Ensembling is totally different and potentially far more powerful: you take different models trained with multiple seeds (or even different parameters), and during beam_search all of the models contribute to the score of each hypothesis.
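As a rough illustration of models contributing to hypothesis scores during beam search, here is a small self-contained sketch. Everything in it is an assumption for the sake of the example: the two "models" are hard-coded toy functions returning next-token probabilities, `ensemble_step` and `beam_search` are hypothetical names, and the ensemble combines models by averaging their probabilities, which is only one possible choice.

```python
import math

EOS = "</s>"

# Two toy "models" (made up): each maps a prefix to next-token probabilities.
def model_a(prefix):
    if not prefix:
        return {"a": 0.6, "b": 0.3, EOS: 0.1}
    return {"a": 0.1, "b": 0.1, EOS: 0.8}

def model_b(prefix):
    if not prefix:
        return {"a": 0.2, "b": 0.7, EOS: 0.1}
    return {"a": 0.1, "b": 0.1, EOS: 0.8}

def ensemble_step(models, prefix):
    # Every model scores the same prefix; the ensemble's log-probability
    # for a token is the log of the mean probability across models.
    dists = [model(prefix) for model in models]
    return {tok: math.log(sum(d[tok] for d in dists) / len(dists))
            for tok in dists[0]}

def beam_search(models, beam_size=2, max_len=3):
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == EOS:
                candidates.append((prefix, score))  # finished hypothesis
                continue
            for tok, logp in ensemble_step(models, prefix).items():
                candidates.append((prefix + (tok,), score + logp))
        # Keep only the best `beam_size` hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

single = beam_search([model_a])
ensemble = beam_search([model_a, model_b])
print("model_a alone:", single[0])
print("ensemble:     ", ensemble[0])
```

In this toy setup model_a on its own prefers to start with "a", but once model_b's scores are mixed in at every step, the best hypothesis starts with "b". That is the sense in which each model influences the search itself rather than only the final output.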