I tried it, and it actually seems to prevent convergence, so I'm holding off on making it a PR until I can get to the bottom of this. I ran unmodified OpenNMT on the first 100k sentences of the English -> German dataset and got this:
After adding batch normalization I got this:
I also incorporated the layer-normalization technique described here and got this:
I suspect it's not working well because we need to store population statistics for each time-step (i.e. the mean and variance over the entire dataset at the first time-step, at the second, and so on), though none of the Torch implementations I found did this.
EDIT: the above is wrong. Because of the way the clones are set up, the network does track statistics per time-step and shares the weight and bias correctly, so I'll keep thinking about why it's not working.
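For anyone following along, here is a minimal sketch of what "statistics per time-step, shared weight and bias" means. It is plain Python over a single scalar feature, not the OpenNMT or Torch code. The class name `PerStepBatchNorm`, the `momentum` default, and the running-average update rule are all assumptions made for illustration.

```python
from typing import List


class PerStepBatchNorm:
    """Hypothetical helper: batch norm on one scalar feature, with a single
    shared gain/bias and separate running statistics for every time-step."""

    def __init__(self, max_steps: int, momentum: float = 0.1, eps: float = 1e-5):
        # Shared across all time-steps (what the clones would share).
        self.gamma = 1.0
        self.beta = 0.0
        self.momentum = momentum
        self.eps = eps
        # One running mean/variance per time-step (population statistics).
        self.running_mean = [0.0] * max_steps
        self.running_var = [1.0] * max_steps

    def forward(self, batch: List[float], t: int, training: bool = True) -> List[float]:
        if training:
            # Normalize with the statistics of the current mini-batch...
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ...and update only the running statistics of time-step t.
            m = self.momentum
            self.running_mean[t] = (1 - m) * self.running_mean[t] + m * mean
            self.running_var[t] = (1 - m) * self.running_var[t] + m * var
        else:
            # At evaluation time, use the stored statistics for time-step t.
            mean, var = self.running_mean[t], self.running_var[t]
        scale = self.gamma / (var + self.eps) ** 0.5
        return [scale * (x - mean) + self.beta for x in batch]
```

Calling `forward(batch, t=0)` and `forward(batch, t=1)` during training updates two independent sets of running statistics, while `gamma` and `beta` stay common to both. That is the behaviour the EDIT says the clones already give.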