Several requests have come in for other types of dropout and normalization: Bayesian dropout, batch normalization, etc.
I wrote an initial version of batch normalization here: https://github.com/rachtsingh/OpenNMT
Unfortunately, it seems the jury’s still out on what works and what doesn’t. I think this is a better way to parametrize it, which I put here, but I’d like to experimentally verify that any of these provide a speedup.
Send a PR with an example showing some preliminary results, and we’ll add it. Also, be sure to read the style guide.
Well, I tried it and it actually seems to prevent convergence, so I’m holding off on making it a PR until I can get to the bottom of this. I ran OpenNMT with no modifications on the first 100k sentences of the English -> German dataset and got this:
After adding batch normalization I got this:
Also I incorporated the layer-normalization technique described here and got this:
I suspect the reason it’s not working well is that we need to store population statistics for each time-step (i.e. the mean and variance over the entire dataset at the first time-step, at the second, etc.), though none of the Torch implementations I found did this.
EDIT: the above is wrong. Because of the way the clones are set up, the network does accurately track statistics per-time-step and shares the weight and bias correctly. So I’ll keep thinking about why it’s not working correctly.
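To make the per-time-step idea concrete, here is a minimal sketch in plain Python (not the Torch code from the branch above). It assumes a single scalar feature, a biased variance estimate, and an exponential moving average for the population statistics. The class name `PerStepBatchNorm` and its parameters are hypothetical and chosen only for illustration. The gain and bias are shared across time-steps, while the running mean and variance are kept separately for each one.

```python
import math


class PerStepBatchNorm:
    """Toy batch normalization with separate running statistics per time-step.

    Hypothetical illustration: one scalar feature, gain and bias shared
    across all time-steps, biased variance estimate.
    """

    def __init__(self, max_steps, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.gain = 1.0  # shared across time-steps
        self.bias = 0.0  # shared across time-steps
        # One running mean and one running variance per time-step.
        self.running_mean = [0.0] * max_steps
        self.running_var = [1.0] * max_steps

    def forward(self, batch, t, training=True):
        """Normalize a list of floats observed at time-step t."""
        if training:
            # Use the statistics of the current mini-batch ...
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # ... and update only the running statistics of time-step t.
            m = self.momentum
            self.running_mean[t] = (1 - m) * self.running_mean[t] + m * mean
            self.running_var[t] = (1 - m) * self.running_var[t] + m * var
        else:
            # At evaluation time, use the stored statistics for time-step t.
            mean = self.running_mean[t]
            var = self.running_var[t]
        scale = self.gain / math.sqrt(var + self.eps)
        return [scale * (x - mean) + self.bias for x in batch]


bn = PerStepBatchNorm(max_steps=2)
bn.forward([1.0, 3.0], t=0)    # updates the statistics of time-step 0 only
bn.forward([10.0, 20.0], t=1)  # updates the statistics of time-step 1 only
```

Each call touches only the statistics of the time-step it is given, which is the same effect as giving every clone its own running mean and variance while sharing the weight and bias.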
You should disable memory optimization during your experiments when you add new modules:
Otherwise, you should protect nn.BatchNormalization's input from being shared across clones. See:
Perfect! I didn’t see MemoryOptimizer (OpenNMT has changed a lot from seq2seq-attn!) and this was definitely the issue.
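For anyone hitting the same problem, the failure mode can be sketched in a few lines of plain Python. This is a toy model only, not how MemoryOptimizer is implemented, and `run_steps` is a hypothetical helper. The assumption is that each clone needs the input it saw in the forward pass to still be intact when the backward pass runs, as a module like batch normalization does.

```python
def run_steps(inputs, share_buffer):
    """Run a toy forward pass over len(inputs) clones.

    Each clone saves its input for the backward pass. Returns what every
    clone would find in its saved input once the forward pass is done.
    """
    shared = [0.0] * len(inputs[0])  # a single buffer reused by all clones
    saved = []
    for step_input in inputs:
        if share_buffer:
            # Memory-optimized: overwrite the shared buffer in place and
            # keep only a reference to it.
            shared[:] = step_input
            saved.append(shared)
        else:
            # Protected: every clone keeps its own copy of its input.
            saved.append(list(step_input))
    return saved


inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(run_steps(inputs, share_buffer=True))   # every clone sees the last input
print(run_steps(inputs, share_buffer=False))  # every clone sees its own input
```

With sharing enabled, all three clones end up looking at the last time-step's input, so any gradient computed from the saved input is wrong for every step but the last.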
Now, batch normalization converges (slightly) faster than the baseline model. Unfortunately there’s still a wall clock slowdown, so it’s not a panacea, but I think given the right combination of (complexity of model, dataset size, batch size) it’ll be a useful enough optimization to put it in master. I also think that layer normalization will perform better, so I’ll test that and see if I can make a good set of benchmarks to compare them.
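As a rough sketch of why layer normalization might be simpler here (plain Python, not the Torch implementation; `layer_norm` is a hypothetical helper, and the scalar gain and bias are a simplification of the usual per-unit parameters): the statistics are computed over the hidden units of a single example, so nothing depends on the batch or on stored per-time-step population statistics.

```python
import math


def layer_norm(features, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize one example's hidden units to zero mean and unit variance.

    Simplified illustration: scalar gain and bias instead of per-unit ones.
    """
    mean = sum(features) / len(features)
    var = sum((x - mean) ** 2 for x in features) / len(features)
    scale = gain / math.sqrt(var + eps)
    return [scale * (x - mean) + bias for x in features]


# The same function applies unchanged at every time-step and at eval time.
hidden = [1.0, 2.0, 3.0]
normalized = layer_norm(hidden)
```

Because the output depends only on the example itself, training and evaluation behave identically and no running statistics need to be saved with the model.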
I’ll make a PR (or two) tomorrow. Happy New Year and thanks!
Ah, actually there’s a sticking point: do we need this optimization here? https://github.com/OpenNMT/OpenNMT/blob/master/onmt/modules/Sequencer.lua#L108 (and on lines 120 and 130)
This is intended to reduce memory usage during evaluation when loading from a saved model, right? If this is intended to be kept around I’ll refactor LSTM to store the time-step dependent statistics in a packable format independent of cloning before making the PR.
Can you be more specific about why this is an issue? I wouldn’t call it an “optimization” as much as the reason eval time is much simpler than training.