Support other forms of dropout/regularization

(srush) #1

There have been several requests for other types of dropout and regularization: Bayesian dropout, batch normalization, etc.

(Rachit Singh) #2

I wrote an initial version of the batch normalization here:

Unfortunately, the jury still seems to be out on what works and what doesn’t. I think this is a better way to parametrize it, which I put here, but I’d like to verify experimentally that any of these actually provides a speedup.

(srush) #3

Send a PR with an example showing some preliminary results, and we’ll add it. Also, be sure to read the style guide.

(Rachit Singh) #4

Well, I tried it and it actually seems to prevent convergence, so I’m holding off on making a PR until I can get to the bottom of this. I ran OpenNMT with no modifications on the first 100k sentences of the English -> German dataset and got this:

After adding batch normalization I got this:

Also I incorporated the layer-normalization technique described here and got this:

I suspect the reason it’s not working well is that we need to store population statistics for each time step (i.e. the mean and variance over the entire dataset at the first time step, at the second, and so on), though none of the Torch implementations I found did this.

EDIT: the above is wrong. Because of the way the clones are set up, the network does accurately track statistics per time step and shares the weight and bias correctly. So I’ll keep thinking about why it’s not working correctly.
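To make the per-time-step idea concrete, here is a minimal plain-Python sketch of batch normalization that shares the learned scale and shift across time steps but keeps a separate running mean and variance for each step. It is only an illustration of the concept, not the Torch/OpenNMT code discussed in this thread; the class name PerStepBatchNorm and the parameters max_steps, momentum and eps are assumptions made up for the example.

```python
# Illustrative sketch only: PerStepBatchNorm and its parameters are
# hypothetical names, not part of Torch or OpenNMT.

class PerStepBatchNorm:
    def __init__(self, num_features, max_steps, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        # Learned scale and shift are shared by every time step.
        self.gamma = [1.0] * num_features
        self.beta = [0.0] * num_features
        # Running (population) statistics are stored separately per time step.
        self.running_mean = [[0.0] * num_features for _ in range(max_steps)]
        self.running_var = [[1.0] * num_features for _ in range(max_steps)]

    def forward(self, batch, step, training=True):
        """batch is a list of rows; each row is a list of num_features floats."""
        n = len(batch)
        k = len(self.gamma)
        if training:
            # Batch statistics for this time step only.
            mean = [sum(row[j] for row in batch) / n for j in range(k)]
            var = [sum((row[j] - mean[j]) ** 2 for row in batch) / n
                   for j in range(k)]
            m = self.momentum
            # Update only the running statistics that belong to this step.
            self.running_mean[step] = [
                (1 - m) * r + m * b
                for r, b in zip(self.running_mean[step], mean)]
            self.running_var[step] = [
                (1 - m) * r + m * b
                for r, b in zip(self.running_var[step], var)]
        else:
            # At evaluation time, use the stored statistics for this step.
            mean = self.running_mean[step]
            var = self.running_var[step]
        return [
            [self.gamma[j] * (row[j] - mean[j]) / (var[j] + self.eps) ** 0.5
             + self.beta[j] for j in range(k)]
            for row in batch]
```

Calling forward(batch, step=0) during training updates only the statistics stored for step 0; the entries for other steps stay at their initial values until those steps are run.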

(Guillaume Klein) #5

You should disable memory optimization during your experiments when you add new modules: -disable_mem_optimization.

Otherwise, you should protect nn.BatchNormalization's input from being shared across clones. See:
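Independently of the linked code, the general hazard can be shown with a toy example. The plain-Python sketch below is not how OpenNMT's memory optimization is implemented; the Square module and the run helper are hypothetical. It only shows what goes wrong when a module that needs its input during the backward pass has that input buffer reused by a later clone.

```python
# Illustrative sketch only: Square and run are hypothetical names.

class Square:
    """Toy module computing y = x^2, so backward needs the original input."""

    def forward(self, x):
        self.saved_input = x  # keeps a reference to the buffer, not a copy
        return [v * v for v in x]

    def backward(self, grad_out):
        # dy/dx = 2x, evaluated on whatever the saved buffer holds *now*.
        return [2.0 * v * g for v, g in zip(self.saved_input, grad_out)]


def run(share_buffer):
    clones = [Square(), Square()]      # one clone per time step
    inputs = [[1.0, 2.0], [3.0, 4.0]]  # input for step 0 and step 1
    shared = [0.0, 0.0]                # a single buffer reused by all steps
    for clone, x in zip(clones, inputs):
        if share_buffer:
            shared[:] = x              # overwrite the shared buffer in place
            clone.forward(shared)
        else:
            clone.forward(list(x))     # each clone keeps its own copy
    # Gradient for step 0, computed after step 1 has already run forward.
    return clones[0].backward([1.0, 1.0])


protected = run(share_buffer=False)  # uses step 0's own input
corrupted = run(share_buffer=True)   # silently uses step 1's input
```

With the protected input, the step-0 gradient is computed from step 0's input; with the shared buffer, it is computed from step 1's input without any error being raised, which is the kind of silent corruption that can prevent convergence.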

(Rachit Singh) #6

Perfect! I didn’t see MemoryOptimizer (OpenNMT has changed a lot from seq2seq-attn!) and this was definitely the issue.

Now batch normalization converges (slightly) faster than the baseline model. Unfortunately, there’s still a wall-clock slowdown, so it’s not a panacea, but I think that given the right combination of model complexity, dataset size, and batch size, it’ll be a useful enough optimization to put in master. I also think that layer normalization will perform better, so I’ll test that and see if I can put together a good set of benchmarks to compare them.
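For reference, the difference between the two techniques comes down to which axis the statistics are computed over. The following plain-Python sketch is a simplified illustration (learned gain and bias are omitted, and the function names are my own): batch normalization normalizes each feature across the examples in the batch, while layer normalization normalizes each example across its own features.

```python
# Illustrative sketch only: function names are hypothetical, and the learned
# gain and bias used in practice are left out to keep the comparison short.

def _normalize(values, eps=1e-5):
    """Shift and scale a list of floats to zero mean and (near) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def batch_norm(batch, eps=1e-5):
    """Normalize each feature (column) using statistics over the batch."""
    columns = [_normalize(list(col), eps) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]


def layer_norm(batch, eps=1e-5):
    """Normalize each example (row) using statistics over its own features."""
    return [_normalize(row, eps) for row in batch]
```

Because layer_norm uses only the features of a single example, its output for that example does not depend on the rest of the batch, and no running statistics need to be stored per time step.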

I’ll make a PR (or two) tomorrow. Happy New Year and thanks!

(Rachit Singh) #7

Ah, actually there’s a sticking point - do we need this optimization here? (and on lines 120+130)

This is intended to reduce memory usage during evaluation when loading from a saved model, right? If it is meant to stay, I’ll refactor LSTM to store the time-step-dependent statistics in a packable format independent of cloning before making the PR.

(srush) #8

Can you be more specific about why this is an issue? I wouldn’t call it an “optimization” as much as the reason eval time is much simpler than training.

(jean.senellart) #9

Hi @rachtsingh, I am interested in looking at layer norm/batch norm. What is the status on your side? Can I help move it further?