Support other forms of dropout/regularization

srush · December 23, 2016, 2:02pm

We have had several requests for other types of dropout and regularization: Bayesian dropout, batch normalization, etc.

rachtsingh · December 28, 2016, 9:41pm

I wrote an initial version of batch normalization here: https://github.com/rachtsingh/OpenNMT

Unfortunately, the jury still seems to be out on what works and what doesn’t. I think this is a better way to parametrize it, which I put here, but I’d like to verify experimentally that any of these variants actually provides a speedup.

srush · December 28, 2016, 10:10pm

Send a PR with an example showing some preliminary results, and we’ll add it. Also, be sure to read the style guide.

rachtsingh · December 30, 2016, 7:47pm

Well, I tried it and it actually seems to prevent convergence, so I’m holding off on opening a PR until I can get to the bottom of this. I ran unmodified OpenNMT on the first 100k sentences of the English -> German dataset and got this:

https://gist.github.com/rachtsingh/e97b4be011f4b86c47956848725e8095

baseline.log

Loading data from 'data/small-train.t7'...	
 * vocabulary size: source = 50004; target = 50004	
 * additional features: source = 0; target = 0	
 * maximum sequence length: source = 50; target = 51	
 * number of training sentences: 100000	
 * maximum batch size: 64	
Building model...	
 * using input feeding	
Initializing parameters...	
 * number of parameters: 84814004

(Log truncated; see the gist for the full output.)

After adding batch normalization I got this:

https://gist.github.com/rachtsingh/d856ccdf71f8885b7ea535d820c9d7ef

batch_norm.log

Loading data from 'data/small-train.t7'...	
 * vocabulary size: source = 50004; target = 50004	
 * additional features: source = 0; target = 0	
 * maximum sequence length: source = 50; target = 51	
 * number of training sentences: 100000	
 * maximum batch size: 64	
Building model...	
 * using input feeding	
Initializing parameters...	
 * number of parameters: 84834004

(Log truncated; see the gist for the full output.)

I also incorporated the layer normalization technique described here and got this:

https://gist.github.com/rachtsingh/1c372a62f420fa972e8fed2a50c2da7b

layer_norm.log

Loading data from 'data/small-train.t7'...	
 * vocabulary size: source = 50004; target = 50004	
 * additional features: source = 0; target = 0	
 * maximum sequence length: source = 50; target = 51	
 * number of training sentences: 100000	
 * maximum batch size: 64	
Building model...	
 * using input feeding	
Initializing parameters...	
 * number of parameters: 84818004

(Log truncated; see the gist for the full output.)

I suspect it’s not working well because we need to store population statistics for each time step (i.e. the mean and variance over the entire dataset at the first time step, at the second, etc.), though none of the Torch implementations I found do this.

EDIT: the above is wrong. Because of the way the clones are set up, the network does accurately track statistics per-time-step and shares the weight and bias correctly. So I’ll keep thinking about why it’s not working correctly.
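
To make the terms concrete, here is a minimal pure-Python sketch of the two normalizations and of per-time-step running statistics. It is an illustration only, not the Torch code from my fork: `batch_norm`, `layer_norm`, and `PerStepStats` are hypothetical names, the learned weight and bias are left out, and the running statistics are seeded from the first batch for simplicity.

```python
import math

EPS = 1e-5  # small constant to avoid division by zero


def batch_norm(batch):
    """Normalize each feature across the batch (rows are examples)."""
    n, dim = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [sum((row[j] - means[j]) ** 2 for row in batch) / n
                 for j in range(dim)]
    out = [[(row[j] - means[j]) / math.sqrt(variances[j] + EPS)
            for j in range(dim)] for row in batch]
    return out, means, variances


def layer_norm(batch):
    """Normalize each example across its own features; no batch statistics."""
    out = []
    for row in batch:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + EPS) for v in row])
    return out


class PerStepStats:
    """Hypothetical helper: one running mean/variance pair per time step."""

    def __init__(self, momentum=0.1):
        self.momentum = momentum
        self.running = {}  # time step -> (means, variances)

    def update(self, t, means, variances):
        if t not in self.running:
            # Simplification: seed with the first batch seen at this step.
            self.running[t] = (list(means), list(variances))
            return
        m = self.momentum
        old_means, old_vars = self.running[t]
        self.running[t] = (
            [(1 - m) * a + m * b for a, b in zip(old_means, means)],
            [(1 - m) * a + m * b for a, b in zip(old_vars, variances)],
        )


# Two time steps, each a batch of 2 examples with 2 features.
steps = [
    [[1.0, 2.0], [3.0, 6.0]],
    [[0.0, 4.0], [2.0, 8.0]],
]
stats = PerStepStats()
for t, batch in enumerate(steps):
    _, means, variances = batch_norm(batch)
    stats.update(t, means, variances)
```

Batch normalization needs these tracked statistics at evaluation time, while layer normalization computes everything from the current example and so has nothing to track per time step.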

guillaumekln · December 31, 2016, 12:31pm

You should disable memory optimization during your experiments when you add new modules: -disable_mem_optimization.

Otherwise, you should protect nn.BatchNormalization's input from being shared across clones. See:

https://github.com/OpenNMT/OpenNMT/blob/master/onmt/utils/MemoryOptimizer.lua#L12


--]]
local MemoryOptimizer = torch.class('MemoryOptimizer')


-- We cannot share every internal tensors (that is why we need to replicate in the first place).
-- The general rule is to not share tensors whose content is used in the backward pass
-- We allow the sharing when only the size is queried as it is constant during the
-- forward and backward passes.


-- We cannot share the output of these modules as they use it in their backward pass.
local protectOutput = {
'nn.Sigmoid',
'nn.SoftMax',
'nn.Tanh'
}


-- We cannot share the input of these modules as they use it in their backward pass.
local protectInput = {
'nn.Linear',
'nn.JoinTable',
'nn.CMulTable',
'nn.MM'
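
The reason for protecting inputs can be shown with a toy example. This is a pure-Python sketch, not OpenNMT's MemoryOptimizer: the `Scale` class and `run` function are hypothetical, and a one-element list stands in for a tensor. A module that reads its forward input during the backward pass gets a wrong gradient if a later time step overwrites a shared input buffer.

```python
class Scale:
    """Toy module y = w * x whose backward pass needs the forward input x."""

    def __init__(self, w, input_buffer):
        self.w = w
        self.input = input_buffer  # one-element list standing in for a tensor

    def forward(self, x):
        self.input[0] = x  # keep the input for the backward pass
        return self.w * x

    def grad_w(self, grad_output):
        # dL/dw = grad_output * x, so a stale x gives a wrong gradient.
        return grad_output * self.input[0]


def run(shared):
    """Unroll two time steps, then backpropagate through step 0."""
    buf = [0.0]
    step0 = Scale(2.0, buf)
    step1 = Scale(2.0, buf if shared else [0.0])
    step0.forward(3.0)
    step1.forward(5.0)  # overwrites step0's stored input if the buffer is shared
    return step0.grad_w(1.0)


print(run(shared=False))  # 3.0, the correct gradient for step 0
print(run(shared=True))   # 5.0, computed from step 1's input
```

A normalization layer is in the same position, since its backward pass also depends on the input it saw during the forward pass.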

rachtsingh · January 1, 2017, 6:23am

Perfect! I didn’t see MemoryOptimizer (OpenNMT has changed a lot from seq2seq-attn!) and this was definitely the issue.

Now, batch normalization converges (slightly) faster than the baseline model. Unfortunately, there’s still a wall-clock slowdown, so it’s not a panacea, but I think that given the right combination of model complexity, dataset size, and batch size, it’ll be a useful enough optimization to put in master. I also think that layer normalization will perform better, so I’ll test that and see if I can put together a good set of benchmarks to compare them.

I’ll make a PR (or two) tomorrow. Happy New Year, and thanks!

rachtsingh · January 1, 2017, 10:36am

Ah, actually there’s a sticking point: do we need this optimization here? https://github.com/OpenNMT/OpenNMT/blob/master/onmt/modules/Sequencer.lua#L108 (and on lines 120 and 130)

This is intended to reduce memory usage during evaluation when loading from a saved model, right? If this is intended to be kept around, I’ll refactor LSTM to store the time-step-dependent statistics in a packable format independent of cloning before making the PR.

srush · January 1, 2017, 10:08pm

Can you be more specific about why this is an issue? I wouldn’t call it an “optimization” so much as the reason evaluation is much simpler than training.

jean.senellart · May 3, 2017, 3:33pm

Hi @rachtsingh, I am interested in looking at layer norm/batch norm. What is the status on your side? Can I help move it forward?