As a matter of fact, it would be good to implement a kind of “sampled softmax” or NCE-like loss.
If we don’t lose too much in accuracy, it would definitely make training much quicker.
I don’t think it will have much effect on GPU; it depends on the demand for CPU training.
Oh really? On TF, it makes a huge difference.
https://www.tensorflow.org/tutorials/seq2seq/#sampled_softmax_and_output_projection
That blog post doesn’t have any numeric speed comparisons that I could see. My guess is that for medium-vocab/large models on GPU, we’re bound by LSTM computation, not softmax.
However, I agree that we should implement something, potentially the technique described in Jean et al., 2014. It is a nice feature to have.
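For concreteness, here is a minimal PyTorch-style sketch of the sampled-softmax idea (score only the gold words plus a random subset of the vocabulary). All names below are illustrative, not actual OpenNMT code, and the importance-weight correction from Jean et al. is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(hidden, weight, bias, targets, num_sampled=512):
    """Hedged sketch of a sampled softmax (illustrative names only).

    hidden:  (batch, dim)   decoder outputs
    weight:  (vocab, dim)   full output embedding matrix
    bias:    (vocab,)       full output bias
    targets: (batch,)       gold word indices
    Scores are computed only over the gold words plus `num_sampled`
    uniformly drawn negatives instead of the full vocabulary.
    """
    vocab_size = weight.size(0)
    # candidate set = gold targets of the batch + random negatives
    negatives = torch.randint(vocab_size, (num_sampled,), device=hidden.device)
    candidates = torch.cat([targets, negatives]).unique()
    # logits restricted to the candidate set
    logits = hidden @ weight[candidates].t() + bias[candidates]
    # remap gold indices into positions within the candidate set
    remap = torch.full((vocab_size,), -1, dtype=torch.long, device=hidden.device)
    remap[candidates] = torch.arange(candidates.numel(), device=hidden.device)
    return F.cross_entropy(logits, remap[targets])
```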
Additional exchanges from Gitter:
btw @vince62s there is now an NCE module in DPNN: https://github.com/Element-Research/dpnn#ncemodule
we could in theory fork that code…
@srush w.r.t. NCE perf, we did not test the seq2seq setup from the link in the forum post; however, we tested the same thing for an LSTM RNNLM. With a vocab size of 200k there is a factor of x5 to x8 in speed for a medium-size model (I think 2x850) when using a sampled softmax vs. the full softmax. At 50k vocab it might be a little less, but I think there is still a significant gain to be had.
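For reference, the core of the NCE objective (which dpnn’s NCEModule implements on the Torch side) looks roughly like this. This is a hedged PyTorch-style sketch with illustrative names, not the module’s actual API:

```python
import torch
import torch.nn.functional as F

def nce_loss(hidden, weight, bias, targets, noise_probs, k=25):
    """Hedged sketch of the NCE objective (illustrative names only).

    hidden:      (batch, dim)  decoder states
    weight:      (vocab, dim)  output embedding matrix
    bias:        (vocab,)      output bias
    targets:     (batch,)      gold word indices
    noise_probs: (vocab,)      unigram noise distribution (sums to 1)
    k:           number of noise samples per target
    """
    batch, dim = hidden.size()
    noise = torch.multinomial(noise_probs, batch * k, replacement=True).view(batch, k)

    # unnormalized model scores for gold words and noise words
    s_pos = (hidden * weight[targets]).sum(-1) + bias[targets]                # (batch,)
    s_neg = torch.einsum('bd,bkd->bk', hidden, weight[noise]) + bias[noise]   # (batch, k)

    # NCE turns density estimation into binary classification:
    # logit = s(w) - log(k * p_noise(w))
    log_kq_pos = (k * noise_probs[targets]).log()
    log_kq_neg = (k * noise_probs[noise]).log()
    loss_pos = F.binary_cross_entropy_with_logits(s_pos - log_kq_pos, torch.ones_like(s_pos))
    loss_neg = F.binary_cross_entropy_with_logits(s_neg - log_kq_neg, torch.zeros_like(s_neg))
    return loss_pos + k * loss_neg
```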
For the speed of softmax on very large vocabularies, cuDNN will bring us a huge improvement; see https://github.com/soumith/cudnn.torch/issues/314#issuecomment-273891161
[…] it would be great if we could match these numbers https://code.facebook.com/posts/1827693967466780/building-an-efficient-neural-language-model-over-a-billion-words/
Actually, that paper even has simpler code (https://github.com/facebookresearch/adaptive-softmax).
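As a side note, PyTorch now ships a built-in adaptive softmax (nn.AdaptiveLogSoftmaxWithLoss) following the same Grave et al. paper as that repo, which shows the intended usage pattern. The sizes and cutoffs below are illustrative guesses, not tuned values:

```python
import torch
import torch.nn as nn

# Illustrative usage of an adaptive softmax output layer.
vocab, dim = 100000, 1000            # roughly the 4x1000 / 100K setting discussed here
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=dim,
    n_classes=vocab,
    cutoffs=[2000, 10000],           # frequent words in the head, the rest in tail clusters
)

hidden = torch.randn(64, dim)        # a batch of decoder states
targets = torch.randint(vocab, (64,))
out = adaptive(hidden, targets)      # returns a (output, loss) named tuple
loss = out.loss
```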
A first implementation (not totally clean) gives the following performance results for the 4x1000, 100K-vocabulary model of this page.
```
train: {
  total: 1914.37,
  encoder: { total: 364.017, bwd: 214.084, fwd: 149.853 },
  decoder: {
    total: 1481.14,
    fwd: 388.34,
    bwd: {
      total: 1092.46,
      generator: { total: 285.313, fwd: 168.433, bwd: 115.045 },
      criterion: { total: 79.9086, bwd: 6.55721, fwd: 71.7026 }
    }
  }
}
valid: 11.1846
```
which means a speedup of x4.5 on the generator and an end-to-end speedup of x1.6.
I think maybe we could try something like max-margin and just remove the SoftMax and ClassNLL criterion, if it does not hurt performance.
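If we wanted to prototype that, a hedged sketch could be as simple as a multi-class hinge loss on the raw generator scores (PyTorch-style, illustrative sizes). Note it still computes the full vocabulary projection; it only skips the exp/normalization step:

```python
import torch
import torch.nn as nn

vocab, dim, batch = 50000, 1000, 64
generator = nn.Linear(dim, vocab)        # output projection, no softmax on top

hidden = torch.randn(batch, dim)         # decoder states
targets = torch.randint(vocab, (batch,))

scores = generator(hidden)               # raw, unnormalized scores
margin_loss = nn.MultiMarginLoss(margin=1.0)   # multi-class hinge (max-margin) loss
loss = margin_loss(scores, targets)
```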
For reference, while looking at the PyTorch forum…
What do you think about “Breaking the Softmax Bottleneck: A High-Rank RNN Language Model” (https://arxiv.org/pdf/1711.03953.pdf)?
The researchers added discrete latent variables and replaced the softmax with a Mixture of Softmaxes (MoS, https://github.com/zihangdai/mos).
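A minimal sketch of what the MoS output layer does (the real implementation is in the repo above; names and hyper-parameters here are illustrative only):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    """Hedged sketch of the MoS output layer from Yang et al., 2017."""

    def __init__(self, dim, vocab, n_experts=10):
        super().__init__()
        self.n_experts = n_experts
        self.prior = nn.Linear(dim, n_experts)            # mixture weights pi_k(h)
        self.latent = nn.Linear(dim, n_experts * dim)     # one context vector per expert
        self.decoder = nn.Linear(dim, vocab)              # shared output projection

    def forward(self, hidden):                            # hidden: (batch, dim)
        batch, dim = hidden.size()
        pi = F.softmax(self.prior(hidden), dim=-1)        # (batch, n_experts)
        h = torch.tanh(self.latent(hidden)).view(batch * self.n_experts, dim)
        # one softmax per expert, then mix the probabilities (not the logits),
        # which is what lifts the rank restriction of a single softmax
        probs = F.softmax(self.decoder(h), dim=-1).view(batch, self.n_experts, -1)
        mixed = (pi.unsqueeze(-1) * probs).sum(1)         # (batch, vocab)
        return mixed.clamp_min(1e-8).log()                # log-probs, usable with NLLLoss
```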