Improving performance by replacing Softmax

It would be good to implement a kind of “sampled softmax” or NCE-like loss. If we don’t lose too much accuracy, it would definitely make training much quicker.

I don’t think it will have much effect on GPU. It depends on the demand for CPU training.

Oh really? On TF, it makes a huge difference.
https://www.tensorflow.org/tutorials/seq2seq/#sampled_softmax_and_output_projection

That blog post doesn’t have any numeric speed comparisons that I could see. My guess is that for medium-vocab/large models on GPU we’re bound by the LSTM computation, not the softmax.

However, I agree that we should implement something, potentially the technique described in Jean et al., 2014. It is a nice feature to have.
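
A minimal sketch of what such a sampled softmax could look like in PyTorch. This is a simplified uniform-sampling variant (Jean et al. actually use importance sampling over vocabulary partitions), and the class name, sample size and initialization are illustrative assumptions, not OpenNMT code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SampledSoftmaxLoss(nn.Module):
    # Approximate the softmax cross-entropy by scoring only the true targets
    # plus a small set of sampled negative classes; hypothetical sketch.
    def __init__(self, hidden_size, vocab_size, n_sampled=512):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, hidden_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(vocab_size))
        self.vocab_size = vocab_size
        self.n_sampled = n_sampled

    def forward(self, hidden, target):
        # hidden: (batch, hidden_size), target: (batch,)
        neg = torch.randint(0, self.vocab_size, (self.n_sampled,),
                            device=hidden.device)
        classes = torch.cat([target, neg])                    # candidates: targets + negatives
        logits = hidden @ self.weight[classes].t() + self.bias[classes]
        # row i's correct class sits at column i (its own target)
        labels = torch.arange(target.size(0), device=hidden.device)
        return F.cross_entropy(logits, labels)

    def full_log_prob(self, hidden):
        # exact softmax, for use at translation/inference time
        return F.log_softmax(hidden @ self.weight.t() + self.bias, dim=-1)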

Additional exchanges from Gitter:

@srush:

btw @vince62s there is now an NCE module in dpnn: https://github.com/Element-Research/dpnn#ncemodule
we could in theory fork that code…

@vince62s:

@srush wrt NCE perf, we did not test the seq2seq from the link in the forum post; however, we tested the same thing for an LSTM RNNLM. With a vocab size of 200k there is a factor of x5 to x8 in speed for a medium-size model (I think 2x850) when using a sampled softmax vs the full softmax. At 50k vocab it might be a little less, but I think there is still a significant gain to get.
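
For reference, a minimal sketch of an NCE output layer in the spirit of dpnn’s NCEModule, written in PyTorch. The class name, the unigram noise distribution and the value of k are assumptions for illustration, not a port of that module:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NCELoss(nn.Module):
    # Noise-contrastive estimation: train a binary classifier to tell the
    # true target word from k noise samples, avoiding the full softmax.
    def __init__(self, hidden_size, vocab_size, noise_probs, k=25):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, hidden_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(vocab_size))
        self.register_buffer("noise_probs", noise_probs)  # unigram distribution, shape (vocab_size,)
        self.k = k

    def score(self, hidden, classes):
        # unnormalised scores s(w, h) for the given classes only
        return (hidden.unsqueeze(1) * self.weight[classes]).sum(-1) + self.bias[classes]

    def forward(self, hidden, target):
        # hidden: (batch, hidden_size), target: (batch,)
        batch = target.size(0)
        noise = torch.multinomial(self.noise_probs, batch * self.k,
                                  replacement=True).view(batch, self.k)
        log_kpn_target = math.log(self.k) + self.noise_probs[target].log()  # (batch,)
        log_kpn_noise = math.log(self.k) + self.noise_probs[noise].log()    # (batch, k)
        s_target = self.score(hidden, target.unsqueeze(1)).squeeze(1)       # (batch,)
        s_noise = self.score(hidden, noise)                                 # (batch, k)
        # binary "data vs noise" classification on the logit s(w,h) - log(k * P_noise(w))
        loss = -(F.logsigmoid(s_target - log_kpn_target)
                 + F.logsigmoid(-(s_noise - log_kpn_noise)).sum(1))
        return loss.mean()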

@jean.senellart:

For the speed of Softmax on a very large vocabulary, cuDNN will bring us a huge improvement - see https://github.com/soumith/cudnn.torch/issues/314#issuecomment-273891161

@srush:

[…] it would be great if we could match these numbers https://code.facebook.com/posts/1827693967466780/building-an-efficient-neural-language-model-over-a-billion-words/
actually that paper even has simpler code (https://github.com/facebookresearch/adaptive-softmax)
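
The adaptive softmax from that repo is also available in PyTorch as nn.AdaptiveLogSoftmaxWithLoss; a minimal usage sketch, with made-up sizes and cutoffs:

import torch
import torch.nn as nn

hidden_size, vocab_size = 1000, 100_000        # illustrative sizes
# cutoffs split the vocabulary into a frequent head and rarer tail clusters
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_size,
    n_classes=vocab_size,
    cutoffs=[2_000, 10_000, 50_000],
    div_value=4.0,
)

hidden = torch.randn(32, hidden_size)          # decoder output for a batch
target = torch.randint(0, vocab_size, (32,))
out = adaptive(hidden, target)
loss = out.loss                                # mean NLL over the batch
log_probs = adaptive.log_prob(hidden)          # full (32, vocab_size) log-probs for decoding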

A first implementation (not totally clean) gives the following performance results for the 4x1000, 100K-vocabulary model of this page.

train:{total:1914.37,
       encoder:{total:364.017,bwd:214.084,fwd:149.853},
       decoder:{total:1481.14,
                fwd:388.34,
                bwd:{total:1092.46,
                     generator:{total:285.313,fwd:168.433,bwd:115.045},
                     criterion:{total:79.9086,bwd:6.55721,fwd:71.7026}}}},
valid:11.1846

which means a x4.5 speedup on the generator and an end-to-end speedup of x1.6.

I think we could maybe try something like a max-margin loss and just remove the SoftMax + ClassNLL criterion, if it does not hurt performance.
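
A minimal sketch of that idea in PyTorch, using the built-in multi-class hinge loss on raw generator scores instead of SoftMax + ClassNLL (the sizes are illustrative):

import torch
import torch.nn as nn

hidden_size, vocab_size = 1000, 100_000        # illustrative sizes
generator = nn.Linear(hidden_size, vocab_size)

# multi-class hinge loss works on raw scores, so no SoftMax/ClassNLL is needed
criterion = nn.MultiMarginLoss(margin=1.0)

hidden = torch.randn(32, hidden_size)          # decoder outputs
target = torch.randint(0, vocab_size, (32,))
scores = generator(hidden)                     # (32, vocab_size) raw scores
loss = criterion(scores, target)
loss.backward()

Note that this only removes the normalization and log; the full (batch x vocab) generator matrix multiply is still computed, so the gain would come from the cheaper loss, not from a smaller output layer.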

For reference, something found while looking at the PyTorch forum…

What do you think about “Breaking the Softmax Bottleneck: A High-Rank RNN Language Model” https://arxiv.org/pdf/1711.03953.pdf ?
The researchers added discrete latent variables and replaced the softmax with a Mixture of Softmaxes (MoS, https://github.com/zihangdai/mos).
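
A minimal sketch of such a Mixture-of-Softmaxes output layer in PyTorch; the class name and the number of mixture components are illustrative assumptions, not the authors’ code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxes(nn.Module):
    # Output layer computing a weighted mixture of K softmaxes, so the
    # log-probability matrix is no longer limited to rank hidden_size.
    def __init__(self, hidden_size, vocab_size, n_components=5):
        super().__init__()
        self.n_components = n_components
        self.prior = nn.Linear(hidden_size, n_components)               # mixture weights
        self.latent = nn.Linear(hidden_size, n_components * hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)               # shared output projection

    def forward(self, hidden):
        # hidden: (batch, hidden_size) -> log-probs: (batch, vocab_size)
        batch = hidden.size(0)
        pi = F.softmax(self.prior(hidden), dim=-1)                      # (batch, K)
        h = torch.tanh(self.latent(hidden)).view(batch, self.n_components, -1)
        probs = F.softmax(self.decoder(h), dim=-1)                      # (batch, K, vocab)
        mixed = torch.bmm(pi.unsqueeze(1), probs).squeeze(1)            # (batch, vocab)
        return torch.log(mixed + 1e-8)                                  # feed to nn.NLLLoss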