Adagrad with learning rate 0.1 performs much worse than SGD with learning rate 1

vincent · April 3, 2018, 11:38am

Hi,

In my quest for the best possible results, I get a difference of about 40 BLEU points between Adagrad with learning rate 0.1 and the default settings (SGD with learning rate 1). Something is wrong with the Adagrad run, I guess, but what?

I train with these options:
th train.lua -data /home/wiske/tmmt/ennl/nmt/rnn/ennl-train.t7 -save_model /home/wiske/tmmt/ennl/nmt/brnn/rnnsize_750/en2nl.adagrad -optim adagrad -learning_rate 0.1 -encoder_type brnn -rnn_size 750 -gpuid 2 -end_epoch 20

Any hints?

guillaumekln · April 5, 2018, 8:33am

Hello,

I don’t think Adagrad is commonly used to train this type of model, and I’m not sure the recommended learning rate is any good. How does the perplexity look?

Instead, you should try Adam with the recommended learning rate of 0.0002.
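
As a rough sketch, assuming every other option from your original command stays the same, the Adam run could look something like this. Only `-optim`, `-learning_rate` and the save path change; the `.adam` suffix on the model name is just an illustrative choice, not something you have to use.

```shell
# Same data, encoder, size, GPU and epoch settings as the Adagrad run above.
# Changed: -optim adam, -learning_rate 0.0002, and a hypothetical ".adam"
# suffix on -save_model so the Adagrad checkpoints are not overwritten.
th train.lua \
  -data /home/wiske/tmmt/ennl/nmt/rnn/ennl-train.t7 \
  -save_model /home/wiske/tmmt/ennl/nmt/brnn/rnnsize_750/en2nl.adam \
  -optim adam -learning_rate 0.0002 \
  -encoder_type brnn -rnn_size 750 -gpuid 2 -end_epoch 20
```

Comparing the per-epoch validation perplexity of this run against the SGD baseline should show fairly quickly whether the optimizer settings are the problem.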

vincent · April 5, 2018, 9:22am

OK, I’ll try that. Thanks.

vincent · April 6, 2018, 9:54am

Perplexity after 20 epochs is still 12.59.