Adagrad with learning rate 0.1 performs much worse than SGD with learning rate 1


In my quest for the best possible results, I see a difference of about 40 BLEU points between Adagrad with learning rate 0.1 and the default settings (SGD with learning rate 1). Something is wrong with Adagrad, I guess, but what?

I train with these options:
th train.lua -data /home/wiske/tmmt/ennl/nmt/rnn/ennl-train.t7 -save_model /home/wiske/tmmt/ennl/nmt/brnn/rnnsize_750/en2nl.adagrad -optim adagrad -learning_rate 0.1 -encoder_type brnn -rnn_size 750 -gpuid 2 -end_epoch 20

Any hints?


I don’t think Adagrad is commonly used to train this type of model, and I’m not sure the recommended learning rate is any good. How does the perplexity look?
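For context (this is the standard Adagrad update, not something stated in the thread): Adagrad divides the learning rate by the square root of a running sum of squared gradients. That sum only ever grows, so the effective step size shrinks monotonically, which can stall training long before a model like this converges. A minimal sketch:

```python
import math

def adagrad_step_sizes(grads, lr=0.1, eps=1e-8):
    """Return the effective step magnitude |lr * g / sqrt(accum)| at each
    step for a stream of gradients under Adagrad's accumulator."""
    accum = 0.0
    steps = []
    for g in grads:
        accum += g * g  # squared-gradient sum never decreases
        steps.append(lr * abs(g) / (math.sqrt(accum) + eps))
    return steps

# With a constant-magnitude gradient, the step size decays like 1/sqrt(t):
steps = adagrad_step_sizes([1.0] * 100)
print(steps[0], steps[99])  # first step ~0.1, hundredth step ~0.01
```

So even a "reasonable" initial learning rate of 0.1 is an order of magnitude smaller after 100 updates, and it keeps shrinking from there.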

Instead, you should try Adam with the recommended learning rate of 0.0002.
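A sketch of the corresponding command, assuming the same flags as the Adagrad run above with only the optimizer settings changed (the `-save_model` name is just an illustrative variant of the original path):

```shell
th train.lua -data /home/wiske/tmmt/ennl/nmt/rnn/ennl-train.t7 \
  -save_model /home/wiske/tmmt/ennl/nmt/brnn/rnnsize_750/en2nl.adam \
  -optim adam -learning_rate 0.0002 \
  -encoder_type brnn -rnn_size 750 -gpuid 2 -end_epoch 20
```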

OK, I’ll try that. Thanks.

Perplexity after 20 epochs is still 12.59 :unamused: