Learning rate reduced to 0 after start_decay_steps values

I am trying to train a model on a 50k dataset.

This is the training command:

python train.py -data data/demo -save_model demo-model -batch_size 256 -train_steps 20000 -report_every 500 -save_checkpoint_steps 5000 -optim adam -learning_rate 0.001 -learning_rate_decay 0.0002 -start_decay_steps 10000 -decay_steps 2000 -world_size 1 -gpu_ranks 0

Surprisingly, after training step 9500 the lr value becomes 0. Any idea why this happens?

[2018-11-19 14:16:17,246 INFO] Step 9000/20000; acc: 94.06; ppl: 1.22; xent: 0.20; lr: 0.00100; 4383/4090 tok/s; 11191 sec
[2018-11-19 14:16:37,362 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 47012
[2018-11-19 14:20:25,384 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 47012
[2018-11-19 14:24:13,719 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 47012
[2018-11-19 14:26:34,727 INFO] Step 9500/20000; acc: 91.82; ppl: 1.38; xent: 0.32; lr: 0.00100; 5093/4553 tok/s; 11809 sec
[2018-11-19 14:28:01,937 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 47012
[2018-11-19 14:31:50,261 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 47012
[2018-11-19 14:35:38,673 INFO] Loading train dataset from data/demo.train.0.pt, number of examples: 47012
[2018-11-19 14:36:53,552 INFO] Step 10000/20000; acc: 95.38; ppl: 1.18; xent: 0.16; lr: 0.00000; 5507/5266 tok/s; 12428 sec
[2018-11-19 14:36:53,723 INFO] Loading valid dataset from data/demo.valid.0.pt, number of examples: 9411
[2018-11-19 14:37:29,796 INFO] Validation perplexity: 145.798
[2018-11-19 14:37:29,796 INFO] Validation accuracy: 51.0478
[2018-11-19 14:37:29,797 INFO] Saving checkpoint demo-model_step_10000.pt

Also, I believe "acc: 95.38" refers to the training accuracy, while "Validation accuracy: 51.0478" is the validation accuracy.

Why is there such a big difference between the two?

First, a 50k dataset is very small, so don't expect good results with it.
Then, if you do the maths you will see that 0.001 × 0.0002 gives a very small number (see the sketch below).
Don't decay if you are just testing and using Adam.
Your training accuracy is too high, which suggests your dataset is made of very similar data.
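To illustrate the arithmetic, here is a minimal sketch, assuming the scheduler multiplies the current learning rate by -learning_rate_decay once -start_decay_steps is reached (and again every -decay_steps after that); this is only an illustration of the math, not the actual OpenNMT-py scheduler code:

```python
# Hypothetical sketch of the decay arithmetic (assumed schedule, not OpenNMT-py source).
lr = 0.001       # -learning_rate
decay = 0.0002   # -learning_rate_decay

# First decay applied around step 10000 (-start_decay_steps):
lr *= decay
print(lr)            # ~2e-07
print(f"{lr:.5f}")   # 0.00000 -- which is what the training log reports
```

So the learning rate is not literally zero, but at roughly 2e-7 it is far too small to keep training with, and the log rounds it to 0.00000.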


Thanks @Vincent. Can you suggest any hints for choosing the optimization method?

Are there any other parameters that can help improve accuracy? I am experimenting on 50k, but I have up to 800k of data.

For your toy set just use:
python train.py -data data/demo -save_model demo-model -batch_size 256 -train_steps 20000 -report_every 500 -save_checkpoint_steps 5000 -optim adam -learning_rate 0.0002 -world_size 1 -gpu_ranks 0

With more data, look at the examples in the docs.