MFCC feature extraction for speech-to-text

Hi,
I have done preprocessing using MFCC features, but model training gets stuck in the early steps.
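For reference, the preprocessing follows the standard MFCC pipeline (pre-emphasis, framing, power spectrum, mel filterbank, DCT). A minimal NumPy-only sketch of that pipeline is below; the parameter values are illustrative defaults, not necessarily the exact ones I used:

```python
import numpy as np

def mfcc(signal, sr=16000, n_mfcc=13, frame_len=0.025, frame_step=0.010,
         n_fft=512, n_mels=26):
    """Compute MFCC features from a 1-D waveform. Illustrative parameters."""
    # Pre-emphasis to boost high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # Frame the signal and apply a Hamming window
    flen = int(round(frame_len * sr))
    fstep = int(round(frame_step * sr))
    n_frames = 1 + max(0, (len(sig) - flen) // fstep)
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = sig[idx] * np.hamming(flen)

    # Power spectrum of each frame
    pspec = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # Triangular mel filterbank
    def hz2mel(f): return 2595.0 * np.log10(1 + f / 700.0)
    def mel2hz(m): return 700.0 * (10 ** (m / 2595.0) - 1)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # Log mel energies, then DCT-II (unscaled) to decorrelate;
    # keep the first n_mfcc coefficients
    feat = np.log(pspec @ fbank.T + 1e-10)
    n = feat.shape[1]
    dct_basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                       * np.arange(n_mfcc)[:, None])
    return feat @ dct_basis.T
```

With a 1-second clip at 16 kHz and the defaults above, this yields a `(98, 13)` feature matrix (one 13-dimensional vector per 10 ms frame).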

python3 train.py -model_type audio -enc_rnn_size 2048 -dec_rnn_size 1024 -audio_enc_pooling 1,1,1,1,2,2,2,2 -dropout 0.1 -enc_layers 8 -dec_layers 6 -rnn_type LSTM -data data/speech/demo -save_model models -global_attention mlp -batch_size 8 -save_checkpoint 10000 -optim adam -max_grad_norm 100 -learning_rate 0.01 -learning_rate_decay 0.5 -decay_method rsqrt -train_steps 150000 -encoder_type brnn -decoder_type rnn -normalization tokens -bridge -window_size 0.025 -image_channel_size 3 -gpu_ranks 0 -world_size 1

Step 30000/150000; acc: 15.65; ppl: 268.82; xent: 5.59; lr: 0.00017;
Step 40000/150000; acc: 17.03; ppl: 253.23; xent: 5.53; lr: 0.00015;
Step 50000/150000; acc: 17.06; ppl: 229.16; xent: 5.43; lr: 0.00013;
Step 60000/150000; acc: 17.52; ppl: 225.95; xent: 5.42; lr: 0.00012;

Training accuracy is improving very slowly.
What can I do to make training converge faster?