Why is it hard for a bidirectional model to learn to output its input?

I reduced the NMTMedium embedding size to 128, and the tokens are individual characters (so fewer than 40 of them), but the model still doesn't learn to simply output its input unchanged, even after a few hours of training. I thought this would be fairly easy to learn (the input is up to 5 tokens long), but it isn't, so my understanding must be incorrect. What am I missing?

What is your training configuration?

The ready-made NMTMedium model with a 128-dimensional embedding:

optimizer: GradientDescentOptimizer
learning_rate: 1.0
param_init: 0.1
clip_gradients: 5.0
decay_type: exponential_decay
decay_rate: 0.7
decay_steps: 7000
start_decay_steps: 500
beam_width: 5
maximum_iterations: 250

batch_size: 32
bucket_width: 1
save_checkpoints_steps: 10000
save_summary_steps: 10000
train_steps: 100000
maximum_features_length: 40
maximum_labels_length: 40

Maybe there is not enough data to train an NMTMedium model, or the learning rate is decaying too fast.
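
To get a feel for how fast the learning rate decays, here is a rough sketch that evaluates a standard exponential decay schedule with the values from the configuration above. The function name `decayed_learning_rate` is made up for illustration, and the formula is an assumption about how these settings are applied (decay begins at `start_decay_steps` and multiplies the rate by `decay_rate` every `decay_steps`); the actual behavior, including whether the decay is applied in a staircase fashion, depends on the framework version.

```python
def decayed_learning_rate(step, initial_lr=1.0, decay_rate=0.7,
                          decay_steps=7000, start_decay_steps=500,
                          staircase=False):
    """Hypothetical helper: exponential decay starting at start_decay_steps.

    Defaults are copied from the posted configuration. The formula itself
    is an assumption, not taken from the framework's source code.
    """
    if step < start_decay_steps:
        # No decay has been applied yet.
        return initial_lr
    exponent = (step - start_decay_steps) / decay_steps
    if staircase:
        # Staircase variant: decay only at whole multiples of decay_steps.
        exponent = float(int(exponent))
    return initial_lr * decay_rate ** exponent


# Print the assumed learning rate at a few points during training.
for step in (0, 500, 7500, 50000, 100000):
    print(step, decayed_learning_rate(step))
```

Under these assumptions the rate drops to 0.7 after step 7500 and is below 0.01 by the end of the 100000 training steps, so most of the learning has to happen in the first part of training. Raising `decay_steps` or `decay_rate` would slow the decay.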