Why does learning rate change?

owen · April 14, 2020, 4:49am

When I train my model, the learning rate increases at a constant rate until 0.001, then decreases at a constant rate afterwards. Why is this? Why isn’t the learning rate set to be constant?

My config is:

model_dir: my_model

data:
  train_features_file: source_train.sp
  train_labels_file: target_train.sp
  eval_features_file: source_val.sp
  eval_labels_file: target_val.sp
  source_vocabulary: final.vocab
  target_vocabulary: final.vocab

train:
  save_checkpoints_steps: 1000

eval:
  external_evaluators: BLEU

infer:
  batch_size: 32

The command I use for training is:
onmt-main --model_type Transformer --config config.yml --auto_config train --with_eval

guillaumekln · April 14, 2020, 7:08am

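You are using the Transformer configuration, which applies the learning rate schedule from the Transformer paper ("Attention Is All You Need"): a linear warmup followed by a decay proportional to the inverse square root of the step number. See section 5.3 of the paper.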
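For readers who want to see the shape of that schedule, here is a minimal plain-Python sketch of the formula from section 5.3 of the paper. It is not OpenNMT-tf's implementation, which may differ in details such as a step offset. The function name `noam_learning_rate` is made up for illustration, and the `scale` multiplier is an assumption on top of the paper's formula; the default values (2.0, 512, 8000) are the ones quoted later in this thread.

```python
def noam_learning_rate(step, scale=2.0, model_dim=512, warmup_steps=8000):
    """Learning rate at a given training step (illustrative sketch only)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at step 0
    # During warmup this term is the smaller one: it grows linearly with step.
    warmup_term = step * warmup_steps ** -1.5
    # After warmup this term is the smaller one: it shrinks as 1 / sqrt(step).
    decay_term = step ** -0.5
    return scale * model_dim ** -0.5 * min(warmup_term, decay_term)


# The rate rises until step == warmup_steps, then falls.
for step in (1000, 4000, 8000, 16000, 100000):
    print(step, noam_learning_rate(step))
```

With these values the peak is reached at step 8000 and is just under 0.001, which matches the behaviour described in the original question. Note that the decrease after the peak is not at a constant rate: it follows an inverse square root, so it slows down as training goes on.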

owen · April 14, 2020, 6:47pm

Ah, I see. Thank you

At what stage of the learning rate schedule should I switch to my domain-specific corpus for transfer learning? My parent corpus is 2.8M, and my domain-specific corpus is 40K. I was going to switch over once the learning rate is decreasing and is around 0.0002, but I really have no idea how to choose when. For reference, the learning rate peaks at 0.001 (I edited my original post to reflect this).

guillaumekln · April 14, 2020, 7:02pm

I think it’s easier to reason in number of training steps. Maybe start with 100k training steps on generic data before starting the adaptation. As you gain more experience, you will probably find a better value.

owen · April 14, 2020, 9:03pm

If I wait until 100K steps, the learning rate will have decreased to the minimum (0.000001). Is it okay to start training on the domain-specific corpus at the minimum learning rate?

guillaumekln · April 15, 2020, 7:02am

Here is the learning rate that you get after 100k steps with the Transformer configuration:

>>> import opennmt
>>> opennmt.schedules.NoamDecay(2.0, 512, 8000)(100000)
<tf.Tensor: shape=(), dtype=float32, numpy=0.00027950708>
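To go the other way and estimate the step at which the schedule reaches a given learning rate (for example the 0.0002 mentioned earlier in the thread), the decay phase can be inverted by hand. This is a rough sketch, assuming that after warmup the rate is exactly `scale * model_dim ** -0.5 * step ** -0.5`. The helper name `step_for_learning_rate` is hypothetical and not part of OpenNMT-tf.

```python
def step_for_learning_rate(target_lr, scale=2.0, model_dim=512, warmup_steps=8000):
    """Approximate step (after warmup) at which the rate decays to target_lr."""
    peak_lr = scale * model_dim ** -0.5 * warmup_steps ** -0.5
    if not 0 < target_lr <= peak_lr:
        raise ValueError("target_lr must be positive and no larger than the peak rate")
    # Solve target_lr = scale * model_dim ** -0.5 * step ** -0.5 for step.
    return (scale * model_dim ** -0.5 / target_lr) ** 2


# Roughly how many steps until the rate has decayed to 0.0002?
print(round(step_for_learning_rate(0.0002)))
```

Under these assumptions, 0.0002 is only reached after roughly 195k steps, so at 100k steps the rate is still well above that value.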

Etienne38 · April 17, 2020, 8:29am

By changing the learning rate, the training procedure is doing a kind of “simulated annealing”: