Better BLEU score when warmup_steps is greater than train_steps

I am using the transformer architecture to train a translation system.

However, when I trained with train_steps = 2000 and warmup_steps = 8000 I got 23 BLEU points, whereas with train_steps = 2000 and warmup_steps = 80 I got a BLEU score of only 2! Is there any reason for that?

Here is part of the config.yaml file:


```yaml
valid_steps: 100
train_steps: 2000

# Batching
queue_size: 10000
bucket_size: 32768
world_size: 4
gpu_ranks: [0, 1, 2, 3]
batch_type: "tokens"
batch_size: 4096
valid_batch_size: 8
max_generator_batches: 2
accum_count: [4]
accum_steps: [0]

# Optimization
model_dtype: "fp32"
optim: "adam"
learning_rate: 2
warmup_steps: 8000
decay_method: "noam"
adam_beta2: 0.998
max_grad_norm: 0
label_smoothing: 0.1
param_init: 0
param_init_glorot: true
normalization: "tokens"

# Model
encoder_type: transformer
decoder_type: transformer
position_encoding: true
enc_layers: 6
dec_layers: 6
heads: 4
rnn_size: 512
word_vec_size: 512
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
```

Hi Marwa!

Actually, warmup_steps should not exceed train_steps, since the warm-up steps are part of the total training steps. The result you are getting is because 80 warm-up steps are far too few for the Transformer: with the Noam decay method, the learning rate peaks at step warmup_steps, and that peak grows as warmup_steps shrinks, so a very short warm-up pushes the learning rate up too fast and destabilizes training.
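To see this concretely, here is a minimal sketch of the Noam schedule from the original paper, assuming, as in OpenNMT-py's noam decay, that learning_rate acts as a constant multiplier on the schedule and that rnn_size (512) is the model dimension:

```python
# Minimal sketch of the Noam learning-rate schedule
# ("Attention Is All You Need", Vaswani et al., 2017):
#   lr = learning_rate * d_model**-0.5 * min(step**-0.5, step * warmup**-1.5)
# Assumes learning_rate is a constant multiplier and d_model = rnn_size = 512,
# matching the config above.

def noam_lr(step, warmup_steps, d_model=512, learning_rate=2.0):
    """Effective learning rate at a given training step."""
    scale = d_model ** -0.5
    return learning_rate * scale * min(step ** -0.5, step * warmup_steps ** -1.5)

for warmup in (80, 8000):
    peak = noam_lr(warmup, warmup)   # the peak is reached at step == warmup
    end = noam_lr(2000, warmup)      # value when training stops at step 2000
    print(f"warmup={warmup}: peak lr = {peak:.2e}, lr at step 2000 = {end:.2e}")
```

Under these assumptions, warmup_steps = 80 peaks at roughly 1e-2 within the first 80 steps, about ten times the roughly 1e-3 peak that warmup_steps = 8000 would reach, while with warmup_steps = 8000 the run ends at step 2000 still inside the warm-up, at a gentle learning rate of about 2.5e-4.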

You can check more explanations about what warm-up steps do here and here.

Moreover, I cannot help but refer you to the original paper, "Attention Is All You Need", and a couple of useful resources:

All the best,
Yasmin
