Odd validation curve as training batch size changes in OpenNMT-tf 2.18.1

I am seeing some strange behavior during training. I train on many small (~50k-sentence) files, running a validation evaluation after a single pass through each file. I use a Transformer with relative position representations. My source inputter is a SequenceRecordInputter, and my target inputter is a WordEmbedder. The systems train for days, slowly appearing to converge.

I occasionally had out-of-memory problems (though not in the three cases I’ll show below). I changed the training batch size from the default (about 3000) to 111, and the training runs picked up the reduced batch_size from the configuration yml as the bash script cycled through the training files. Unexpectedly, the validation score went up by 1-3 BLEU points, and the loss dropped dramatically.

I returned to the large batch_size, and the validation scores got worse. I repeated this and got another valley in the loss (with a corresponding peak in BLEU), and this happened for every one of the nine models I was training!

The figure below shows validation loss curves for three models. In each curve you can see three drops in the loss, each caused by a decrease in batch size.

[image: validation loss curves for the three models]

I don’t understand this behavior. With any nonconvex optimization there is a chance of local optima and some dependence on batch size, but I would expect that mostly to affect the slope of the validation curve. In particular, the dramatic, repeated drop in quality when I increased batch_size is a mystery to me.

Has anyone else seen this behavior, or have an explanation for it? Could it be related to a bug in OpenNMT-tf 2.18.1? (“Support” could well be a better category for this post than “Research”.)

Thanks in advance for any ideas,
Gerd

Can you post your training and model configurations?

Certainly. Apologies for the delay. I was away.

Training command:

onmt-main \
--config REL_L7_N1_W20_M24_D0/config.yml \
--model REL_L7_N1_W20_M24_D0/model.py \
--auto_config \
train --with_eval \
--num_gpus 2

model.py:

import opennmt as onmt

def model():
  return onmt.models.Transformer(
      # Source side: pre-extracted 480-dimensional feature vectors.
      source_inputter=onmt.inputters.SequenceRecordInputter(input_depth=480),
      target_inputter=onmt.inputters.WordEmbedder(embedding_size=512),
      num_layers=7,
      num_units=1024,
      num_heads=32,
      ffn_inner_dim=64,
      # No position encoder; use relative position representations instead.
      position_encoder_class=None,
      maximum_relative_position=24,
      dropout=0.00,
      attention_dropout=0.00)

config.yml:

model_dir: REL_L7_N1_W20_M24_D0
data:
  train_features_file:
    REL_L7_N1_W20_M24_D0/data/subtrain.ru.vtf.gz
  train_labels_file: REL_L7_N1_W20_M24_D0/data/subtrain.sp32k.en
  eval_features_file:
    REL_L7_N1_W20_M24_D0/data/valid.ru.vtf.gz
  eval_labels_file: REL_L7_N1_W20_M24_D0/data/valid.sp32k.en
  target_vocabulary: REL_L7_N1_W20_M24_D0/data/wmt19-ruen-en-32k.onmt.vocab
train:
  batch_size: 111
  save_checkpoints_steps: 10000
  maximum_features_length: 2500
  maximum_labels_length: 200
  max_step: null
  single_pass: true
eval:
  external_evaluators: BLEU
  steps: 1000000000

This config.yml is for the small training batch size (with good performance). For the large training batch size (with poor performance), I remove the batch_size line from the “train:” section.

The odd behavior persists after another week of training. I caused another mesa (a plateau at higher loss) in the loss curve by temporarily increasing batch_size:
[image: loss curve with a mesa during the period of increased batch_size]
There is a corresponding valley in the validation BLEU score.

Any thoughts would be appreciated!

Gerd

You are training with --auto_config, which sets the following parameters for Transformer models:

train:
  batch_type: tokens
  effective_batch_size: 25000

Since you set the batch size to 111, I’m not sure you realized that this value is interpreted as a number of tokens. I suggest overriding all batch-related parameters to make sure they are consistent. You could also disable gradient accumulation for now (set effective_batch_size: null).
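
For example, something like this (just a sketch with placeholder values to adapt to your memory budget):

train:
  batch_size: 4096             # per-replica batch size, counted in tokens (placeholder value)
  batch_type: tokens           # make the unit explicit
  effective_batch_size: null   # disable gradient accumulation for now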

If I understand correctly, you have multiple training files. That could explain the issue if the data is not properly shuffled. What happens if you concatenate all of the training data?

Guillaume,

I think batch_type may be my main problem. I’ll try setting it to examples and see what happens.

I’ll also experiment with disabling gradient accumulation.
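
Concretely, I’m thinking of trying something like this in the train section (untested, and the example count is a placeholder I will tune against GPU memory):

train:
  batch_size: 64               # now counted in sentences (placeholder value)
  batch_type: examples
  effective_batch_size: null   # no gradient accumulation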

For each training run, I randomly draw 50 thousand lines from a 6-million-line training set, so the data are shuffled. I believe my SequenceRecord files are too large for me to pass all the training data at once.
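
If it would help with shuffling, I might also try listing several training files directly in the data section rather than concatenating them, which I believe OpenNMT-tf supports; something like this untested sketch with placeholder file names:

data:
  train_features_file:
    - REL_L7_N1_W20_M24_D0/data/part1.ru.vtf.gz
    - REL_L7_N1_W20_M24_D0/data/part2.ru.vtf.gz
  train_labels_file:
    - REL_L7_N1_W20_M24_D0/data/part1.sp32k.en
    - REL_L7_N1_W20_M24_D0/data/part2.sp32k.en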

Thanks for your help,
Gerd