RuntimeError: Model diverged with loss = NaN

yaren · December 4, 2019, 5:15am

Hi, I am using OpenNMT-tf 2.2.1 with this configuration:

data:
  eval_features_file: v.fr
  eval_labels_file: v.en
  source_vocabulary: fr.vocab
  target_vocabulary: en.vocab
  train_alignments: all.fr.en.corpus.align.shuf
  train_features_file: all.fr.punk.tok.case.bpe.shuf
  train_labels_file: all.en.punk.tok.case.bpe.shuf
eval:
  batch_size: 16
  eval_delay: 5400
  exporters: last
  external_evaluators: BLEU
  save_eval_predictions: true
infer:
  batch_size: 64
  with_alignments: hard
model_dir: run2
params:
  average_loss_in_time: true
  beam_width: 5
  decay_params:
    model_dim: 1024
    warmup_steps: 80000
  gradients_accum: 1
  guided_alignment_type: ce
  guided_alignment_weight: 1
  replace_unknown_target: true
score:
  batch_size: 16
train:
  average_last_checkpoints: 5
  batch_size: 3072
  batch_type: tokens
  bucket_width: 5
  effective_batch_size: 25600
  keep_checkpoint_max: 50
  maximum_features_length: 100
  maximum_labels_length: 100
  num_threads: 8
  save_checkpoints_secs: 5000
  save_summary_steps: 250
  train_steps: null
  single_pass: true
  max_step: 2000000

but I get this error:

INFO:tensorflow:Step = 502500 ; source words/s = 20894, target words/s = 18818 ; Learning rate = 0.000088 ; Loss = 2.537182
Traceback (most recent call last):
  File "/root/pyenv2/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/root/pyenv2/lib/python3.5/site-packages/opennmt/bin/main.py", line 189, in main
    checkpoint_path=args.checkpoint_path)
  File "/root/pyenv2/lib/python3.5/site-packages/opennmt/runner.py", line 196, in train
    export_on_best=eval_config.get("export_on_best"))
  File "/root/pyenv2/lib/python3.5/site-packages/opennmt/training.py", line 207, in __call__
    break
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/pyenv2/lib/python3.5/site-packages/tensorflow_core/python/ops/summary_ops_v2.py", line 236, in as_default
    yield self
  File "/root/pyenv2/lib/python3.5/site-packages/opennmt/training.py", line 182, in __call__
    raise RuntimeError("Model diverged with loss = NaN.")
RuntimeError: Model diverged with loss = NaN.

Some more details about the problem:

I tried on two servers, each with 2 V100 GPUs with 32 GB of memory, and the error occurs on every epoch, even after decreasing batch_size and effective_batch_size.
I also tried on two servers, each with 2 V100 GPUs with 16 GB of memory, and it works fine (with a small batch_size).
I found that it happens when saving the last checkpoint, at the end of every epoch.

guillaumekln · December 4, 2019, 7:11am

Hi,

Can you post the command line that you used for training?

yaren · December 4, 2019, 9:19am

I used this command:

CUDA_VISIBLE_DEVICES=0,1 onmt-main --model_type Transformer --config con.yml --auto_config train --with_eval --num_gpus 2 1>>rr01.log 2>&1

guillaumekln · December 4, 2019, 9:31am

How exactly did you find that it happens when saving the last checkpoint?

As you run epoch-by-epoch training, this could be related to the handling of the last batch. To validate that, you can try training on a single GPU or disabling single_pass and see if you also get the error.

yaren · December 4, 2019, 10:02am

Sorry, I am not sure how to distinguish "batch" and "epoch" in English here. I'll try to describe it this way.

I used:

single_pass: true

and my shell script looks like:

CUDA_VISIBLE_DEVICES=0,1 onmt-main --model_type Transformer --config con.yml --auto_config train --with_eval --num_gpus 2 1>>rr01.log 2>&1
CUDA_VISIBLE_DEVICES=0,1 onmt-main --model_type Transformer --config con.yml --auto_config train --with_eval --num_gpus 2 1>>rr02.log 2>&1
…

The error happens at the end of rr01.log, rr02.log, rr03.log, and so on.
So the folder

run/avg

is never created. Every single_pass run ends with that error.

guillaumekln · December 4, 2019, 11:23am

Does the error happen at the same step each time?
Did you train the same way to reach the step 502500?

yaren · December 4, 2019, 11:29am

rr04.log last step info:

INFO:tensorflow:Step = 277250 ; source words/s = 13904, target words/s = 12526 ; Learning rate = 0.000119 ; Loss = 2.570395
Traceback (most recent call last):…

rr05:

INFO:tensorflow:Step = 352500 ; source words/s = 20875, target words/s = 18804 ; Learning rate = 0.000105 ; Loss = 2.581400

rr06:

INFO:tensorflow:Step = 427500 ; source words/s = 20229, target words/s = 18215 ; Learning rate = 0.000096 ; Loss = 2.600580

rr07:

INFO:tensorflow:Step = 502500 ; source words/s = 20894, target words/s = 18818 ; Learning rate = 0.000088 ; Loss = 2.537182

rr08:

INFO:tensorflow:Step = 577500 ; source words/s = 21171, target words/s = 19079 ; Learning rate = 0.000082 ; Loss = 2.580826

The training has gone beyond 502500 steps.

This only happens on NVIDIA V100 GPUs with 32 GB of memory.

Let me also describe how I proceeded:

I first trained en->fr on V100 GPUs (16 GB of memory), and it went fine without any problem (batch_size: 3072; effective_batch_size: 25600).

Then I trained fr->en, but on V100 GPUs (32 GB of memory).
Because the memory is bigger, I started the training with batch_size: 7680 and effective_batch_size: 30720. When the first few epochs ended with the error, I reduced batch_size to 5120, and then to 3072 (meanwhile reducing effective_batch_size to 25600). But every epoch still ended with this error.
So I suspect something went wrong from the first epoch.

BTW:
I tried on two servers with V100 (32 GB) GPUs, and every run ended with this error.
All the servers I mentioned have 2 GPUs each (2 x V100 16 GB, or 2 x V100 32 GB).
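
As a sanity check on the log lines above, the logged learning rates can be compared against a Noam-style schedule (linear warmup followed by inverse square root decay). The sketch below is only an illustration: the function name `noam_learning_rate` is hypothetical, `model_dim` and `warmup_steps` are taken from `decay_params` in the posted configuration, and the scale factor of 2.0 is an assumption, since the configuration does not set it explicitly.

```python
def noam_learning_rate(step, model_dim=1024, warmup_steps=80000, scale=2.0):
    """Noam-style schedule: linear warmup, then inverse square root decay.

    model_dim and warmup_steps come from decay_params in the posted config.
    scale=2.0 is an assumption; it does not appear in that config.
    """
    return scale * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Steps taken from the last log lines of rr04.log to rr08.log.
for step in (277250, 352500, 427500, 502500, 577500):
    print(step, "%.6f" % noam_learning_rate(step))
```

Under these assumptions the printed values match the learning rates in the logs (0.000119 down to 0.000082), which suggests the learning rate itself was following its normal decay when the NaN loss appeared.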

guillaumekln · December 4, 2019, 3:31pm

Thanks for all the info. It looks like an issue we fixed a few weeks ago; the fix was released in 2.2.1.

What happened was that, at the end of training on a finite dataset, one or more GPUs could receive an empty batch and produce a NaN loss.

I’m not sure what conditions could still trigger the error.
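
The empty-batch failure mode described above can be illustrated with a small, library-free sketch. It is not OpenNMT-tf's actual code: the helper names (`replica_loss`, `naive_mean`, `guarded_mean`) are hypothetical, and the empty-batch result is modeled explicitly as NaN to mimic a 0/0 in floating-point tensor arithmetic.

```python
import math


def replica_loss(token_losses):
    """Mean per-token loss on one replica (GPU).

    An empty batch has zero tokens; the 0/0 is modeled here as NaN,
    which is what floating-point tensor arithmetic would produce.
    """
    if not token_losses:
        return math.nan
    return sum(token_losses) / len(token_losses)


def naive_mean(replica_batches):
    """Average the per-replica losses; one NaN poisons the result."""
    losses = [replica_loss(batch) for batch in replica_batches]
    return sum(losses) / len(losses)


def guarded_mean(replica_batches):
    """Normalize by the total token count so empty replicas are ignored."""
    total = sum(sum(batch) for batch in replica_batches)
    count = sum(len(batch) for batch in replica_batches)
    return total / count if count else None


# Last step of an epoch: the second GPU receives no examples.
last_step = [[2.5, 2.6], []]
print(math.isnan(naive_mean(last_step)))
print(guarded_mean(last_step))
```

With two replicas where the second one is empty, the naive average is NaN even though the first replica's loss is perfectly normal, which would explain why the last logged loss before the error looks healthy.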

yaren · December 5, 2019, 1:55am

I have uploaded the two logs:

Download logs.zip from upload.run <== There is no password for the zip file.

One ends with the error; the other one completes fine with single_pass. They are from two different servers.
I hope they give some clues.

guillaumekln · December 5, 2019, 12:03pm

This website does not work. Can you upload the logs elsewhere?

yaren · December 5, 2019, 2:13pm

https://tmpfiles.org/dl/33400/logs.zip

I really appreciate your patient help

steremma · February 21, 2020, 9:51am

I am having a very similar issue, albeit with a different configuration and ONMT version (ONMT 1.25.1, tf 1.14). I am trying to fine-tune a model using in-domain data. For that I use update_vocab, since my BPE now has about 5 extra tokens, and then launch the training on a single GPU. The model diverges on the very first step, which is weird as it is already trained to a good extent and its learning rate starts at 1.24 * 10^-4. The exact same issue occurs when I try with 4 GPUs. My config does not override any param.

The weird thing is that I am replicating the exact same process across languages and about half my models work whereas the other half diverge.

Here is an extract of my logs (I do not include deprecation warnings):

WARNING:tensorflow:You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.

INFO:tensorflow:Using parameters:
data:
  eval_features_file: /mnt/tmpfs/ende-mdt/valid.en.tok
  eval_labels_file: /mnt/tmpfs/ende-mdt/valid.de.tok
  source_words_vocabulary: /mnt/tmpfs/ende-mdt/shared.bpe
  target_words_vocabulary: /mnt/tmpfs/ende-mdt/shared.bpe
  train_features_file: /mnt/tmpfs/ende-mdt/train.en.tok
  train_labels_file: /mnt/tmpfs/ende-mdt/train.de.tok
eval:
  batch_size: 32
  eval_delay: 70
  exporters: last
  external_evaluators: BLEU
infer:
  batch_size: 32
  bucket_width: 5
model_dir: /mnt/work/onmt-tf/booking_ft/model_ende
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: noam_decay_v2
  label_smoothing: 0.1
  learning_rate: 2.0
  optimizer: LazyAdamOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  bucket_width: 1
  effective_batch_size: 25000
  keep_checkpoint_max: 500
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_checkpoints_steps: 500
  save_summary_steps: 100
  train_steps: 800000

INFO:tensorflow:Accumulate gradients of 9 iterations to reach effective batch size of 25000
INFO:tensorflow:Training on 5289301 examples
INFO:tensorflow:Restoring parameters from /mnt/work/onmt-tf/model_ende/model.ckpt-520000
INFO:tensorflow:Saving checkpoints for 520000 into /mnt/work/onmt-tf/booking_ft/model_ende/model.ckpt.

2020-02-21 09:03:13.836095: I tensorflow/core/kernels/lookup_util.cc:376] Table trying to initialize from file /mnt/tmpfs/ende-mdt/shared.bpe is already initialized.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 520000 into /mnt/work/onmt-tf/booking_ft/model_ende/model.ckpt.
2020-02-21 09:04:30.900494: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-02-21 09:04:41.208100: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 1262352 of 5289301
2020-02-21 09:04:51.733180: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 2541916 of 5289301
2020-02-21 09:05:01.488317: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 3712250 of 5289301
2020-02-21 09:05:11.239444: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 4830157 of 5289301
2020-02-21 09:05:14.674618: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:162] Shuffle buffer filled.
INFO:tensorflow:loss = 2.5856829, step = 520000
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "/mnt/work/home/estergiadis/tf1.14/bin/onmt-main", line 10, in <module>
    sys.exit(main())
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/opennmt/runner.py", line 301, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    saving_listeners=saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
    saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1484, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
    run_metadata=run_metadata)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
    raise six.reraise(*original_exc_info)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
    return self._sess.run(*args, **kwargs)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1419, in run
    run_metadata=run_metadata))
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
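
For reference, the "Accumulate gradients of 9 iterations" line in the log above is consistent with rounding up the ratio between effective_batch_size and the per-step batch size. The sketch below assumes that relationship; the function name `accumulation_steps` and the `num_replicas` parameter are hypothetical, and only the single-GPU result of 9 is confirmed by the log.

```python
import math


def accumulation_steps(effective_batch_size, batch_size, num_replicas=1):
    """Number of gradient accumulation steps needed to reach the
    effective batch size, assuming a round-up of the ratio.

    num_replicas is the number of GPUs, each processing batch_size tokens.
    """
    per_step = batch_size * num_replicas
    return max(math.ceil(effective_batch_size / per_step), 1)


# Values from the config above: effective_batch_size 25000, batch_size 3072, 1 GPU.
print(accumulation_steps(25000, 3072))
```

With these values the ratio is about 8.14, so 9 iterations are accumulated before each parameter update.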

guillaumekln · February 21, 2020, 9:55am

You should try with 1.25.2. It reverts a change related to the vocabulary weights update that caused NaN loss issues in some cases.

steremma · February 21, 2020, 11:02am

you are a life saver