OpenNMT Forum

RuntimeError: Model diverged with loss = NaN

Hi, I am using OpenNMT-tf 2.2.1 with this configuration:

data:
  eval_features_file: v.fr
  eval_labels_file: v.en
  source_vocabulary: fr.vocab
  target_vocabulary: en.vocab
  train_alignments: all.fr.en.corpus.align.shuf
  train_features_file: all.fr.punk.tok.case.bpe.shuf
  train_labels_file: all.en.punk.tok.case.bpe.shuf
eval:
  batch_size: 16
  eval_delay: 5400
  exporters: last
  external_evaluators: BLEU
  save_eval_predictions: true
infer:
  batch_size: 64
  with_alignments: hard
model_dir: run2
params:
  average_loss_in_time: true
  beam_width: 5
  decay_params:
    model_dim: 1024
    warmup_steps: 80000
  gradients_accum: 1
  guided_alignment_type: ce
  guided_alignment_weight: 1
  replace_unknown_target: true
score:
  batch_size: 16
train:
  average_last_checkpoints: 5
  batch_size: 3072
  batch_type: tokens
  bucket_width: 5
  effective_batch_size: 25600
  keep_checkpoint_max: 50
  maximum_features_length: 100
  maximum_labels_length: 100
  num_threads: 8
  save_checkpoints_secs: 5000
  save_summary_steps: 250
  train_steps: null
  single_pass: true
  max_step: 2000000

but this error happened:

INFO:tensorflow:Step = 502500 ; source words/s = 20894, target words/s = 18818 ; Learning rate = 0.000088 ; Loss = 2.537182
Traceback (most recent call last):
  File "/root/pyenv2/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/root/pyenv2/lib/python3.5/site-packages/opennmt/bin/main.py", line 189, in main
    checkpoint_path=args.checkpoint_path)
  File "/root/pyenv2/lib/python3.5/site-packages/opennmt/runner.py", line 196, in train
    export_on_best=eval_config.get("export_on_best"))
  File "/root/pyenv2/lib/python3.5/site-packages/opennmt/training.py", line 207, in __call__
    break
  File "/usr/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/pyenv2/lib/python3.5/site-packages/tensorflow_core/python/ops/summary_ops_v2.py", line 236, in as_default
    yield self
  File "/root/pyenv2/lib/python3.5/site-packages/opennmt/training.py", line 182, in __call__
    raise RuntimeError("Model diverged with loss = NaN.")
RuntimeError: Model diverged with loss = NaN.

Let me describe the problem:

  1. I tried on two servers, each with 2 V100 32GB GPUs, and the error is unavoidable in every epoch, even after I decreased batch_size and effective_batch_size.
  2. I tried on two servers with 2 V100 16GB GPUs, and it works fine (with a smaller batch_size).
  3. I found it happens when the last checkpoint is saved, at the end of every epoch.

Hi,

Can you post the command line that you used for training?

I used this command:

CUDA_VISIBLE_DEVICES=0,1 onmt-main --model_type Transformer --config con.yml --auto_config train --with_eval --num_gpus 2 1>>rr01.log 2>&1

How did you find that exactly?

As you run epoch-by-epoch training, this could be related to the handling of the last batch. To validate that, you can try training on a single GPU or disabling single_pass and see if you still get the error.
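
Two quick checks along those lines, adapting your own command line (the log file names below are just placeholders):

CUDA_VISIBLE_DEVICES=0 onmt-main --model_type Transformer --config con.yml --auto_config train --with_eval --num_gpus 1 1>>rr_single_gpu.log 2>&1

and, after setting single_pass: false (or removing it) in con.yml:

CUDA_VISIBLE_DEVICES=0,1 onmt-main --model_type Transformer --config con.yml --auto_config train --with_eval --num_gpus 2 1>>rr_no_single_pass.log 2>&1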

Sorry, I can't quite distinguish "batch" and "epoch" here in English, so I'll try to describe it this way.

I used:

single_pass: true

and my shell like:

CUDA_VISIBLE_DEVICES=0,1 onmt-main --model_type Transformer --config con.yml --auto_config train --with_eval --num_gpus 2 1>>rr01.log 2>&1
CUDA_VISIBLE_DEVICES=0,1 onmt-main --model_type Transformer --config con.yml --auto_config train --with_eval --num_gpus 2 1>>rr02.log 2>&1

and it happened at the end of rr01.log/rr02.log/rr03.log…
So the folder

run/avg

is never created; every single_pass run ends with that error.

  • Does the error happen at the same step each time?
  • Did you train the same way to reach step 502500?

rr04.log last step info:

INFO:tensorflow:Step = 277250 ; source words/s = 13904, target words/s = 12526 ; Learning rate = 0.000119 ; Loss = 2.570395
Traceback (most recent call last):…

rr05:

INFO:tensorflow:Step = 352500 ; source words/s = 20875, target words/s = 18804 ; Learning rate = 0.000105 ; Loss = 2.581400

rr06:

INFO:tensorflow:Step = 427500 ; source words/s = 20229, target words/s = 18215 ; Learning rate = 0.000096 ; Loss = 2.600580

rr07:

INFO:tensorflow:Step = 502500 ; source words/s = 20894, target words/s = 18818 ; Learning rate = 0.000088 ; Loss = 2.537182

rr08:

INFO:tensorflow:Step = 577500 ; source words/s = 21171, target words/s = 19079 ; Learning rate = 0.000082 ; Loss = 2.580826

It's now more than 502500 steps.

This only happens on NVIDIA V100 GPUs with 32GB of RAM.

Let me add a description of how I did it:

I first trained en->fr on V100s (16GB GPU RAM), and it ran without any problem (batch_size: 3072; effective_batch_size: 25600).

Then I trained fr->en, but on V100s (32GB GPU RAM).
Because the RAM is bigger, I started the training with batch_size: 7680 and effective_batch_size: 30720. When the first few passes finished with the error, I reduced the batch_size to 5120, then to 3072 (and reduced effective_batch_size back to 25600 at the same time). But every pass ended with this error.
So I suspect something has been wrong since the first pass.

BTW:
I tried on two servers with V100 (32GB), and both end with this error.
All the servers I mentioned have 2 GPUs (V100 x 2 x 16GB, V100 x 2 x 32GB).

Thanks for all the info. It looks like an issue we solved a few weeks ago and released in 2.2.1:

What happened was at the end of the training on a finite dataset, one or more GPUs could receive an empty batch and produce a NaN loss.
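
To make the failure mode concrete, here is a minimal sketch of my own (not the actual OpenNMT-tf code): with average_loss_in_time enabled, the reported loss is the sum of per-token losses divided by the number of target tokens, so an empty batch turns that reduction into 0 / 0.

import tensorflow as tf

# Illustration only, not the OpenNMT-tf implementation.
per_token_loss = tf.zeros([0])                    # an empty batch contributes no token losses
num_target_tokens = tf.reduce_sum(tf.zeros([0]))  # ... and zero target tokens
loss = tf.reduce_sum(per_token_loss) / num_target_tokens
print(loss.numpy())  # nan -> "Model diverged with loss = NaN."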

I’m not sure what conditions could still trigger the error.
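
Also, since the crash happens before the final checkpoint averaging step, run2/avg is never created. If I remember the 2.x command line correctly, you should be able to build the averaged checkpoint manually from the existing checkpoints with something like:

onmt-main --config con.yml --auto_config average_checkpoints --output_dir run2/avg --max_count 5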

I have uploaded the two logs:

Download logs.zip from upload.run <== There is no password for the zip file.

One ends with the error; the other one finishes the single_pass run without problems. They are from two different servers.
Hoping to find some clues.

This website does not work. Can you upload the logs elsewhere?

https://tmpfiles.org/dl/33400/logs.zip

I really appreciate your patient help.