I am having a very similar issue, albeit with a different configuration and ONMT version (OpenNMT-tf 1.25.1, TF 1.14). I am trying to fine-tune a model on in-domain data. For that I use update_vocab, since my BPE now has about 5 extra tokens, and then launch training on a single GPU. The loss diverges on the very first step, which is weird because the model is already trained to a good extent and its learning rate resumes at 1.24 * 10^-4. The exact same issue occurs when I train on 4 GPUs. My config does not override any parameters.
The weird thing is that I am replicating the exact same process across language pairs, and about half of my models train fine whereas the other half diverge.
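For what it's worth, the resume learning rate matches a rough back-of-the-envelope check, assuming noam_decay_v2 follows the usual Noam formula (this is just my sanity check, not the exact OpenNMT-tf implementation):

# Assumed Noam schedule:
#   lr = learning_rate * model_dim**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
def noam_lr(step, learning_rate=2.0, model_dim=512, warmup_steps=8000):
    """Learning rate at a given global step under the assumed Noam schedule."""
    scale = learning_rate * model_dim ** -0.5
    return scale * min(step ** -0.5, step * warmup_steps ** -1.5)

print(noam_lr(520000))  # ~1.23e-4, in line with the ~1.24e-4 I see at restore time

So at step 520000 the learning rate is tiny, which is why an immediate NaN surprises me.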
Here is an extract of my logs (I do not include deprecation warnings):
WARNING:tensorflow:You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.
INFO:tensorflow:Using parameters:
data:
  eval_features_file: /mnt/tmpfs/ende-mdt/valid.en.tok
  eval_labels_file: /mnt/tmpfs/ende-mdt/valid.de.tok
  source_words_vocabulary: /mnt/tmpfs/ende-mdt/shared.bpe
  target_words_vocabulary: /mnt/tmpfs/ende-mdt/shared.bpe
  train_features_file: /mnt/tmpfs/ende-mdt/train.en.tok
  train_labels_file: /mnt/tmpfs/ende-mdt/train.de.tok
eval:
  batch_size: 32
  eval_delay: 70
  exporters: last
  external_evaluators: BLEU
infer:
  batch_size: 32
  bucket_width: 5
model_dir: /mnt/work/onmt-tf/booking_ft/model_ende
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: noam_decay_v2
  label_smoothing: 0.1
  learning_rate: 2.0
  optimizer: LazyAdamOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  bucket_width: 1
  effective_batch_size: 25000
  keep_checkpoint_max: 500
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_checkpoints_steps: 500
  save_summary_steps: 100
  train_steps: 800000
INFO:tensorflow:Accumulate gradients of 9 iterations to reach effective batch size of 25000
INFO:tensorflow:Training on 5289301 examples
INFO:tensorflow:Restoring parameters from /mnt/work/onmt-tf/model_ende/model.ckpt-520000
INFO:tensorflow:Saving checkpoints for 520000 into /mnt/work/onmt-tf/booking_ft/model_ende/model.ckpt.
2020-02-21 09:03:13.836095: I tensorflow/core/kernels/lookup_util.cc:376] Table trying to initialize from file /mnt/tmpfs/ende-mdt/shared.bpe is already initialized.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 520000 into /mnt/work/onmt-tf/booking_ft/model_ende/model.ckpt.
2020-02-21 09:04:30.900494: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-02-21 09:04:41.208100: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 1262352 of 5289301
2020-02-21 09:04:51.733180: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 2541916 of 5289301
2020-02-21 09:05:01.488317: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 3712250 of 5289301
2020-02-21 09:05:11.239444: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 4830157 of 5289301
2020-02-21 09:05:14.674618: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:162] Shuffle buffer filled.
INFO:tensorflow:loss = 2.5856829, step = 520000
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "/mnt/work/home/estergiadis/tf1.14/bin/onmt-main", line 10, in <module>
    sys.exit(main())
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/opennmt/runner.py", line 301, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    saving_listeners=saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
    saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1484, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
    run_metadata=run_metadata)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
    raise six.reraise(*original_exc_info)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
    return self._sess.run(*args, **kwargs)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1419, in run
    run_metadata=run_metadata))
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
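In case it matters, this is roughly how I count the extra tokens the in-domain BPE introduces before running the vocab update (file names below are placeholders for my actual vocab files):

# Compare the vocab the checkpoint was trained with against the new in-domain vocab.
old_vocab = set(line.strip() for line in open("shared.bpe.old", encoding="utf-8"))
new_vocab = set(line.strip() for line in open("shared.bpe", encoding="utf-8"))

added = new_vocab - old_vocab
removed = old_vocab - new_vocab
print(f"added: {len(added)}, removed: {len(removed)}")  # 'added' is ~5 in my case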