I am having a very similar issue, albeit with a different configuration and ONMT version (OpenNMT-tf 1.25.1, TF 1.14). I am trying to fine-tune a model on in-domain data. For that I use update_vocab, since my BPE vocabulary now has about 5 extra tokens, and then launch training on a single GPU. Training diverges on the very first step, which is odd since the model is already trained to a good extent and its learning rate resumes at about 1.24 * 10^-4. The exact same issue occurs when I train on 4 GPUs. My config does not override any param.
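As a sanity check on that value: assuming noam_decay_v2 follows the standard Transformer schedule, i.e. lr = learning_rate * model_dim**-0.5 * min(step**-0.5, step * warmup_steps**-1.5) (the exact OpenNMT-tf implementation may differ slightly), plugging in the values from the config below reproduces it at the resume step:

def noam_lr(step, learning_rate=2.0, model_dim=512, warmup_steps=8000):
    # Standard "Attention Is All You Need" schedule, assumed here to match noam_decay_v2.
    return learning_rate * model_dim ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(noam_lr(520000))  # ~1.2e-4 -- far too small to blow up an already converged model on its own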
The weird thing is that I am replicating the exact same process across language pairs, and about half of my models train fine while the other half diverge.
Here is an extract of my logs (I do not include deprecation warnings):
WARNING:tensorflow:You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.
INFO:tensorflow:Using parameters:
data:
  eval_features_file: /mnt/tmpfs/ende-mdt/valid.en.tok
  eval_labels_file: /mnt/tmpfs/ende-mdt/valid.de.tok
  source_words_vocabulary: /mnt/tmpfs/ende-mdt/shared.bpe
  target_words_vocabulary: /mnt/tmpfs/ende-mdt/shared.bpe
  train_features_file: /mnt/tmpfs/ende-mdt/train.en.tok
  train_labels_file: /mnt/tmpfs/ende-mdt/train.de.tok
eval:
  batch_size: 32
  eval_delay: 70
  exporters: last
  external_evaluators: BLEU
infer:
  batch_size: 32
  bucket_width: 5
model_dir: /mnt/work/onmt-tf/booking_ft/model_ende
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: noam_decay_v2
  label_smoothing: 0.1
  learning_rate: 2.0
  optimizer: LazyAdamOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  bucket_width: 1
  effective_batch_size: 25000
  keep_checkpoint_max: 500
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_checkpoints_steps: 500
  save_summary_steps: 100
  train_steps: 800000
INFO:tensorflow:Accumulate gradients of 9 iterations to reach effective batch size of 25000
INFO:tensorflow:Training on 5289301 examples
INFO:tensorflow:Restoring parameters from /mnt/work/onmt-tf/model_ende/model.ckpt-520000
INFO:tensorflow:Saving checkpoints for 520000 into /mnt/work/onmt-tf/booking_ft/model_ende/model.ckpt.
2020-02-21 09:03:13.836095: I tensorflow/core/kernels/lookup_util.cc:376] Table trying to initialize from file /mnt/tmpfs/ende-mdt/shared.bpe is already initialized.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 520000 into /mnt/work/onmt-tf/booking_ft/model_ende/model.ckpt.
2020-02-21 09:04:30.900494: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-02-21 09:04:41.208100: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 1262352 of 5289301
2020-02-21 09:04:51.733180: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 2541916 of 5289301
2020-02-21 09:05:01.488317: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 3712250 of 5289301
2020-02-21 09:05:11.239444: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:111] Filling up shuffle buffer (this may take a while): 4830157 of 5289301
2020-02-21 09:05:14.674618: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:162] Shuffle buffer filled.
INFO:tensorflow:loss = 2.5856829, step = 520000
WARNING:tensorflow:It seems that global step (tf.train.get_global_step) has not been increased. Current value (could be stable): 520000 vs previous value: 520000. You could increase the global step by passing tf.train.get_global_step() to Optimizer.apply_gradients or Opt$
(the warning above is repeated a few more times)
ERROR:tensorflow:Model diverged with loss = NaN.
Traceback (most recent call last):
  File "/mnt/work/home/estergiadis/tf1.14/bin/onmt-main", line 10, in <module>
    sys.exit(main())
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/opennmt/runner.py", line 301, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
    return executor.run()
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 613, in run
    return self.run_local()
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
    saving_listeners=saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 367, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1192, in _train_model_default
    saving_listeners)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1484, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
    run_metadata=run_metadata)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
    raise six.reraise(*original_exc_info)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
    return self._sess.run(*args, **kwargs)
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1419, in run
    run_metadata=run_metadata))
  File "/mnt/work/home/estergiadis/tf1.14/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 761, in after_run
    raise NanLossDuringTrainingError
tensorflow.python.training.basic_session_run_hooks.NanLossDuringTrainingError: NaN loss during training.
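In case it is useful for debugging, this is a minimal sketch of how I plan to rule out bad values in the checkpoint produced by update_vocab before training even starts (plain TF 1.14 checkpoint reader; the model directory is just my own path and should be adjusted):

import numpy as np
import tensorflow as tf

# Point this at the directory holding the checkpoint produced by update_vocab.
ckpt = tf.train.latest_checkpoint("/mnt/work/onmt-tf/booking_ft/model_ende")

reader = tf.train.load_checkpoint(ckpt)
for name in sorted(reader.get_variable_to_shape_map()):
    values = reader.get_tensor(name)
    if np.issubdtype(values.dtype, np.floating):
        nan_count = int(np.isnan(values).sum())
        inf_count = int(np.isinf(values).sum())
        if nan_count or inf_count:
            print("%s: %d NaN, %d Inf" % (name, nan_count, inf_count))

If every variable comes back clean, the NaN would have to be introduced during the first update itself rather than by the vocabulary merge.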