OpenNMT-tf fine-tuning checkpoint failure

I’m unable to start fine-tuning because restoring from the checkpoint fails. The error I get is:

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double

This looks similar to this issue (https://github.com/OpenNMT/OpenNMT-tf/issues/269), but the discussion there has not helped me: the error persists whether I pass --num_gpus 3, --num_gpus 1, or leave the argument out completely, and my environment has not changed.

What I’ve done so far

  1. Trained a general-domain English-French Transformer until good, stable performance was reached
  2. Ran spm_encode --generate_vocabulary on my new-domain English and French files to get the new vocabularies
  3. Tidied these new vocabularies (took the first column with cut -f 1 and inserted <blank> as the first line of both vocabulary files)
  4. Ran an onmt-update-vocab command:
onmt-update-vocab --model experiments/fr_en_v2/avg/ \
    --output_dir=experiments/fr_en_v2/avg/updated \
    --src_vocab=enfr.vocab \
    --tgt_vocab=enfr.vocab \
    --new_src_vocab=new.fr_en.vocab.en \
    --new_tgt_vocab=new.fr_en.vocab.fr
  5. Applied the previously trained SentencePiece model to tokenize the in-domain train, dev and test data, e.g. for the training files:
spm_encode --model=enfr.model < new.en_fr.train.en > new.en_fr.train.en.token
spm_encode --model=enfr.model < new.en_fr.train.fr > new.en_fr.train.fr.token
  6. Made a new config file for the fine-tuning run:
model_dir: experiments/fr_en_v2/avg/updated

data:
  train_features_file: new.en_fr.train.en.token
  train_labels_file: new.en_fr.train.fr.token
  eval_features_file: new.en_fr.dev.en.token
  eval_labels_file: new.en_fr.dev.fr.token
  source_words_vocabulary: new.fr_en.vocab.en
  target_words_vocabulary: new.fr_en.vocab.fr

train:
  save_checkpoints_steps: 1000

eval:
  eval_delay: 60  # Every min
  save_eval_predictions: True
  external_evaluators: BLEU

infer:
  batch_size: 32
  7. Attempted to kick off the fine-tuning run using the command:
CUDA_VISIBLE_DEVICES=0,1,2 onmt-main train_and_eval \
                            --model_type Transformer \
                            --config experiments/fr_en_v2/avg/updated/config.yml --auto_config \
                            --num_gpus 3
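
For completeness, here is a quick debugging sketch (not verified to change anything) of how I could check which dtype the updated checkpoint actually stores for the two optimizer accumulators named in the error; the path matches the model_dir from the config above:

python3 - <<'EOF'
import tensorflow as tf

# Read back the updated checkpoint and print the stored dtypes of the
# optimizer accumulators that the restore op complains about.
reader = tf.train.load_checkpoint("experiments/fr_en_v2/avg/updated")
dtype_map = reader.get_variable_to_dtype_map()
for name in ("optim/beta1_power", "optim/beta2_power"):
    print(name, dtype_map.get(name))
EOF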

Full terminal output

For reference, here is the full terminal output and error message:

CUDA_VISIBLE_DEVICES=0,1,2 onmt-main train_and_eval \
>                             --model_type Transformer \
>                             --config experiments/fr_en_v2/avg/updated/config.yml --auto_config \
>                             --num_gpus 3
/usr/local/lib/python3.5/dist-packages/opennmt/config.py:139: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  subconfig = yaml.load(config_file.read())
WARNING:tensorflow:You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.
INFO:tensorflow:Using parameters:
data:
  eval_features_file: new.en_fr.dev.en.token
  eval_labels_file: new.en_fr.dev.fr.token
  source_words_vocabulary: new.fr_en.vocab.en
  target_words_vocabulary: new.fr_en.vocab.fr
  train_features_file: new.en_fr.train.en.token
  train_labels_file: new.en_fr.train.fr.token
eval:
  batch_size: 32
  eval_delay: 60
  exporters: last
  external_evaluators: BLEU
  save_eval_predictions: true
infer:
  batch_size: 32
  bucket_width: 5
model_dir: experiments/fr_en_v2/avg/updated
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: noam_decay_v2
  label_smoothing: 0.1
  learning_rate: 2.0
  length_penalty: 0.6
  optimizer: LazyAdamOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  bucket_width: 1
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_checkpoints_steps: 500
  save_summary_steps: 100
  train_steps: 500000

INFO:tensorflow:Accumulate gradients of 3 iterations to reach effective batch size of 25000
2019-04-25 15:38:43.476633: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-25 15:38:43.577205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076
pciBusID: 0000:01:00.0
totalMemory: 11.92GiB freeMemory: 11.82GiB
2019-04-25 15:38:43.628597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076
pciBusID: 0000:02:00.0
totalMemory: 11.93GiB freeMemory: 11.82GiB
2019-04-25 15:38:43.680793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076
pciBusID: 0000:03:00.0
totalMemory: 11.93GiB freeMemory: 10.80GiB
2019-04-25 15:38:43.681267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2
2019-04-25 15:38:44.591748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-25 15:38:44.591789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2
2019-04-25 15:38:44.591796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y Y
2019-04-25 15:38:44.591802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N Y
2019-04-25 15:38:44.591808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   Y Y N
2019-04-25 15:38:44.593231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 11435 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2)
2019-04-25 15:38:44.593524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 11436 MB memory) -> physical GPU (device: 1, name: GeForce GTX TITAN X, pci bus id: 0000:02:00.0, compute capability: 5.2)
2019-04-25 15:38:44.593706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 10447 MB memory) -> physical GPU (device: 2, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0, compute capability: 5.2)
INFO:tensorflow:Using config: {'_master': '', '_experimental_distribute': None, '_keep_checkpoint_max': 8, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_session_config': gpu_options {
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    layout_optimizer: OFF
  }
}
, '_service': None, '_save_checkpoints_secs': None, '_save_summary_steps': 100, '_tf_random_seed': None, '_device_fn': None, '_task_type': 'worker', '_task_id': 0, '_num_worker_replicas': 1, '_log_step_count_steps': 300, '_eval_distribute': None, '_train_distribute': None, '_evaluation_master': '', '_model_dir': 'experiments/fr_en_v2/avg/updated', '_protocol': None, '_num_ps_replicas': 0, '_save_checkpoints_steps': 500, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7d9e0e2c50>}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 500 or save_checkpoints_secs None.
INFO:tensorflow:Training on 1315535 examples
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Number of trainable parameters: 77511625
INFO:tensorflow:Graph was finalized.
2019-04-25 15:39:29.283239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2
2019-04-25 15:39:29.283335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-25 15:39:29.283349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2
2019-04-25 15:39:29.283356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y Y
2019-04-25 15:39:29.283363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N Y
2019-04-25 15:39:29.283369: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   Y Y N
2019-04-25 15:39:29.284862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11435 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2)
2019-04-25 15:39:29.285063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11436 MB memory) -> physical GPU (device: 1, name: GeForce GTX TITAN X, pci bus id: 0000:02:00.0, compute capability: 5.2)
2019-04-25 15:39:29.285389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10447 MB memory) -> physical GPU (device: 2, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0, compute capability: 5.2)
INFO:tensorflow:Restoring parameters from experiments/fr_en_v2/avg/updated/model.ckpt-270000
2019-04-25 15:39:30.753455: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Invalid argument: tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
	 [[{{node save/RestoreV2_1}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1546, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
	 [[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/opennmt/runner.py:297)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

Caused by op 'save/RestoreV2_1', defined at:
  File "/usr/local/bin/onmt-main", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/runner.py", line 297, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
    self._saver.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 789, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 459, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
	 [[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/opennmt/runner.py:297)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/onmt-main", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/runner.py", line 297, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/session_manager.py", line 288, in prepare_session
    config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/session_manager.py", line 218, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1582, in restore
    err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
	 [[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/opennmt/runner.py:297)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

Caused by op 'save/RestoreV2_1', defined at:
  File "/usr/local/bin/onmt-main", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/runner.py", line 297, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
    self._saver.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 789, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 459, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
	 [[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/opennmt/runner.py:297)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

Thanks, any advice you might have would be greatly appreciated.

Cheers,
Natasha

Hi,

The procedure looks correct. Are you using the latest version of OpenNMT-tf?

Thanks,

I’m using quite a recent version; do you think that might be the problem?

OpenNMT-tf==1.21.6
numpy==1.15.4
tensorflow-gpu==1.12.0

I’m running Ubuntu 16.04.5 LTS in a Docker container built from the tensorflow:latest-gpu-py3 image.

Full pip freeze output:

absl-py==0.6.1
astor==0.7.1
backcall==0.1.0
bleach==3.0.2
cycler==0.10.0
decorator==4.3.0
defusedxml==0.5.0
entrypoints==0.2.3
gast==0.2.0
grpcio==1.16.0
h5py==2.8.0
ipykernel==5.1.0
ipython==7.1.1
ipython-genutils==0.2.0
ipywidgets==7.4.2
jedi==0.13.1
Jinja2==2.10
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==6.0.0
jupyter-core==4.4.0
Keras-Applications==1.0.6
Keras-Preprocessing==1.0.5
kiwisolver==1.0.1
Markdown==3.0.1
MarkupSafe==1.1.0
matplotlib==3.0.1
mistune==0.8.4
nbconvert==5.4.0
nbformat==4.4.0
notebook==5.7.0
numpy==1.15.4
OpenNMT-tf==1.21.6
pandas==0.23.4
pandocfilters==1.4.2
parso==0.3.1
pexpect==4.6.0
pickleshare==0.7.5
Pillow==5.3.0
prometheus-client==0.4.2
prompt-toolkit==2.0.7
protobuf==3.6.1
ptyprocess==0.6.0
pycurl==7.43.0
Pygments==2.2.0
pygobject==3.20.0
pyonmttok==1.11.0
pyparsing==2.3.0
python-apt==1.1.0b1+ubuntu0.16.4.2
python-dateutil==2.7.5
pytz==2018.7
PyYAML==5.1
pyzmq==17.1.2
qtconsole==4.4.2
rouge==0.3.1
sacrebleu==1.2.20
scikit-learn==0.20.0
scipy==1.1.0
Send2Trash==1.5.0
six==1.11.0
sklearn==0.0
tensorboard==1.12.0
tensorflow-gpu==1.12.0
termcolor==1.1.0
terminado==0.8.1
testpath==0.4.2
tornado==5.1.1
traitlets==4.3.2
typing==3.6.6
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.14.1
widgetsnbextension==3.4.2

Thanks in advance

Everything looks in order. I will need to check the code again. It’s not the first time this float vs. double issue has been reported.

Thanks Guillaume for looking into it; it would be great to know what’s happening there.

Meanwhile, as a kind of hack to get domain adaptation working, do you think this could work:

Train a general-domain model with:

  • A vocabulary derived from BOTH the general-domain data and the new domain-specific data, so the model has knowledge of all possible vocabulary (see the sketch below)
  • Training data that is JUST the general-domain data, until performance is good

And finally, as a hacky adaptation step, continue training on JUST the domain-specific data.
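
Concretely, I imagine building the joint vocabulary the same way as in steps 2 and 3 above, just over both corpora concatenated (general.en_fr.train.en is a hypothetical name for my original general-domain training file; same idea for the French side):

# Derive one vocabulary from the combined general-domain and in-domain data,
# using the already-trained SentencePiece model.
cat general.en_fr.train.en new.en_fr.train.en \
    | spm_encode --model=enfr.model --generate_vocabulary \
    | cut -f 1 > joint.vocab.en
sed -i '1i <blank>' joint.vocab.en   # prepend <blank> as in step 3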

I might try this workaround, since this checkpoint reload failure seems related to vocabulary updates (I can continue training from checkpoints if nothing changes). Does this sound somewhat sensible?

For reference, the issue appears when updating the vocabulary of an averaged checkpoint. It’s a bug, but in the meantime just update a non-averaged checkpoint, continue training, and then average.
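
Roughly, that order of operations would look like this (directory names are just examples, and the onmt-average-checkpoints options are from memory, so check --help):

# 1. Update the vocabulary of a regular, non-averaged checkpoint
onmt-update-vocab --model experiments/fr_en_v2/ \
    --output_dir=experiments/fr_en_v2/updated \
    --src_vocab=enfr.vocab \
    --tgt_vocab=enfr.vocab \
    --new_src_vocab=new.fr_en.vocab.en \
    --new_tgt_vocab=new.fr_en.vocab.fr

# 2. Continue training on the in-domain data, with model_dir in the config
#    pointing at the updated directory
onmt-main train_and_eval --model_type Transformer \
    --config experiments/fr_en_v2/updated/config.yml --auto_config

# 3. Average the last checkpoints once fine-tuning is done
onmt-average-checkpoints --model_dir experiments/fr_en_v2/updated \
    --output_dir experiments/fr_en_v2/updated/avg --max_count 8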

@guillaumekln I think this bug still exists? I ran into it yesterday and, after searching for a while, ended up here.

Yes, this hasn’t changed.