OpenNMT-tf fine-tuning checkpoint failure

nat · April 25, 2019, 3:45pm

I’m unable to start fine-tuning due to a restoring from checkpoint failure. The error I get is:

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double

This looks similar to this issue (https://github.com/OpenNMT/OpenNMT-tf/issues/269) but the discussion has not helped me (error persists whether I do num_gpus 3 or num_gpus 1 or leave the argument out completely, and my environment has not changed.

What I’ve done so far

Trained a general domain English-French Transformer until good, stable performance was reached
Did spm_encode --generate_vocabulary on my new domain English and French files to get the new vocabularies
Tidied these new vocabularies (took first column using cut -f 1, inserted a <blank> as first line of both vocabulary files)
Ran an onmt-update-vocab command:

onmt-update-vocab --model experiments/fr_en_v2/avg/ \
--output_dir=experiments/fr_en_v2/avg/updated \
--src_vocab=enfr.vocab \
--tgt_vocab=enfr.vocab \
--new_src_vocab=new.fr_en.vocab.en \
--new_tgt_vocab=new.fr_en.vocab.fr

Applied previously-trained SentencePiece model to tokenize the new train/dev/test data
trained SentencePiece model to in-domain train, dev and test data e.g. for the training files:

spm_encode --model=enfr.model < new.en_fr.train.en > new.en_fr.train.en.token
spm_encode --model=enfr.model < new.en_fr.train.fr > new.en_fr.train.fr.token

Made a new config file for the fine tuning experiment run:

model_dir: experiments/fr_en_v2/avg/updated

data:
  train_features_file: new.en_fr.train.en.token
  train_labels_file: new.en_fr.train.fr.token
  eval_features_file: new.en_fr.dev.en.token
  eval_labels_file: new.en_fr.dev.fr.token
  source_words_vocabulary: new.fr_en.vocab.en
  target_words_vocabulary: new.fr_en.vocab.fr

train:
  save_checkpoints_steps: 1000

eval:
  eval_delay: 60  # Every min
  save_eval_predictions: True
  external_evaluators: BLEU

infer:
  batch_size: 32

Attempted to kick off the fine tuning run using the command

CUDA_VISIBLE_DEVICES=0,1,2 onmt-main train_and_eval \
                            --model_type Transformer \
                            --config experiments/fr_en_v2/avg/updated/config.yml --auto_config \
                            --num_gpus 3

Full terminal output

The full terminal output and error message, for info, is:

CUDA_VISIBLE_DEVICES=0,1,2 onmt-main train_and_eval \
>                             --model_type Transformer \
>                             --config experiments/fr_en_v2/avg/updated/config.yml --auto_config \
>                             --num_gpus 3
/usr/local/lib/python3.5/dist-packages/opennmt/config.py:139: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  subconfig = yaml.load(config_file.read())
WARNING:tensorflow:You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.
INFO:tensorflow:Using parameters:
data:
  eval_features_file: new.en_fr.dev.en.token
  eval_labels_file: new.en_fr.dev.fr.token
  source_words_vocabulary: new.fr_en.vocab.en
  target_words_vocabulary: new.fr_en.vocab.fr
  train_features_file: new.en_fr.train.en.token
  train_labels_file: new.en_fr.train.fr.token
eval:
  batch_size: 32
  eval_delay: 60
  exporters: last
  external_evaluators: BLEU
  save_eval_predictions: true
infer:
  batch_size: 32
  bucket_width: 5
model_dir: experiments/fr_en_v2/avg/updated
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: noam_decay_v2
  label_smoothing: 0.1
  learning_rate: 2.0
  length_penalty: 0.6
  optimizer: LazyAdamOptimizer
  optimizer_params:
    beta1: 0.9
    beta2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  bucket_width: 1
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_checkpoints_steps: 500
  save_summary_steps: 100
  train_steps: 500000

INFO:tensorflow:Accumulate gradients of 3 iterations to reach effective batch size of 25000
2019-04-25 15:38:43.476633: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-04-25 15:38:43.577205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076
pciBusID: 0000:01:00.0
totalMemory: 11.92GiB freeMemory: 11.82GiB
2019-04-25 15:38:43.628597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 1 with properties:
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076
pciBusID: 0000:02:00.0
totalMemory: 11.93GiB freeMemory: 11.82GiB
2019-04-25 15:38:43.680793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 2 with properties:
name: GeForce GTX TITAN X major: 5 minor: 2 memoryClockRate(GHz): 1.076
pciBusID: 0000:03:00.0
totalMemory: 11.93GiB freeMemory: 10.80GiB
2019-04-25 15:38:43.681267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2
2019-04-25 15:38:44.591748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-25 15:38:44.591789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2
2019-04-25 15:38:44.591796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y Y
2019-04-25 15:38:44.591802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N Y
2019-04-25 15:38:44.591808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   Y Y N
2019-04-25 15:38:44.593231: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 11435 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2)
2019-04-25 15:38:44.593524: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 11436 MB memory) -> physical GPU (device: 1, name: GeForce GTX TITAN X, pci bus id: 0000:02:00.0, compute capability: 5.2)
2019-04-25 15:38:44.593706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 10447 MB memory) -> physical GPU (device: 2, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0, compute capability: 5.2)
INFO:tensorflow:Using config: {'_master': '', '_experimental_distribute': None, '_keep_checkpoint_max': 8, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_session_config': gpu_options {
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    layout_optimizer: OFF
  }
}
, '_service': None, '_save_checkpoints_secs': None, '_save_summary_steps': 100, '_tf_random_seed': None, '_device_fn': None, '_task_type': 'worker', '_task_id': 0, '_num_worker_replicas': 1, '_log_step_count_steps': 300, '_eval_distribute': None, '_train_distribute': None, '_evaluation_master': '', '_model_dir': 'experiments/fr_en_v2/avg/updated', '_protocol': None, '_num_ps_replicas': 0, '_save_checkpoints_steps': 500, '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7d9e0e2c50>}
INFO:tensorflow:Not using Distribute Coordinator.
INFO:tensorflow:Running training and evaluation locally (non-distributed).
INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 500 or save_checkpoints_secs None.
INFO:tensorflow:Training on 1315535 examples
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Number of trainable parameters: 77511625
INFO:tensorflow:Graph was finalized.
2019-04-25 15:39:29.283239: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2
2019-04-25 15:39:29.283335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-25 15:39:29.283349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 1 2
2019-04-25 15:39:29.283356: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N Y Y
2019-04-25 15:39:29.283363: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1:   Y N Y
2019-04-25 15:39:29.283369: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2:   Y Y N
2019-04-25 15:39:29.284862: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11435 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:01:00.0, compute capability: 5.2)
2019-04-25 15:39:29.285063: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11436 MB memory) -> physical GPU (device: 1, name: GeForce GTX TITAN X, pci bus id: 0000:02:00.0, compute capability: 5.2)
2019-04-25 15:39:29.285389: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10447 MB memory) -> physical GPU (device: 2, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0, compute capability: 5.2)
INFO:tensorflow:Restoring parameters from experiments/fr_en_v2/avg/updated/model.ckpt-270000
2019-04-25 15:39:30.753455: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Invalid argument: tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
	 [[{{node save/RestoreV2_1}} = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1546, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
	 [[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/opennmt/runner.py:297)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

Caused by op 'save/RestoreV2_1', defined at:
  File "/usr/local/bin/onmt-main", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/runner.py", line 297, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
    self._saver.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 789, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 459, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
	 [[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/opennmt/runner.py:297)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/onmt-main", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/runner.py", line 297, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 566, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/session_manager.py", line 288, in prepare_session
    config=config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/session_manager.py", line 218, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1582, in restore
    err, "a mismatch between the current graph and the graph")
tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
	 [[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/opennmt/runner.py:297)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

Caused by op 'save/RestoreV2_1', defined at:
  File "/usr/local/bin/onmt-main", line 11, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.5/dist-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/usr/local/lib/python3.5/dist-packages/opennmt/runner.py", line 297, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 354, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1207, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1241, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1468, in _train_with_estimator_spec
    log_step_count_steps=log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
    self._saver.build()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 789, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 459, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
    shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
    self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

tensor_name = optim/beta1_power; expected dtype float does not equal original dtype double
tensor_name = optim/beta2_power; expected dtype float does not equal original dtype double
	 [[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/opennmt/runner.py:297)  = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_1/tensor_names, save/RestoreV2_1/shape_and_slices)]]

Thanks, any advice you might have would be greatly appreciated.

Cheers,
Natasha

guillaumekln · April 28, 2019, 8:51pm

Hi,

The procedure looks correct. Are you using the latest version of OpenNMT-tf?

nat · April 29, 2019, 8:47am

Thanks,

I’m using quite a recent version, do you think that might be the problem?

OpenNMT-tf==1.21.6
numpy==1.15.4
tensorflow-gpu==1.12.0

I’m running Ubuntu 16.04.5 LTS in a Docker container built from tensorflow:latest-gpu-py3 image.

Full pip freeze output:

absl-py==0.6.1
astor==0.7.1
backcall==0.1.0
bleach==3.0.2
cycler==0.10.0
decorator==4.3.0
defusedxml==0.5.0
entrypoints==0.2.3
gast==0.2.0
grpcio==1.16.0
h5py==2.8.0
ipykernel==5.1.0
ipython==7.1.1
ipython-genutils==0.2.0
ipywidgets==7.4.2
jedi==0.13.1
Jinja2==2.10
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==6.0.0
jupyter-core==4.4.0
Keras-Applications==1.0.6
Keras-Preprocessing==1.0.5
kiwisolver==1.0.1
Markdown==3.0.1
MarkupSafe==1.1.0
matplotlib==3.0.1
mistune==0.8.4
nbconvert==5.4.0
nbformat==4.4.0
notebook==5.7.0
numpy==1.15.4
OpenNMT-tf==1.21.6
pandas==0.23.4
pandocfilters==1.4.2
parso==0.3.1
pexpect==4.6.0
pickleshare==0.7.5
Pillow==5.3.0
prometheus-client==0.4.2
prompt-toolkit==2.0.7
protobuf==3.6.1
ptyprocess==0.6.0
pycurl==7.43.0
Pygments==2.2.0
pygobject==3.20.0
pyonmttok==1.11.0
pyparsing==2.3.0
python-apt==1.1.0b1+ubuntu0.16.4.2
python-dateutil==2.7.5
pytz==2018.7
PyYAML==5.1
pyzmq==17.1.2
qtconsole==4.4.2
rouge==0.3.1
sacrebleu==1.2.20
scikit-learn==0.20.0
scipy==1.1.0
Send2Trash==1.5.0
six==1.11.0
sklearn==0.0
tensorboard==1.12.0
tensorflow-gpu==1.12.0
termcolor==1.1.0
terminado==0.8.1
testpath==0.4.2
tornado==5.1.1
traitlets==4.3.2
typing==3.6.6
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.14.1
widgetsnbextension==3.4.2

Thanks in advance

guillaumekln · April 29, 2019, 7:54pm

Everything looks in order. I will need to check the code again. It’s not the first time this float vs double issue is reported.

nat · May 1, 2019, 2:30pm

Thanks Guillaume for looking into it, it would be great to know what’s happening in there.

Meanwhile, as a kind of hack to get domain adaptation to work, do you think this could work:

Train a general-domain model with:

Vocabulary derived from BOTH the general domain data and new domain-specific data, so the model has knowledge of all possible vocabulary
Use training data that is JUST the general-domain data, until performance is good

And finally, as a hacky adaptation step, continue training on JUST the domain-specific data.

I might try this workaround since this checkpoint reload failure seems related to vocabulary updates (I can continue training from checkpoints if nothing changes), does this sound somewhat sensible?

guillaumekln · May 16, 2019, 4:55pm

For reference, the issue appears when updating the vocabulary of an averaged checkpoint. It’s a bug but in the meantime just update a non averaged checkpoint, continue training, and then average.

mayub · August 4, 2019, 2:04am

@guillaumekln I think this bug is still exists ? I ran into this yesterday, after searching for a while ended up here.

guillaumekln · August 5, 2019, 7:45am

Yes, this hasn’t changed.