Please see the following log. It seems to me that the error always occurs during the “Restoring parameters from …checkpoint” step.
INFO:tensorflow:SavedModel written to: Model/Transformer/2002756/export/latest/temp-b’1544710208’/saved_model.pb
INFO:tensorflow:global_step/sec: 0.40981
INFO:tensorflow:loss = 9.805884, step = 40 (24.402 sec)
INFO:tensorflow:words_per_sec/features: 810.6
INFO:tensorflow:words_per_sec/labels: 725.114
INFO:tensorflow:Saving checkpoints for 50 into Model/Transformer/2002756/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-13-14:10:34
INFO:tensorflow:Graph was finalized.
2018-12-13 15:10:35.134104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-13 15:10:35.134148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-13 15:10:35.134155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-13 15:10:35.134160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-13 15:10:35.134283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9979 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from Model/Transformer/2002756/model.ckpt-50
2018-12-13 15:10:35.673487: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Data loss: Checksum does not match: stored 4233023548 vs. calculated on the restored bytes 3211831201
Traceback (most recent call last):
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1334, in _do_call
return fn(*args)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 4233023548 vs. calculated on the restored bytes 3211831201
[[{{node save/RestoreV2}} = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[{{node save/RestoreV2/_393}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name=“edge_397_save/RestoreV2”, tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File “/home/XXX/miniconda3/envs/ML-DL/bin/onmt-main”, line 11, in <module>
sys.exit(main())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/bin/main.py”, line 161, in main
runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/runner.py”, line 227, in train_and_evaluate
tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 471, in train_and_evaluate
return executor.run()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 610, in run
return self.run_local()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 711, in run_local
saving_listeners=saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1241, in _train_model_default
saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 671, in run
run_metadata=run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1156, in run
run_metadata=run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1255, in run
raise six.reraise(*original_exc_info)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/six.py”, line 693, in reraise
raise value
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1240, in run
return self._sess.run(*args, **kwargs)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1320, in run
run_metadata=run_metadata))
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py”, line 582, in after_run
if self._save(run_context.session, global_step):
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py”, line 607, in _save
if l.after_save(session, step):
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 517, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 537, in _evaluate
self._evaluator.evaluate_and_export())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 912, in evaluate_and_export
hooks=self._eval_spec.hooks)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 478, in evaluate
return _evaluate()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 467, in _evaluate
output_dir=self.eval_dir(name))
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1591, in _evaluate_run
config=self._session_config)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/evaluation.py”, line 271, in _evaluate_once
session_creator=session_creator, hooks=hooks) as session:
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1112, in _create_session
return self._sess_creator.create_session()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 566, in create_session
init_fn=self._scaffold.init_fn)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py”, line 288, in prepare_session
config=config)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py”, line 202, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1546, in restore
{self.saver_def.filename_tensor_name: save_path})
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 929, in run
run_metadata_ptr)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1152, in _run
feed_dict_tensor, options, run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1328, in _do_run
run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 4233023548 vs. calculated on the restored bytes 3211831201
[[node save/RestoreV2 (defined at /home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/runner.py:227) = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[{{node save/RestoreV2/_393}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name=“edge_397_save/RestoreV2”, tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
Caused by op ‘save/RestoreV2’, defined at:
File “/home/XXX/miniconda3/envs/ML-DL/bin/onmt-main”, line 11, in <module>
sys.exit(main())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/bin/main.py”, line 161, in main
runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/runner.py”, line 227, in train_and_evaluate
tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 471, in train_and_evaluate
return executor.run()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 610, in run
return self.run_local()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 711, in run_local
saving_listeners=saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1241, in _train_model_default
saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 671, in run
run_metadata=run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1156, in run
run_metadata=run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1240, in run
return self._sess.run(*args, **kwargs)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1320, in run
run_metadata=run_metadata))
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py”, line 582, in after_run
if self._save(run_context.session, global_step):
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py”, line 607, in _save
if l.after_save(session, step):
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 517, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 537, in _evaluate
self._evaluator.evaluate_and_export())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 912, in evaluate_and_export
hooks=self._eval_spec.hooks)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 478, in evaluate
return _evaluate()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 467, in _evaluate
output_dir=self.eval_dir(name))
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1591, in _evaluate_run
config=self._session_config)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/evaluation.py”, line 271, in _evaluate_once
session_creator=session_creator, hooks=hooks) as session:
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1112, in _create_session
return self._sess_creator.create_session()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 557, in create_session
self._scaffold.finalize()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 213, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 886, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1102, in __init__
self.build()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1114, in build
self._build(self._filename, build_save=True, build_restore=True)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1151, in _build
build_save=build_save, build_restore=build_restore)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 789, in _build_internal
restore_sequentially, reshape)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 459, in _AddShardedRestoreOps
name=“restore_shard”))
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 406, in _AddRestoreOps
restore_sequentially)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 862, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py”, line 1466, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py”, line 787, in _apply_op_helper
op_def=op_def)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py”, line 488, in new_func
return func(*args, **kwargs)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/framework/ops.py”, line 3274, in create_op
op_def=op_def)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/framework/ops.py”, line 1770, in __init__
self._traceback = tf_stack.extract_stack()
DataLossError (see above for traceback): Checksum does not match: stored 4233023548 vs. calculated on the restored bytes 3211831201
[[node save/RestoreV2 (defined at /home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/runner.py:227) = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[{{node save/RestoreV2/_393}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name=“edge_397_save/RestoreV2”, tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
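
Since the message says the stored checksum differs from the one calculated on the restored bytes, one thing that could be checked outside TensorFlow is whether repeated reads of the same checkpoint file return identical bytes. The following is only a minimal sketch using the Python standard library. The function names `file_digest` and `reads_are_stable` and the shard file name in the usage comment are hypothetical and are not part of OpenNMT-tf or TensorFlow. It does not verify TensorFlow's own per-tensor checksums; it only detects a file whose contents change between reads, for example because it is still being written or because the storage is unreliable.

```python
import hashlib
import time


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reads_are_stable(path, attempts=3, delay_sec=0.0):
    """Hash the same file several times and report whether all reads agree.

    Returns a tuple (stable, digests): `stable` is True when every attempt
    produced the same digest, and `digests` lists the digest of each attempt.
    """
    digests = []
    for _ in range(attempts):
        digests.append(file_digest(path))
        time.sleep(delay_sec)
    return len(set(digests)) == 1, digests


# Hypothetical usage (the shard file name below is an assumption):
# stable, digests = reads_are_stable(
#     "Model/Transformer/2002756/model.ckpt-50.data-00000-of-00001",
#     attempts=5, delay_sec=1.0)
# print(stable, digests)
```

If the digests differ between attempts, the problem is more likely in how the file is written or stored than in the model itself. If they always agree, this check alone does not rule out corruption that happened before the first read.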