OpenNMT Forum

DataLossError: Checksum does not match

opennmt-tf

#1

Under Ubuntu 16.04 with Anaconda Python 3.6, training always fails with "DataLossError: Checksum does not match" after a variable number of steps.

I found related issues, in particular this one: https://github.com/tensorflow/tensorflow/issues/13463. Clearing the system memory cache does not solve it. The issue seems to be related to compression.

It’s very frustrating to have OpenNMT available but not be able to use it. I know this is not an OpenNMT issue, but is there already a solution or a workaround?

By the way, on my Windows 10 machine everything works fine, ironically. (The only problem is that I don’t have enough GPU memory on that machine.)


(Guillaume Klein) #2

So you are training with non-text inputs, right?

It is not a compression issue, as OpenNMT-tf does not use compression.

The issue is memory corruption. To mitigate it, you can set sample_buffer_size to a lower value so that records stay in memory for less time, which reduces the likelihood of corruption.


#3

Hi, I am training with tokenized text inputs (translation task).

I tried setting sample_buffer_size to 0 or 1, and I still get the same DataLossError (checksum) issue…

I am not sure what to do next… On the TensorFlow site, one user said he ran memtest, found faulty memory, and solved the issue by replacing that memory module…


(Guillaume Klein) #4

Could you share the full error log?

I think there is indeed something wrong with your system. We have never seen this issue on text data across hundreds of training runs.


#5

Please see the following log. It seems to always happen during “Restoring parameters from …checkpoint”.

INFO:tensorflow:SavedModel written to: Model/Transformer/2002756/export/latest/temp-b’1544710208’/saved_model.pb
INFO:tensorflow:global_step/sec: 0.40981
INFO:tensorflow:loss = 9.805884, step = 40 (24.402 sec)
INFO:tensorflow:words_per_sec/features: 810.6
INFO:tensorflow:words_per_sec/labels: 725.114
INFO:tensorflow:Saving checkpoints for 50 into Model/Transformer/2002756/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-13-14:10:34
INFO:tensorflow:Graph was finalized.
2018-12-13 15:10:35.134104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-12-13 15:10:35.134148: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-13 15:10:35.134155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-13 15:10:35.134160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-13 15:10:35.134283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9979 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:41:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from Model/Transformer/2002756/model.ckpt-50
2018-12-13 15:10:35.673487: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Data loss: Checksum does not match: stored 4233023548 vs. calculated on the restored bytes 3211831201
Traceback (most recent call last):
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1334, in _do_call
return fn(*args)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 4233023548 vs. calculated on the restored bytes 3211831201
[[{{node save/RestoreV2}} = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[{{node save/RestoreV2/_393}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name=“edge_397_save/RestoreV2”, tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File “/home/XXX/miniconda3/envs/ML-DL/bin/onmt-main”, line 11, in <module>
sys.exit(main())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/bin/main.py”, line 161, in main
runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/runner.py”, line 227, in train_and_evaluate
tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 471, in train_and_evaluate
return executor.run()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 610, in run
return self.run_local()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 711, in run_local
saving_listeners=saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1241, in _train_model_default
saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 671, in run
run_metadata=run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1156, in run
run_metadata=run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1255, in run
raise six.reraise(*original_exc_info)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/six.py”, line 693, in reraise
raise value
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1240, in run
return self._sess.run(*args, **kwargs)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1320, in run
run_metadata=run_metadata))
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py”, line 582, in after_run
if self._save(run_context.session, global_step):
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py”, line 607, in _save
if l.after_save(session, step):
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 517, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 537, in _evaluate
self._evaluator.evaluate_and_export())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 912, in evaluate_and_export
hooks=self._eval_spec.hooks)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 478, in evaluate
return _evaluate()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 467, in _evaluate
output_dir=self.eval_dir(name))
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1591, in _evaluate_run
config=self._session_config)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/evaluation.py”, line 271, in _evaluate_once
session_creator=session_creator, hooks=hooks) as session:
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1112, in _create_session
return self._sess_creator.create_session()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 566, in create_session
init_fn=self._scaffold.init_fn)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py”, line 288, in prepare_session
config=config)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py”, line 202, in _restore_checkpoint
saver.restore(sess, checkpoint_filename_with_path)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1546, in restore
{self.saver_def.filename_tensor_name: save_path})
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 929, in run
run_metadata_ptr)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1152, in _run
feed_dict_tensor, options, run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1328, in _do_run
run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/client/session.py”, line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 4233023548 vs. calculated on the restored bytes 3211831201
[[node save/RestoreV2 (defined at /home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/runner.py:227) = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[{{node save/RestoreV2/_393}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name=“edge_397_save/RestoreV2”, tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Caused by op ‘save/RestoreV2’, defined at:
File “/home/XXX/miniconda3/envs/ML-DL/bin/onmt-main”, line 11, in <module>
sys.exit(main())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/bin/main.py”, line 161, in main
runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/runner.py”, line 227, in train_and_evaluate
tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 471, in train_and_evaluate
return executor.run()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 610, in run
return self.run_local()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 711, in run_local
saving_listeners=saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 354, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1207, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1241, in _train_model_default
saving_listeners)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1471, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 671, in run
run_metadata=run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1156, in run
run_metadata=run_metadata)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1240, in run
return self._sess.run(*args, **kwargs)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1320, in run
run_metadata=run_metadata))
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py”, line 582, in after_run
if self._save(run_context.session, global_step):
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py”, line 607, in _save
if l.after_save(session, step):
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 517, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 537, in _evaluate
self._evaluator.evaluate_and_export())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/training.py”, line 912, in evaluate_and_export
hooks=self._eval_spec.hooks)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 478, in evaluate
return _evaluate()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 467, in _evaluate
output_dir=self.eval_dir(name))
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py”, line 1591, in _evaluate_run
config=self._session_config)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/evaluation.py”, line 271, in _evaluate_once
session_creator=session_creator, hooks=hooks) as session:
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 1112, in _create_session
return self._sess_creator.create_session()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 557, in create_session
self._scaffold.finalize()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py”, line 213, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 886, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1102, in __init__
self.build()
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1114, in build
self._build(self._filename, build_save=True, build_restore=True)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 1151, in _build
build_save=build_save, build_restore=build_restore)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 789, in _build_internal
restore_sequentially, reshape)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 459, in _AddShardedRestoreOps
name=“restore_shard”))
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 406, in _AddRestoreOps
restore_sequentially)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/training/saver.py”, line 862, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py”, line 1466, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py”, line 787, in _apply_op_helper
op_def=op_def)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py”, line 488, in new_func
return func(*args, **kwargs)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/framework/ops.py”, line 3274, in create_op
op_def=op_def)
File “/home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/tensorflow/python/framework/ops.py”, line 1770, in __init__
self._traceback = tf_stack.extract_stack()

DataLossError (see above for traceback): Checksum does not match: stored 4233023548 vs. calculated on the restored bytes 3211831201
[[node save/RestoreV2 (defined at /home/XXX/miniconda3/envs/ML-DL/lib/python3.6/site-packages/opennmt/runner.py:227) = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, …, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
[[{{node save/RestoreV2/_393}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name=“edge_397_save/RestoreV2”, tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]


(Guillaume Klein) #6

Do you have enough disk space?


#7

I have 40GB of free space. I use the TransformerBig model, keep 6 checkpoints, and average 5 checkpoints. The space should be enough. (Previously, I tried on a disk with 1TB of space.)

I have now deleted the previous model folder and tried keeping only 1 checkpoint, without checkpoint averaging. It got through 10 rounds of saving/restoring checkpoints but failed on the 11th.


(Guillaume Klein) #8

Did you check the checkpoint size? For example, is the last checkpoint smaller than the others?
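One way to compare checkpoint sizes is sketched below. It assumes the usual TensorFlow file naming (model.ckpt-<step>.index, .meta and .data-XXXXX-of-YYYYY) inside the model directory; the helper names checkpoint_sizes and report are made up for this example and are not part of OpenNMT-tf or TensorFlow.

```python
import glob
import os
import re
from collections import defaultdict


def checkpoint_sizes(model_dir):
    """Return {step: total size in bytes of that checkpoint's files}."""
    sizes = defaultdict(int)
    # Assumed naming: model.ckpt-<step>.index, .meta, .data-XXXXX-of-YYYYY.
    for path in glob.glob(os.path.join(model_dir, "model.ckpt-*")):
        match = re.match(r"model\.ckpt-(\d+)\.", os.path.basename(path))
        if match:
            sizes[int(match.group(1))] += os.path.getsize(path)
    return dict(sizes)


def report(model_dir):
    """Print one line per checkpoint step, in increasing step order."""
    sizes = checkpoint_sizes(model_dir)
    for step in sorted(sizes):
        print("step %d: %d bytes" % (step, sizes[step]))
    return sizes
```

For the run in the log above this would be called as report("Model/Transformer/2002756"). A last checkpoint that is clearly smaller than the others would point to an interrupted write; equal sizes do not prove the contents are intact.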


#9

The last checkpoint has exactly the same size as the previous ones. More strangely, when I remove the last checkpoint, edit the “checkpoint” file to remove the last checkpoint entry, and relaunch the training, I get the same DataLossError/checksum issue. This doesn’t make any sense, because the second-to-last checkpoint must have been OK; otherwise training would have failed earlier…


Update:

After removing the last checkpoint and its entry and then rebooting, the training can continue. Without rebooting, it can’t. I am not sure how far the training will get after rebooting, though…
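If the same files become readable again after a reboot, the bytes on disk are probably fine and the bytes being read back are what differ. A rough way to test that, sketched below, is to hash the checkpoint files right after they are saved and hash them again when a restore fails (and once more after rebooting). This uses SHA-256 from the Python standard library, not TensorFlow's own checksum, and the function names and the JSON manifest file are assumptions made for this example.

```python
import hashlib
import json
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_hashes(paths, manifest_path):
    """Store {file name: SHA-256} for the given files in a JSON manifest."""
    manifest = {os.path.basename(p): sha256_of(p) for p in paths}
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest


def changed_files(paths, manifest_path):
    """Return the files whose current hash differs from the recorded one."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    return [p for p in paths if manifest.get(os.path.basename(p)) != sha256_of(p)]
```

If changed_files reports a file before the reboot and nothing after it, while the file itself was not rewritten in between, that would be consistent with corruption in memory or in cached reads rather than on disk. Note that a hash taken immediately after saving may itself be computed from cached data, so this is only a hint, not a proof.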


#10

After reinstalling/updating the NVIDIA driver, CUDA, and cuDNN, the problem seems to be gone. (I am not sure it is fixed, but training has passed 5000 checkpoints without a DataLossError and is still running.)