GPU out of memory: After checkpoint is saved and during first step of eval

spatel · March 21, 2019, 6:10pm

My training data is about 30M sentences, with a 32,000-entry source/target vocab. I’m running this on an AWS p2.xlarge with one GPU (11 GB of memory). I have already tokenized the data using SentencePiece. Evaluation starts after every checkpoint, as set in the config file by save_checkpoints_steps: 2000, and training crashes at the first step of eval. Any suggestions?

I’m using the following training command:

onmt-main train_and_eval --config parameters.yaml --model_type TransformerAAN --num_gpus 1 --seed 42 --gpu_allow_growth

Here is my parameters.yaml file configuration:

model_dir: es_en_model

data:
  train_features_file: /opt/data/es_en/es_en.train.source
  train_labels_file: /opt/data/es_en/es_en.train.target

  train_alignments: /opt/data/es_en/es_en.forward.align.train

  eval_features_file: /opt/data/es_en/es_en.eval.source
  eval_labels_file: /opt/data/es_en/es_en.eval.target

  source_words_vocabulary: /opt/data/es_en/es_en.es.onmt.vocab
  target_words_vocabulary: /opt/data/es_en/es_en.en.onmt.vocab

params:
  optimizer: AdamOptimizer
  learning_rate: 0.0002
  clip_gradients: 5.0
  regularization:
    type: l2
    scale: 1e-4
  weight_decay: 0.01

  average_loss_in_time: true
  decay_type: exponential_decay
  decay_params:
    decay_rate: 0.7
    decay_steps: 50000
  
  decay_step_duration: 1
  start_decay_steps: 100000

  minimum_learning_rate: 0.0001
  maximum_learning_rate: 1e6

  guided_alignment_type: ce
  guided_alignment_weight: 1

train:
  batch_size: 32
  batch_type: examples
  effective_batch_size: 32
  save_checkpoints_steps: 2000
  save_checkpoints_secs: null
  keep_checkpoint_max: 6
  save_summary_steps: 100
  train_steps: 500000
  single_pass: false
  maximum_features_length: 70
  maximum_labels_length: 70
  bucket_width: 5
  num_threads: 4
  prefetch_buffer_size: null
  sample_buffer_size: 500000
  average_last_checkpoints: 6

eval:
  batch_size: 16
  num_threads: 4
  prefetch_buffer_size: 1
  maximum_features_length: 70
  maximum_labels_length: 70
  eval_delay: 7200
  save_eval_predictions: true
  external_evaluators: sacreBLEU
  exporters: best

Here is the last part of the training log, including the error:

INFO:tensorflow:loss = 11.1917, step = 5800 (46.043 sec)
INFO:tensorflow:source_words/sec: 1955
INFO:tensorflow:target_words/sec: 1952
INFO:tensorflow:global_step/sec: 2.19816
INFO:tensorflow:loss = 12.558362, step = 5900 (45.493 sec)
INFO:tensorflow:source_words/sec: 1938
INFO:tensorflow:target_words/sec: 1939
INFO:tensorflow:Saving checkpoints for 6000 into es_en_model/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-03-21-17:32:22
INFO:tensorflow:Graph was finalized.
2019-03-21 17:32:22.728300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-03-21 17:32:22.728361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-21 17:32:22.728376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977]      0 
2019-03-21 17:32:22.728382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0:   N 
2019-03-21 17:32:22.728540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from es_en_model/model.ckpt-6000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2019-03-21 17:35:34.305203: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 26.64GiB.  Current allocation summary follows.

...
...
...
2019-03-21 17:35:34.330541: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 69740544 totalling 66.51MiB
2019-03-21 17:35:34.330559: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 70780928 totalling 67.50MiB
2019-03-21 17:35:34.330579: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 73934848 totalling 70.51MiB
2019-03-21 17:35:34.330598: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 77088768 totalling 73.52MiB
2019-03-21 17:35:34.330617: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 4 Chunks of size 244940800 totalling 934.38MiB
2019-03-21 17:35:34.330631: I tensorflow/core/common_runtime/bfc_allocator.cc:658] Sum Total of in-use chunks: 3.49GiB
2019-03-21 17:35:34.330647: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Stats: 
Limit:                 11281553818
InUse:                  3746222592
MaxInUse:               5160234496
NumAllocs:                 8436392
MaxAllocSize:            734822400

2019-03-21 17:35:34.330746: W tensorflow/core/common_runtime/bfc_allocator.cc:275] ***********************************_________________________________________________________________
2019-03-21 17:35:34.330792: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at batch_matmul_op_impl.h:582 : Resource exhausted: OOM when allocating tensor with shape[16,8,7475,7475] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
    return fn(*args)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,8,7475,7475] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node transformer/encoder_1/layer_0/multi_head/MatMul}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/encoder_1/layer_0/multi_head/mul, transformer/encoder_1/layer_0/multi_head/transpose)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[{{node transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/Shape/_1411}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3351_...rdot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopConstantFolding/transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/ListDiff-folded-0/_1289)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/root/anaconda3/bin/onmt-main", line 11, in <module>
    sys.exit(main())
  File "/root/anaconda3/lib/python3.6/site-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/root/anaconda3/lib/python3.6/site-packages/opennmt/runner.py", line 297, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
    saving_listeners)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
    run_metadata=run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
    raise six.reraise(*original_exc_info)
  File "/root/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
    return self._sess.run(*args, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1304, in run
    run_metadata=run_metadata))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 581, in after_run
    if self._save(run_context.session, global_step):
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 606, in _save
    if l.after_save(session, step):
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 517, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 537, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 912, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 476, in evaluate
    return _evaluate()
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 469, in _evaluate
    output_dir=self.eval_dir(name))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1528, in _evaluate_run
    config=self._session_config)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/evaluation.py", line 212, in _evaluate_once
    session.run(eval_ops, feed_dict)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
    run_metadata=run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
    raise six.reraise(*original_exc_info)
  File "/root/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
    return self._sess.run(*args, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1296, in run
    run_metadata=run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
    return self._sess.run(*args, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
    run_metadata_ptr)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1110, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
    run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,8,7475,7475] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node transformer/encoder_1/layer_0/multi_head/MatMul}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/encoder_1/layer_0/multi_head/mul, transformer/encoder_1/layer_0/multi_head/transpose)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[{{node transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/Shape/_1411}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3351_...rdot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopConstantFolding/transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/ListDiff-folded-0/_1289)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'transformer/encoder_1/layer_0/multi_head/MatMul', defined at:
  File "/root/anaconda3/bin/onmt-main", line 11, in <module>
    sys.exit(main())
  File "/root/anaconda3/lib/python3.6/site-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/root/anaconda3/lib/python3.6/site-packages/opennmt/runner.py", line 297, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 610, in run
    return self.run_local()
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 711, in run_local
    saving_listeners=saving_listeners)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
    saving_listeners)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
    run_metadata=run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
    run_metadata=run_metadata)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
    return self._sess.run(*args, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1304, in run
    run_metadata=run_metadata))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 581, in after_run
    if self._save(run_context.session, global_step):
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 606, in _save
    if l.after_save(session, step):
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 517, in after_save
    self._evaluate(global_step_value)  # updates self.eval_result
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 537, in _evaluate
    self._evaluator.evaluate_and_export())
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 912, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 476, in evaluate
    return _evaluate()
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 462, in _evaluate
    self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1422, in _evaluate_build_graph
    self._call_model_fn_eval(input_fn, self.config))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1458, in _call_model_fn_eval
    features, labels, model_fn_lib.ModeKeys.EVAL, config)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/opennmt/estimator.py", line 201, in _fn
    logits, predictions = local_model(features, labels, params, mode)
  File "/root/anaconda3/lib/python3.6/site-packages/opennmt/models/model.py", line 88, in __call__
    return self._call(features, labels, params, mode)
  File "/root/anaconda3/lib/python3.6/site-packages/opennmt/models/sequence_to_sequence.py", line 177, in _call
    mode=mode)
  File "/root/anaconda3/lib/python3.6/site-packages/opennmt/encoders/self_attention_encoder.py", line 77, in encode
    dropout=self.attention_dropout)
  File "/root/anaconda3/lib/python3.6/site-packages/opennmt/layers/transformer.py", line 285, in multi_head_attention
    dropout=dropout)
  File "/root/anaconda3/lib/python3.6/site-packages/opennmt/layers/transformer.py", line 192, in dot_product_attention
    dot = tf.matmul(queries, keys, transpose_b=True)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 2015, in matmul
    a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1245, in batch_mat_mul
    "BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,8,7475,7475] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node transformer/encoder_1/layer_0/multi_head/MatMul}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/encoder_1/layer_0/multi_head/mul, transformer/encoder_1/layer_0/multi_head/transpose)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[{{node transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/Shape/_1411}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3351_...rdot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopConstantFolding/transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/ListDiff-folded-0/_1289)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
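
As a sanity check on the numbers in the log above, the failed allocation matches a single float32 attention-score tensor of the reported shape [16, 8, 7475, 7475]. The following is a minimal sketch of that arithmetic, assuming 4 bytes per element; the function name is made up for illustration and is not part of OpenNMT-tf or TensorFlow.

```python
def attention_scores_bytes(batch, heads, length, bytes_per_element=4):
    """Bytes needed for a [batch, heads, length, length] attention-score tensor."""
    return batch * heads * length * length * bytes_per_element


# Shape from the OOM message: [16, 8, 7475, 7475], type float (assumed float32).
needed = attention_scores_bytes(batch=16, heads=8, length=7475)

# "Limit" value from the allocator stats printed in the log.
limit = 11281553818

print(f"needed: {needed / 2**30:.2f} GiB")  # the log reports 26.64GiB
print(f"limit:  {limit / 2**30:.2f} GiB")

# For comparison: the same tensor if the longest sequence in the batch had 70 tokens.
small = attention_scores_bytes(batch=16, heads=8, length=70)
print(f"at length 70: {small / 2**20:.2f} MiB")
```

The third dimension of the shape is the sequence length, so at least one sequence in the evaluation batch is 7475 tokens long, and the memory cost grows with the square of that length.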

guillaumekln · March 21, 2019, 6:42pm

You should probably clean up your evaluation files and remove very long sentences. The evaluation dataset is not filtered.
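
Because the evaluation dataset is not filtered by length, the filtering has to be done offline. A minimal sketch of one way to do this, assuming one sentence per line, files that are already tokenized (so whitespace splitting gives the token count), and a hypothetical `filter_parallel` helper that is not part of OpenNMT-tf:

```python
def filter_parallel(src_in, tgt_in, src_out, tgt_out, max_len=70, min_len=1):
    """Copy only the sentence pairs whose token counts lie in [min_len, max_len].

    Tokens are counted by whitespace splitting, which assumes the input is
    already tokenized. The source and target files are read in lockstep, so
    pairs stay aligned. Returns a (kept, dropped) tuple.
    """
    kept = dropped = 0
    with open(src_in, encoding="utf-8") as fs, \
         open(tgt_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as out_s, \
         open(tgt_out, "w", encoding="utf-8") as out_t:
        # zip stops at the shorter file; check beforehand that both files
        # have the same number of lines.
        for src, tgt in zip(fs, ft):
            n_src, n_tgt = len(src.split()), len(tgt.split())
            if min_len <= n_src <= max_len and min_len <= n_tgt <= max_len:
                out_s.write(src.rstrip("\n") + "\n")
                out_t.write(tgt.rstrip("\n") + "\n")
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```

It could be called on the evaluation files from the config, for example `filter_parallel("es_en.eval.source", "es_en.eval.target", "es_en.eval.filtered.source", "es_en.eval.filtered.target")`, where the output file names are placeholders. The default `max_len=70` simply mirrors the training length limit in the config and is an assumption, not a recommendation from this thread.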

spatel · March 22, 2019, 2:10pm

Thanks Guillaume! I cleaned my entire dataset, and now the (min, max) sentence length is (2, 500). This solves the crash. But now, once a checkpoint is saved after save_checkpoints_steps and eval starts, that part takes far too long to finish. Is that because of the eval_delay parameter? My eval set has about 1.5M sentences.

EDIT: My checkpoints are being generated every 7500 sec and my eval_delay is 7200.

guillaumekln · March 22, 2019, 2:21pm

That’s way too much data; I’m not sure what your plan was. An evaluation dataset usually contains no more than 1000-2000 sentences.
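
One way to get down to that size is to draw a random sample of aligned pairs from the existing evaluation files. A minimal sketch, assuming one sentence per line and that both files fit in memory; `sample_parallel` and its defaults are hypothetical, not an OpenNMT-tf utility:

```python
import random


def sample_parallel(src_in, tgt_in, src_out, tgt_out, k=2000, seed=42):
    """Write k randomly chosen aligned sentence pairs (all pairs if fewer exist).

    Both files are loaded into memory as (source, target) tuples so that each
    sampled source line keeps its matching target line. A fixed seed makes the
    selection reproducible. Returns the number of pairs written.
    """
    with open(src_in, encoding="utf-8") as fs, \
         open(tgt_in, encoding="utf-8") as ft:
        pairs = list(zip(fs, ft))

    rng = random.Random(seed)
    chosen = rng.sample(pairs, min(k, len(pairs)))

    with open(src_out, "w", encoding="utf-8") as out_s, \
         open(tgt_out, "w", encoding="utf-8") as out_t:
        for src, tgt in chosen:
            out_s.write(src.rstrip("\n") + "\n")
            out_t.write(tgt.rstrip("\n") + "\n")
    return len(chosen)
```

The resulting files would then be used as eval_features_file and eval_labels_file in the config. Sampling pairs together, rather than sampling each file separately, is what keeps the source and target lines aligned.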

spatel · March 22, 2019, 8:28pm

Aha, I had no idea that we only need 1000-2000 sentences for evaluation with NMT. Anyway, this solved the issue of crashing! Thanks again!