My training data is about 30M sentences, with a 32,000-token source/target vocabulary. I’m running this on an AWS p2.xlarge with one GPU (11 GB of memory). I have already tokenized the data with SentencePiece. When evaluation starts after a checkpoint is saved (save_checkpoints_steps: 2000 in the config file), training crashes at the first step of eval. Any suggestions?
- I’m using the following training command:
onmt-main train_and_eval --config parameters.yaml --model_type TransformerAAN --num_gpus 1 --seed 42 --gpu_allow_growth
- Here is my parameters.yaml configuration file:
model_dir: es_en_model
data:
  train_features_file: /opt/data/es_en/es_en.train.source
  train_labels_file: /opt/data/es_en/es_en.train.target
  train_alignments: /opt/data/es_en/es_en.forward.align.train
  eval_features_file: /opt/data/es_en/es_en.eval.source
  eval_labels_file: /opt/data/es_en/es_en.eval.target
  source_words_vocabulary: /opt/data/es_en/es_en.es.onmt.vocab
  target_words_vocabulary: /opt/data/es_en/es_en.en.onmt.vocab
params:
  optimizer: AdamOptimizer
  learning_rate: 0.0002
  clip_gradients: 5.0
  regularization:
    type: l2
    scale: 1e-4
  weight_decay: 0.01
  average_loss_in_time: true
  decay_type: exponential_decay
  decay_params:
    decay_rate: 0.7
    decay_steps: 50000
  decay_step_duration: 1
  start_decay_steps: 100000
  minimum_learning_rate: 0.0001
  maximum_learning_rate: 1e6
  guided_alignment_type: ce
  guided_alignment_weight: 1
train:
  batch_size: 32
  batch_type: examples
  effective_batch_size: 32
  save_checkpoints_steps: 2000
  save_checkpoints_secs: null
  keep_checkpoint_max: 6
  save_summary_steps: 100
  train_steps: 500000
  single_pass: false
  maximum_features_length: 70
  maximum_labels_length: 70
  bucket_width: 5
  num_threads: 4
  prefetch_buffer_size: null
  sample_buffer_size: 500000
  average_last_checkpoints: 6
eval:
  batch_size: 16
  num_threads: 4
  prefetch_buffer_size: 1
  maximum_features_length: 70
  maximum_labels_length: 70
  eval_delay: 7200
  save_eval_predictions: true
  external_evaluators: sacreBLEU
  exporters: best
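The config sets maximum_features_length: 70 under both train and eval, but whether the eval setting is actually applied is not confirmed here. One way to rule out a stray oversized example is to scan the evaluation files directly. The following is only a minimal sketch: the helper name find_long_lines is hypothetical, and it assumes one sentence per line with whitespace-separated SentencePiece tokens. The demo writes a throwaway file instead of reading the real /opt/data/es_en/es_en.eval.source.

```python
import tempfile


def find_long_lines(path, max_tokens=70):
    """Return (line_number, token_count) for every line longer than max_tokens.

    Assumes the file is already tokenized, with tokens separated by whitespace.
    """
    long_lines = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            token_count = len(line.split())
            if token_count > max_tokens:
                long_lines.append((line_number, token_count))
    return long_lines


# Demo on a temporary file: one short line and one very long line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("▁a ▁short ▁line\n")
    tmp.write(" ".join(["▁tok"] * 7475) + "\n")
    demo_path = tmp.name

demo_result = find_long_lines(demo_path)
print(demo_result)  # [(2, 7475)]
```

To check the real data, the same function could be pointed at the eval_features_file and eval_labels_file paths from the config.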
- Here is the end of the training log, including the error:
INFO:tensorflow:loss = 11.1917, step = 5800 (46.043 sec)
INFO:tensorflow:source_words/sec: 1955
INFO:tensorflow:target_words/sec: 1952
INFO:tensorflow:global_step/sec: 2.19816
INFO:tensorflow:loss = 12.558362, step = 5900 (45.493 sec)
INFO:tensorflow:source_words/sec: 1938
INFO:tensorflow:target_words/sec: 1939
INFO:tensorflow:Saving checkpoints for 6000 into es_en_model/model.ckpt.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-03-21-17:32:22
INFO:tensorflow:Graph was finalized.
2019-03-21 17:32:22.728300: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-03-21 17:32:22.728361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-21 17:32:22.728376: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-03-21 17:32:22.728382: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-03-21 17:32:22.728540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
INFO:tensorflow:Restoring parameters from es_en_model/model.ckpt-6000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2019-03-21 17:35:34.305203: W tensorflow/core/common_runtime/bfc_allocator.cc:271] Allocator (GPU_0_bfc) ran out of memory trying to allocate 26.64GiB. Current allocation summary follows.
...
...
...
2019-03-21 17:35:34.330541: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 69740544 totalling 66.51MiB
2019-03-21 17:35:34.330559: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 70780928 totalling 67.50MiB
2019-03-21 17:35:34.330579: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 73934848 totalling 70.51MiB
2019-03-21 17:35:34.330598: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 1 Chunks of size 77088768 totalling 73.52MiB
2019-03-21 17:35:34.330617: I tensorflow/core/common_runtime/bfc_allocator.cc:654] 4 Chunks of size 244940800 totalling 934.38MiB
2019-03-21 17:35:34.330631: I tensorflow/core/common_runtime/bfc_allocator.cc:658] Sum Total of in-use chunks: 3.49GiB
2019-03-21 17:35:34.330647: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Stats:
Limit: 11281553818
InUse: 3746222592
MaxInUse: 5160234496
NumAllocs: 8436392
MaxAllocSize: 734822400
2019-03-21 17:35:34.330746: W tensorflow/core/common_runtime/bfc_allocator.cc:275] ***********************************_________________________________________________________________
2019-03-21 17:35:34.330792: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at batch_matmul_op_impl.h:582 : Resource exhausted: OOM when allocating tensor with shape[16,8,7475,7475] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1292, in _do_call
return fn(*args)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1277, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,8,7475,7475] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node transformer/encoder_1/layer_0/multi_head/MatMul}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/encoder_1/layer_0/multi_head/mul, transformer/encoder_1/layer_0/multi_head/transpose)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/Shape/_1411}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3351_...rdot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopConstantFolding/transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/ListDiff-folded-0/_1289)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/root/anaconda3/bin/onmt-main", line 11, in <module>
sys.exit(main())
File "/root/anaconda3/lib/python3.6/site-packages/opennmt/bin/main.py", line 172, in main
runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
File "/root/anaconda3/lib/python3.6/site-packages/opennmt/runner.py", line 297, in train_and_evaluate
result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 610, in run
return self.run_local()
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 711, in run_local
saving_listeners=saving_listeners)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
saving_listeners)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
run_metadata=run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
raise six.reraise(*original_exc_info)
File "/root/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
return self._sess.run(*args, **kwargs)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1304, in run
run_metadata=run_metadata))
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 581, in after_run
if self._save(run_context.session, global_step):
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 606, in _save
if l.after_save(session, step):
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 517, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 537, in _evaluate
self._evaluator.evaluate_and_export())
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 912, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 476, in evaluate
return _evaluate()
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 469, in _evaluate
output_dir=self.eval_dir(name))
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1528, in _evaluate_run
config=self._session_config)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/evaluation.py", line 212, in _evaluate_once
session.run(eval_ops, feed_dict)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
run_metadata=run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1239, in run
raise six.reraise(*original_exc_info)
File "/root/anaconda3/lib/python3.6/site-packages/six.py", line 693, in reraise
raise value
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
return self._sess.run(*args, **kwargs)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1296, in run
run_metadata=run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1076, in run
return self._sess.run(*args, **kwargs)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 887, in run
run_metadata_ptr)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1110, in _run
feed_dict_tensor, options, run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1286, in _do_run
run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1308, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,8,7475,7475] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node transformer/encoder_1/layer_0/multi_head/MatMul}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/encoder_1/layer_0/multi_head/mul, transformer/encoder_1/layer_0/multi_head/transpose)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/Shape/_1411}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3351_...rdot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopConstantFolding/transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/ListDiff-folded-0/_1289)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Caused by op 'transformer/encoder_1/layer_0/multi_head/MatMul', defined at:
File "/root/anaconda3/bin/onmt-main", line 11, in <module>
sys.exit(main())
File "/root/anaconda3/lib/python3.6/site-packages/opennmt/bin/main.py", line 172, in main
runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
File "/root/anaconda3/lib/python3.6/site-packages/opennmt/runner.py", line 297, in train_and_evaluate
result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 610, in run
return self.run_local()
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 711, in run_local
saving_listeners=saving_listeners)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 356, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
saving_listeners)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1409, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 671, in run
run_metadata=run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1148, in run
run_metadata=run_metadata)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1224, in run
return self._sess.run(*args, **kwargs)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1304, in run
run_metadata=run_metadata))
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 581, in after_run
if self._save(run_context.session, global_step):
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 606, in _save
if l.after_save(session, step):
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 517, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 537, in _evaluate
self._evaluator.evaluate_and_export())
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 912, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 476, in evaluate
return _evaluate()
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 462, in _evaluate
self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1422, in _evaluate_build_graph
self._call_model_fn_eval(input_fn, self.config))
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1458, in _call_model_fn_eval
features, labels, model_fn_lib.ModeKeys.EVAL, config)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1169, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/root/anaconda3/lib/python3.6/site-packages/opennmt/estimator.py", line 201, in _fn
logits, predictions = local_model(features, labels, params, mode)
File "/root/anaconda3/lib/python3.6/site-packages/opennmt/models/model.py", line 88, in __call__
return self._call(features, labels, params, mode)
File "/root/anaconda3/lib/python3.6/site-packages/opennmt/models/sequence_to_sequence.py", line 177, in _call
mode=mode)
File "/root/anaconda3/lib/python3.6/site-packages/opennmt/encoders/self_attention_encoder.py", line 77, in encode
dropout=self.attention_dropout)
File "/root/anaconda3/lib/python3.6/site-packages/opennmt/layers/transformer.py", line 285, in multi_head_attention
dropout=dropout)
File "/root/anaconda3/lib/python3.6/site-packages/opennmt/layers/transformer.py", line 192, in dot_product_attention
dot = tf.matmul(queries, keys, transpose_b=True)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/math_ops.py", line 2015, in matmul
a, b, adj_x=adjoint_a, adj_y=adjoint_b, name=name)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1245, in batch_mat_mul
"BatchMatMul", x=x, y=y, adj_x=adj_x, adj_y=adj_y, name=name)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
op_def=op_def)
File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
self._traceback = tf_stack.extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[16,8,7475,7475] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node transformer/encoder_1/layer_0/multi_head/MatMul}} = BatchMatMul[T=DT_FLOAT, adj_x=false, adj_y=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](transformer/encoder_1/layer_0/multi_head/mul, transformer/encoder_1/layer_0/multi_head/transpose)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/Shape/_1411}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3351_...rdot/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_cloopConstantFolding/transformer/decoder_2/while/layer_1/average_attention/dense/Tensordot/ListDiff-folded-0/_1289)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
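The size in the allocator warning can be cross-checked against the tensor shape in the error. The sketch below assumes the shape [16,8,7475,7475] reads as [batch, heads, length, length] for the encoder self-attention scores (16 matches the eval batch_size; the other two readings are guesses based on the op name), with 4 bytes per float32 element. The helper name tensor_bytes is hypothetical.

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for a dense tensor of the given shape (float32 by default)."""
    return math.prod(shape) * bytes_per_element


# Shape reported in the ResourceExhaustedError.
shape = (16, 8, 7475, 7475)
size = tensor_bytes(shape)
size_gib = size / 2**30

# "Limit" reported in the bfc_allocator stats, in bytes.
allocator_limit = 11281553818

print(size)                    # 28608320000
print(f"{size_gib:.2f} GiB")   # 26.64 GiB
print(size > allocator_limit)  # True
```

The result agrees with the 26.64GiB allocation in the log. Under the assumed reading of the shape, the size grows with the square of the sequence length, so a single sequence of 7475 tokens in an evaluation batch would be enough to exceed the GPU memory on its own.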