I followed the quickstart guide here:
http://opennmt.net/OpenNMT-tf/quickstart.html
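My data.yml is basically the one from the quickstart, just pointing at my own English-Russian files, so roughly this (the file names match what the training log prints below):

    model_dir: run/

    data:
      source_vocabulary: src-vocab.txt
      target_vocabulary: tgt-vocab.txt
      train_features_file: src-train.txt
      train_labels_file: tgt-train.txt
      eval_features_file: src-val.txt
      eval_labels_file: tgt-val.txt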
This is what I'm getting in the terminal when I start training. What seems to be the issue?
(pyenv) nart@Nart-Ubuntu:~/toy-enru$ onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval
INFO:tensorflow:Creating model directory run/
INFO:tensorflow:Using parameters:
data:
  eval_features_file: src-val.txt
  eval_labels_file: tgt-val.txt
  source_vocabulary: src-vocab.txt
  target_vocabulary: tgt-vocab.txt
  train_features_file: src-train.txt
  train_labels_file: tgt-train.txt
eval:
  batch_size: 32
infer:
  batch_size: 32
  length_bucket_width: 5
model_dir: run/
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 500000
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100
2019-10-30 13:31:24.752121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-10-30 13:31:24.855588: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2019-10-30 13:31:24.855718: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (Nart-Ubuntu): /proc/driver/nvidia/version does not exist
2019-10-30 13:31:24.856479: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: FMA
2019-10-30 13:31:24.888138: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3092345000 Hz
2019-10-30 13:31:24.888861: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x49c9b00 executing computations on platform Host. Devices:
2019-10-30 13:31:24.888929: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:No checkpoint to restore in run/
INFO:tensorflow:Training on 300000 examples
WARNING:tensorflow:From /home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/summary/summary_iterator.py:68: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
INFO:tensorflow:Accumulate gradients of 9 iterations to reach effective batch size of 25000
WARNING:tensorflow:There is non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
WARNING:tensorflow:From /home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py:253: _EagerTensorBase.cpu (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.identity instead.
INFO:tensorflow:Saved checkpoint run/ckpt-0
WARNING:tensorflow:From /home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2019-10-30 13:33:16.223285: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 614412288 exceeds 10% of system memory.
2019-10-30 13:33:16.561175: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at matmul_op_fused.cc:154 : Resource exhausted: OOM when allocating tensor with shape[3072,50001] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2019-10-30 13:33:16.636816: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[3072,50001] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node transformer/self_attention_decoder/dense_96/BiasAdd}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
Traceback (most recent call last):
  File "/home/nart/toy-enru/pyenv/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/opennmt/bin/main.py", line 189, in main
    checkpoint_path=args.checkpoint_path)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/opennmt/runner.py", line 205, in train
    export_on_best=eval_config.get("export_on_best"))
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/opennmt/training.py", line 146, in __call__
    for i, (loss, num_words) in enumerate(_forward()): # pylint: disable=no-value-for-parameter
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/opennmt/data/dataset.py", line 433, in _fun
    outputs = _tf_fun()
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 520, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[3072,50001] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
	 [[node transformer/self_attention_decoder/dense_96/BiasAdd (defined at home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference__tf_fun_48625]
Function call stack:
_tf_fun
(pyenv) nart@Nart-Ubuntu:~/toy-enru$
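If I read the log right, the NVIDIA kernel driver isn't loaded (cuInit fails and /proc/driver/nvidia/version doesn't exist), so everything runs on the CPU, and the --auto_config training values (batch_size: 3072 tokens, effective_batch_size: 25000) then run out of memory on the [3072, 50001] output-projection tensor. Assuming values in data.yml take priority over --auto_config (which is how I understand the docs), would shrinking the batch size like this be a reasonable workaround for CPU-only training? Just a guess on my part:

    train:
      batch_size: 512              # much smaller than the auto_config value of 3072 tokens
      effective_batch_size: 8192   # gradients are still accumulated up to this size

Or is the batch size not the real problem here?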
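Separately, if the GPU is supposed to be usable on this machine, I assume I need to sort out the driver first, since CUDA can't even initialize. I would check with something like:

    nvidia-smi
    lsmod | grep nvidia

and reinstall the NVIDIA driver if those error out or come back empty. Does that sound right, or can I ignore the CUDA messages and just train on the CPU?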