Getting errors with the train command

Nart · October 30, 2019, 11:31am

I followed the quickstart guide here:
http://opennmt.net/OpenNMT-tf/quickstart.html

This is what I get in the terminal. What seems to be the issue?

(pyenv) nart@Nart-Ubuntu:~/toy-enru$ onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval
INFO:tensorflow:Creating model directory run/
INFO:tensorflow:Using parameters:
data:
  eval_features_file: src-val.txt
  eval_labels_file: tgt-val.txt
  source_vocabulary: src-vocab.txt
  target_vocabulary: tgt-vocab.txt
  train_features_file: src-train.txt
  train_labels_file: tgt-train.txt
eval:
  batch_size: 32
infer:
  batch_size: 32
  length_bucket_width: 5
model_dir: run/
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 500000
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100

2019-10-30 13:31:24.752121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-10-30 13:31:24.855588: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2019-10-30 13:31:24.855718: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (Nart-Ubuntu): /proc/driver/nvidia/version does not exist
2019-10-30 13:31:24.856479: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: FMA
2019-10-30 13:31:24.888138: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3092345000 Hz
2019-10-30 13:31:24.888861: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x49c9b00 executing computations on platform Host. Devices:
2019-10-30 13:31:24.888929: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
WARNING:tensorflow:No checkpoint to restore in run/
INFO:tensorflow:Training on 300000 examples
WARNING:tensorflow:From /home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/summary/summary_iterator.py:68: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
INFO:tensorflow:Accumulate gradients of 9 iterations to reach effective batch size of 25000
WARNING:tensorflow:There is non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
WARNING:tensorflow:From /home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py:253: _EagerTensorBase.cpu (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.identity instead.
INFO:tensorflow:Saved checkpoint run/ckpt-0
WARNING:tensorflow:From /home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2019-10-30 13:33:16.223285: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 614412288 exceeds 10% of system memory.
2019-10-30 13:33:16.561175: W tensorflow/core/framework/op_kernel.cc:1622] OP_REQUIRES failed at matmul_op_fused.cc:154 : Resource exhausted: OOM when allocating tensor with shape[3072,50001] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2019-10-30 13:33:16.636816: W tensorflow/core/common_runtime/base_collective_executor.cc:216] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[3072,50001] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[{{node transformer/self_attention_decoder/dense_96/BiasAdd}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Traceback (most recent call last):
  File "/home/nart/toy-enru/pyenv/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/opennmt/bin/main.py", line 189, in main
    checkpoint_path=args.checkpoint_path)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/opennmt/runner.py", line 205, in train
    export_on_best=eval_config.get("export_on_best"))
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/opennmt/training.py", line 146, in __call__
    for i, (loss, num_words) in enumerate(_forward()):  # pylint: disable=no-value-for-parameter
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/opennmt/data/dataset.py", line 433, in _fun
    outputs = _tf_fun()
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 457, in __call__
    result = self._call(*args, **kwds)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py", line 520, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1823, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1141, in _filtered_call
    self.captured_inputs)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 1224, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py", line 511, in call
    ctx=ctx)
  File "/home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[3072,50001] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[node transformer/self_attention_decoder/dense_96/BiasAdd (defined at home/nart/toy-enru/pyenv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1751) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference__tf_fun_48625]

Function call stack:
_tf_fun

(pyenv) nart@Nart-Ubuntu:~/toy-enru$

guillaumekln · October 30, 2019, 11:40am

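You are running out of memory. The log shows that CUDA failed to initialize, so training runs on the CPU, and allocating a float tensor of shape [3072, 50001] exhausts the system memory.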

See this general recommendation about the system requirements to train an NMT model: http://opennmt.net/FAQ/#what-type-of-computer-do-i-need-to-train-with
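
For a rough sense of scale, here is a small sketch that computes the size of the tensor named in the OOM message. It assumes "type float" means 4-byte float32. The helper name `tensor_bytes` and the smaller 1024 figure are only illustrative assumptions, not recommended settings. The first dimension appears to match the `batch_size: 3072` (in tokens) from the printed configuration, and the second is presumably related to the target vocabulary size.

```python
# Rough size of the tensor named in the OOM message.
# Assumption: "type float" in the log means float32, i.e. 4 bytes per element.
BYTES_PER_FLOAT32 = 4


def tensor_bytes(shape, bytes_per_element=BYTES_PER_FLOAT32):
    """Return the memory needed for a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shape taken from the error message: [3072, 50001].
print(tensor_bytes((3072, 50001)))  # 614412288, the same number as in the
                                    # "Allocation of 614412288 exceeds 10%" warning

# Hypothetical smaller first dimension, for comparison only.
print(tensor_bytes((1024, 50001)))  # 204804096
```

This is only one tensor out of many that are allocated during a training step, so the total memory needed is much higher than this single number.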

Nart · October 31, 2019, 1:08pm

I had a look at the recommendations.
I have this graphics card installed (Sapphire Radeon Nitro RX 470 4GB).
This is the output from the terminal:
description: VGA compatible controller
product: Ellesmere [Radeon RX 470/480/570/570X/580/580X]
vendor: Advanced Micro Devices, Inc. [AMD/ATI]
physical id: 0
bus info: pci@0000:01:00.0
version: cf
width: 64 bits
clock: 33MHz
capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
configuration: driver=amdgpu latency=0
resources: irq:32 memory:e0000000-efffffff memory:f0000000-f01fffff ioport:e000(size=256) memory:fea00000-fea3ffff memory:c0000-dffff