OpenNMT Forum

Tensorflow error with training command

Hi there! I have an issue with the training command. I ran the quickstart, first with the quickstart files (which work), then with my own files. During the training process, I got this:

>     onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval
WARNING:tensorflow:You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.
INFO:tensorflow:Using parameters:
data:
  eval_features_file: src-val.txt
  eval_labels_file: tgt-val.txt
  source_vocabulary: src-vocab.txt
  target_vocabulary: tgt-vocab.txt
  train_features_file: src-train.txt
  train_labels_file: tgt-train.txt
eval:
  batch_size: 32
infer:
  batch_size: 32
  length_bucket_width: 5
model_dir: run/
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 500000
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100

2019-12-13 12:12:32.316005: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-13 12:12:32.642184: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fad2043d410 executing computations on platform Host. Devices:
2019-12-13 12:12:32.642215: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Restored checkpoint run/ckpt-0
INFO:tensorflow:Training on 4601352 examples
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/summary/summary_iterator.py:68: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
INFO:tensorflow:Accumulate gradients of 9 iterations to reach effective batch size of 25000
WARNING:tensorflow:There is non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py:253: _EagerTensorBase.cpu (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.identity instead.
INFO:tensorflow:Saved checkpoint run/ckpt-0
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 983448064 elements. This may consume a large amount of memory.
  num_elements)
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 1692146688 elements. This may consume a large amount of memory.
  num_elements)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 983448064 elements. This may consume a large amount of memory.
  num_elements)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 1692146688 elements. This may consume a large amount of memory.
  num_elements)
2019-12-13 12:38:30.662016: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 685235 of 4601352
2019-12-13 12:38:40.607234: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 1362959 of 4601352
2019-12-13 12:38:50.603162: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 2043940 of 4601352
2019-12-13 12:39:00.601302: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 2826203 of 4601352
2019-12-13 12:39:10.601205: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 3839627 of 4601352
2019-12-13 12:39:17.606390: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:193] Shuffle buffer filled.
[1]    7877 killed     onmt-main --model_type Transformer --config data.yml --auto_config train

I thought I might be running out of memory, but that seems odd to me (I have a 2017 MacBook Pro with an Intel Iris Plus Graphics 640 card and 8 GB of memory; moreover, OpenNMT-py works fine on my computer with these files, but OpenNMT-tf does not).

Could someone tell me whether I am running out of memory or whether it is something else? And if so, what could it be?
Thank you!

Hi,

Can you monitor the system memory usage while the training is running?
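For example, from another terminal you could check the resident set size (RSS) of the training process with `ps`, which works on both macOS and Linux. Finding the PID via `pgrep -f onmt-main` is an assumption about your setup; the snippet below inspects the current shell as a stand-in:

```shell
# One-shot check of a process's resident memory (RSS, in kilobytes).
# For the real run, use e.g.: TRAIN_PID=$(pgrep -f onmt-main)
TRAIN_PID=$$        # placeholder: inspect this shell itself for illustration
ps -o rss= -p "$TRAIN_PID"
```

Repeating this (or watching the process in Activity Monitor) while the shuffle buffer fills should show whether memory usage climbs toward the 8 GB limit before the process is killed.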

To reduce memory usage during training, you can configure the buffer size used for data shuffling, e.g.:

train:
  sample_buffer_size: 1000000

The default is to load the full dataset into memory.
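To give an intuition for why this helps: a shuffle buffer trades memory for shuffle quality by holding only a window of examples at a time. Here is a minimal Python sketch of the idea (an illustration of the technique, not OpenNMT-tf's or TensorFlow's actual implementation):

```python
import random

def shuffled(stream, buffer_size, seed=0):
    """Yield items of `stream` in approximately shuffled order while
    keeping at most `buffer_size` items in memory at a time."""
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            # Swap a random buffered item to the end and emit it.
            i = rng.randrange(len(buf))
            buf[i], buf[-1] = buf[-1], buf[i]
            yield buf.pop()
    rng.shuffle(buf)   # drain whatever remains at end of stream
    yield from buf
```

A smaller `buffer_size` means less memory but a less uniform shuffle; with `buffer_size` equal to the dataset size (the default behaviour described above) you get a full shuffle at the cost of holding everything in RAM.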

Hi!

I tried that and it seemed to be working, but eventually it ends with the same result… Could it be because my data is quite large?

8 GB is not a lot of system memory for training; you could try further reducing `sample_buffer_size`.
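As a rough back-of-the-envelope check (the per-example size below is an assumed average, not a measured value), the buffer alone can account for a noticeable share of 8 GB at these dataset sizes:

```python
# Illustrative estimate of shuffle-buffer memory only; real usage also
# includes the model, gradients, and optimizer state on top of this.
avg_bytes_per_example = 250          # assumed: tokenized sentence pair + Python overhead
for buffer_size in (4_601_352, 1_000_000, 200_000):
    gib = buffer_size * avg_bytes_per_example / 1024**3
    print(f"{buffer_size:>9} examples -> ~{gib:.2f} GiB")
```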

Are you planning to run the training on your MacBook? Or is just to test the workflow?

I was planning to run the training on my MacBook, but after reducing `sample_buffer_size` again and still hitting the issue, I am thinking of doing it on an online server instead.

Thank you very much for your help!