TensorFlow error with training command

Hi there! I have an issue with the training command. I ran the quickstart, first with the quickstart files (which work), then with my own files. During training, I got this:

>     onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval
WARNING:tensorflow:You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.
INFO:tensorflow:Using parameters:
data:
  eval_features_file: src-val.txt
  eval_labels_file: tgt-val.txt
  source_vocabulary: src-vocab.txt
  target_vocabulary: tgt-vocab.txt
  train_features_file: src-train.txt
  train_labels_file: tgt-train.txt
eval:
  batch_size: 32
infer:
  batch_size: 32
  length_bucket_width: 5
model_dir: run/
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 500000
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100

2019-12-13 12:12:32.316005: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-13 12:12:32.642184: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fad2043d410 executing computations on platform Host. Devices:
2019-12-13 12:12:32.642215: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Restored checkpoint run/ckpt-0
INFO:tensorflow:Training on 4601352 examples
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/summary/summary_iterator.py:68: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
`tf.data.TFRecordDataset(path)`
INFO:tensorflow:Accumulate gradients of 9 iterations to reach effective batch size of 25000
WARNING:tensorflow:There is non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py:253: _EagerTensorBase.cpu (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.identity instead.
INFO:tensorflow:Saved checkpoint run/ckpt-0
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 983448064 elements. This may consume a large amount of memory.
  num_elements)
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 1692146688 elements. This may consume a large amount of memory.
  num_elements)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 983448064 elements. This may consume a large amount of memory.
  num_elements)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 1692146688 elements. This may consume a large amount of memory.
  num_elements)
2019-12-13 12:38:30.662016: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 685235 of 4601352
2019-12-13 12:38:40.607234: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 1362959 of 4601352
2019-12-13 12:38:50.603162: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 2043940 of 4601352
2019-12-13 12:39:00.601302: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 2826203 of 4601352
2019-12-13 12:39:10.601205: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 3839627 of 4601352
2019-12-13 12:39:17.606390: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:193] Shuffle buffer filled.
[1]    7877 killed     onmt-main --model_type Transformer --config data.yml --auto_config train

I thought I might be running out of memory, but that seems weird to me (I have a 2017 MacBook Pro with an Intel Iris Plus Graphics 640 card and 8 GB of memory; moreover, OpenNMT-py works fine on my computer with these files, but OpenNMT-tf does not).

Could someone tell me whether I am running out of memory, or whether it is something else? And if so, what could it be?
Thank you!

Hi,

Can you monitor the system memory usage while the training is running?
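
For instance, one possible way to watch it on macOS (assuming the stock top utility; Activity Monitor works as well) is:

# print overall memory usage every 5 seconds while the training runs
while true; do top -l 1 | grep PhysMem; sleep 5; done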

To reduce memory usage during training, you can configure the buffer size used for data shuffling, e.g.:

train:
  sample_buffer_size: 1000000

The default is to load the full dataset into memory (that is what the "Filling up shuffle buffer ... of 4601352" lines in your log correspond to).

Hi!

I tried that and it seemed to be working, but eventually it produced the same result… Could it be because my data is quite large?

8 GB is not a lot of system memory for training; you could try further reducing sample_buffer_size.
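
For example (the value below is just an illustration; pick something that fits comfortably in your RAM):

train:
  sample_buffer_size: 200000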

Are you planning to run the training on your MacBook? Or is it just to test the workflow?

I was planning on running the training on my MacBook, but after reducing sample_buffer_size again and still hitting the issue, I am thinking of doing it on an online server instead.

Thank you very much for your help!