Hi there! I have an issue with the training command. I ran the quickstart, first with the quickstart files (which work fine), then with my own files. During training with my own files, I got this:
> onmt-main --model_type Transformer --config data.yml --auto_config train --with_eval
```
WARNING:tensorflow:You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.
INFO:tensorflow:Using parameters:
data:
  eval_features_file: src-val.txt
  eval_labels_file: tgt-val.txt
  source_vocabulary: src-vocab.txt
  target_vocabulary: tgt-vocab.txt
  train_features_file: src-train.txt
  train_labels_file: tgt-train.txt
eval:
  batch_size: 32
infer:
  batch_size: 32
  length_bucket_width: 5
model_dir: run/
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 500000
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100
2019-12-13 12:12:32.316005: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-12-13 12:12:32.642184: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fad2043d410 executing computations on platform Host. Devices:
2019-12-13 12:12:32.642215: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Restored checkpoint run/ckpt-0
INFO:tensorflow:Training on 4601352 examples
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/summary/summary_iterator.py:68: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and: `tf.data.TFRecordDataset(path)`
INFO:tensorflow:Accumulate gradients of 9 iterations to reach effective batch size of 25000
WARNING:tensorflow:There is non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/constant_op.py:253: _EagerTensorBase.cpu (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.identity instead.
INFO:tensorflow:Saved checkpoint run/ckpt-0
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 983448064 elements. This may consume a large amount of memory.
  num_elements)
WARNING:tensorflow:From /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1781: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 1692146688 elements. This may consume a large amount of memory.
  num_elements)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 983448064 elements. This may consume a large amount of memory.
  num_elements)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_core/python/framework/indexed_slices.py:421: UserWarning: Converting sparse IndexedSlices to a dense Tensor with 1692146688 elements. This may consume a large amount of memory.
  num_elements)
2019-12-13 12:38:30.662016: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 685235 of 4601352
2019-12-13 12:38:40.607234: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 1362959 of 4601352
2019-12-13 12:38:50.603162: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 2043940 of 4601352
2019-12-13 12:39:00.601302: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 2826203 of 4601352
2019-12-13 12:39:10.601205: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:143] Filling up shuffle buffer (this may take a while): 3839627 of 4601352
2019-12-13 12:39:17.606390: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:193] Shuffle buffer filled.
[1] 7877 killed onmt-main --model_type Transformer --config data.yml --auto_config train
```
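For reference, my data.yml is basically just the model directory and the file paths; this is a reconstruction from the parameters logged above, everything else comes from `--auto_config`:

```yaml
model_dir: run/

data:
  train_features_file: src-train.txt
  train_labels_file: tgt-train.txt
  eval_features_file: src-val.txt
  eval_labels_file: tgt-val.txt
  source_vocabulary: src-vocab.txt
  target_vocabulary: tgt-vocab.txt
```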
I thought I might be running out of memory, but that seems odd to me: I have a 2017 MacBook Pro with an Intel Iris Plus Graphics 640 and 8 GB of memory, and OpenNMT-py trains fine on this machine with the same files; only OpenNMT-tf fails.
Could someone tell me whether I am really running out of memory or whether something else is going on? And if it is something else, what could it be?
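For example, I noticed `sample_buffer_size: -1` in the parameters above, and the process gets killed right after the shuffle buffer is filled with all 4601352 examples. Would capping that buffer in data.yml be a sensible thing to try, or am I looking in the wrong place? Something like:

```yaml
train:
  # Guess on my part: limit the shuffle buffer instead of loading the whole
  # training set into memory (auto_config currently sets it to -1).
  sample_buffer_size: 500000
```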
Thank you!