Training crash at validation

SamuelLacombe · June 30, 2021, 3:24pm

Hello,

I’ve been trying to use the automatic export on best bleu score, but as soon as i enable these options… it crash.

train:
    save_checkpoints_steps: 500
    #max_step: 500000
    #single_pass: true
    # (optional) How many checkpoints to keep on disk.
    keep_checkpoint_max: 10
    #effective_batch_size: 1

eval:
    steps: 500
    # Available scorers: bleu, rouge, wer, ter, prf
    scorers: bleu
    export_on_best: bleu
    export_format: saved_model
    max_exports_to_keep: 2

infer:
  n_best: 3
  with_scores: true

2021-06-30 15:13:16.661000: W deprecation.py:534] From /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
2021-06-30 15:13:20.703076: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.703234: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.705835: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.707624: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.709433: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.711257: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.713106: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.718170: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.737917: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.739712: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.740458: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.743680: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.744711: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.746572: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.749903: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.750904: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.751805: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.755037: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.755934: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.756768: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.760104: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.761013: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.761944: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:13:20.765847: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
2021-06-30 15:16:44.707000: I training.py:202] Evaluation predictions saved to gdrive/MyDrive/VGR/en-fr/model/tf/OpenNMT-TF_fr_mtTransformer_tmtbpe_vs8000/eval/predictions.txt.500
Traceback (most recent call last):
  File "/usr/local/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/opennmt/bin/main.py", line 326, in main
    hvd=hvd,
  File "/usr/local/lib/python3.7/dist-packages/opennmt/runner.py", line 281, in train
    moving_average_decay=train_config.get("moving_average_decay"),
  File "/usr/local/lib/python3.7/dist-packages/opennmt/training.py", line 145, in __call__
    evaluator, step, moving_average=moving_average
  File "/usr/local/lib/python3.7/dist-packages/opennmt/training.py", line 202, in _evaluate
    evaluator(step)
  File "/usr/local/lib/python3.7/dist-packages/opennmt/evaluation.py", line 343, in __call__
    score = scorer(self._labels_file, output_path)
  File "/usr/local/lib/python3.7/dist-packages/opennmt/utils/scorers.py", line 92, in __call__
    bleu = sacrebleu.corpus_bleu(sys_stream, [ref_stream], force=True)
  File "/usr/local/lib/python3.7/dist-packages/sacrebleu/compat.py", line 36, in corpus_bleu
    sys_stream, ref_streams, use_effective_order=use_effective_order)
  File "/usr/local/lib/python3.7/dist-packages/sacrebleu/metrics/bleu.py", line 277, in corpus_score
    raise EOFError("System and reference streams have different lengths!")
EOFError: System and reference streams have different lengths!

guillaumekln · June 30, 2021, 3:29pm

Hi,

Can you count the number of lines the file

gdrive/MyDrive/VGR/en-fr/model/tf/OpenNMT-TF_fr_mtTransformer_tmtbpe_vs8000/eval/predictions.txt.500

and compare it against the number of lines in the files that you passed to eval_features_file and eval_labels_file?

SamuelLacombe · June 30, 2021, 4:03pm

Hello,

Ok there are 3 times more records, which I believe is because of “n_best :3” under the infer parameters.

Is this the expected behaviour?

guillaumekln · June 30, 2021, 4:09pm

Ah, interesting.

Ideally we should not apply the n_best parameter when running the evaluation. I will check if it is easy to fix.

In the meantime you can just remove n_best from the configuration.

EDIT: I opened an issue in the repository:

guillaumekln · July 1, 2021, 4:14pm

This issue is fixed in version 2.20.1.