Training crash at validation


I’ve been trying to use the automatic export on best BLEU score, but as soon as I enable these options, it crashes.

    save_checkpoints_steps: 500
    #max_step: 500000
    #single_pass: true
    # (optional) How many checkpoints to keep on disk.
    keep_checkpoint_max: 10
    #effective_batch_size: 1

    steps: 500
    # Available scorers: bleu, rouge, wer, ter, prf
    scorers: bleu
    export_on_best: bleu
    export_format: saved_model
    max_exports_to_keep: 2

  n_best: 3
  with_scores: true
2021-06-30 15:13:16.661000: W] From /usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/ calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
2021-06-30 15:13:20.703076: W tensorflow/core/grappler/costs/] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "Tesla V100-SXM2-16GB" frequency: 1530 num_cores: 80 environment { key: "architecture" value: "7.0" } environment { key: "cuda" value: "11000" } environment { key: "cudnn" value: "8004" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 6291456 shared_memory_size_per_multiprocessor: 98304 memory_size: 15395979264 bandwidth: 898048000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
[the same PredictCost() warning for the "Softmax" op is repeated 23 more times, up to 15:13:20.765847]
2021-06-30 15:16:44.707000: I] Evaluation predictions saved to gdrive/MyDrive/VGR/en-fr/model/tf/OpenNMT-TF_fr_mtTransformer_tmtbpe_vs8000/eval/predictions.txt.500
Traceback (most recent call last):
  File "/usr/local/bin/onmt-main", line 8, in <module>
  File "/usr/local/lib/python3.7/dist-packages/opennmt/bin/", line 326, in main
  File "/usr/local/lib/python3.7/dist-packages/opennmt/", line 281, in train
  File "/usr/local/lib/python3.7/dist-packages/opennmt/", line 145, in __call__
    evaluator, step, moving_average=moving_average
  File "/usr/local/lib/python3.7/dist-packages/opennmt/", line 202, in _evaluate
  File "/usr/local/lib/python3.7/dist-packages/opennmt/", line 343, in __call__
    score = scorer(self._labels_file, output_path)
  File "/usr/local/lib/python3.7/dist-packages/opennmt/utils/", line 92, in __call__
    bleu = sacrebleu.corpus_bleu(sys_stream, [ref_stream], force=True)
  File "/usr/local/lib/python3.7/dist-packages/sacrebleu/", line 36, in corpus_bleu
    sys_stream, ref_streams, use_effective_order=use_effective_order)
  File "/usr/local/lib/python3.7/dist-packages/sacrebleu/metrics/", line 277, in corpus_score
    raise EOFError("System and reference streams have different lengths!")
EOFError: System and reference streams have different lengths!


Can you count the number of lines in the evaluation predictions file and compare it against the number of lines in the files that you passed to eval_features_file and eval_labels_file?
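A quick way to run that comparison is a short Python sketch like the one below. The helper names `count_lines` and `compare_line_counts` are made up for illustration and are not part of OpenNMT-tf; pass the predictions file reported in the log and the file you set as eval_labels_file.

```python
def count_lines(path):
    """Return the number of lines in a text file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def compare_line_counts(predictions_path, labels_path):
    """Return (num_predictions, num_labels, counts_match) for the two files."""
    num_predictions = count_lines(predictions_path)
    num_labels = count_lines(labels_path)
    return num_predictions, num_labels, num_predictions == num_labels
```

If the two counts differ, the scorer cannot pair each prediction with its reference, which is consistent with the "different lengths" error in the traceback.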


OK, there are 3 times as many records, which I believe is because of “n_best: 3” under the infer parameters.

Is this the expected behaviour?

Ah, interesting.

Ideally we should not apply the n_best parameter when running the evaluation. I will check if it is easy to fix.

In the meantime you can just remove n_best from the configuration.
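If you want to score a predictions file that was already written with n_best set, a rough workaround is to keep only the first hypothesis of each group. The sketch below assumes that the n_best hypotheses for one source sentence are written on consecutive lines with the best one first, and that, with with_scores enabled, each line looks like `score ||| text`. Both are assumptions about the output format, so check them against your own file. The function name `keep_top_hypothesis` is hypothetical.

```python
def keep_top_hypothesis(lines, n_best, score_separator=None):
    """Keep the first of every `n_best` consecutive lines.

    Assumes hypotheses for one source sentence are consecutive, best first.
    If `score_separator` is given, an assumed "score <separator> text"
    prefix is stripped from each kept line.
    """
    if n_best < 1:
        raise ValueError("n_best must be at least 1")
    if len(lines) % n_best != 0:
        raise ValueError("number of lines is not a multiple of n_best")
    top = lines[::n_best]
    if score_separator is not None:
        # Keep only the text after the first separator, if one is present.
        top = [line.split(score_separator, 1)[-1].lstrip() for line in top]
    return top
```

For example, calling `keep_top_hypothesis(lines, 3, "|||")` on a file read with `str.splitlines()` should give one line per source sentence, which can then be compared against the reference file.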

EDIT: I opened an issue in the repository:


This issue is fixed in version 2.20.1.
