Hi Guillaume!
I have this under the data section:
sequence_controls:
  start: true
  end: true
When I remove it, training runs fine. However, when I add it back, I get the following error at the evaluation step and training stops. I also tried updating TensorFlow. I checked the development files; they look fine and contain no empty lines.
2022-01-28 06:46:53.501588: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:689] Error in PredictCost() for the op: op: "Softmax" attr { key: "T" value { type: DT_FLOAT } } inputs { dtype: DT_FLOAT shape { unknown_rank: true } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA RTX A4000" frequency: 1560 num_cores: 48 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 4194304 shared_memory_size_per_multiprocessor: 102400 memory_size: 14905966592 bandwidth: 448064000 } outputs { dtype: DT_FLOAT shape { unknown_rank: true } }
[... the same PredictCost() warning for the Softmax op repeats nine more times, up to 06:46:53.524347 ...]
Traceback (most recent call last):
  File "/home/machine/.venvs/onmttf/bin/onmt-main", line 8, in <module>
    sys.exit(main())
  File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/bin/main.py", line 312, in main
    hvd=hvd,
  File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/runner.py", line 284, in train
    moving_average_decay=train_config.get("moving_average_decay"),
  File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/training.py", line 135, in __call__
    evaluator, step, moving_average=moving_average
  File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/training.py", line 192, in _evaluate
    evaluator(step)
  File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/evaluation.py", line 319, in __call__
    loss, predictions = self._eval_fn(source, target)
  File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) INVALID_ARGUMENT: indices[16,0] = 32 is not in [0, 32)
[[node transformer_big_relative_1/GatherV2_1
(defined at /home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/sequence_to_sequence.py:586)
]]
[[transformer_big_relative_1/strided_slice_24/_254]]
(1) INVALID_ARGUMENT: indices[16,0] = 32 is not in [0, 32)
[[node transformer_big_relative_1/GatherV2_1
(defined at /home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/sequence_to_sequence.py:586)
]]
0 successful operations.
0 derived errors ignored. [Op:__inference_evaluate_111405]
Errors may have originated from an input operation.
Input Source operations connected to node transformer_big_relative_1/GatherV2_1:
In[0] transformer_big_relative_1/tile_batch_3/Reshape (defined at /home/machine/.venvs/onmttf/lib/python3.7/site-packages/tensorflow_addons/seq2seq/beam_search_decoder.py:119)
In[1] transformer_big_relative_1/ArgMax (defined at /home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/sequence_to_sequence.py:585)
In[2] transformer_big_relative_1/GatherV2_1/axis:
Operation defined at: (most recent call last)
>>> File "/home/machine/.venvs/onmttf/bin/onmt-main", line 8, in <module>
>>> sys.exit(main())
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/bin/main.py", line 312, in main
>>> hvd=hvd,
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/runner.py", line 284, in train
>>> moving_average_decay=train_config.get("moving_average_decay"),
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/training.py", line 135, in __call__
>>> evaluator, step, moving_average=moving_average
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/training.py", line 192, in _evaluate
>>> evaluator(step)
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/evaluation.py", line 319, in __call__
>>> loss, predictions = self._eval_fn(source, target)
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/model.py", line 163, in evaluate
>>> outputs, predictions = self(features, labels=labels)
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/model.py", line 103, in __call__
>>> outputs, predictions = super().__call__(
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
>>> return fn(*args, **kwargs)
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/keras/engine/base_layer.py", line 1083, in __call__
>>> outputs = call_fn(inputs, *args, **kwargs)
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
>>> return fn(*args, **kwargs)
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/sequence_to_sequence.py", line 180, in call
>>> if not training:
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/sequence_to_sequence.py", line 181, in call
>>> predictions = self._dynamic_decode(
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/sequence_to_sequence.py", line 323, in _dynamic_decode
>>> if params.get("replace_unknown_target", False):
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/sequence_to_sequence.py", line 354, in _dynamic_decode
>>> replaced_target_tokens = replace_unknown_target(
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/sequence_to_sequence.py", line 606, in replace_unknown_target
>>> aligned_source_tokens = align_tokens_from_attention(source_tokens, attention)
>>>
>>> File "/home/machine/.venvs/onmttf/lib/python3.7/site-packages/opennmt/models/sequence_to_sequence.py", line 586, in align_tokens_from_attention
>>> return tf.gather(tokens, alignment, axis=1, batch_dims=1)
>>>
Function call stack:
evaluate -> evaluate
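From the last frames, the failing call seems to be tf.gather(tokens, alignment, axis=1, batch_dims=1) in align_tokens_from_attention, reached through replace_unknown_target, and the index 32 is exactly one past the valid range [0, 32). The pure-Python sketch below is only an illustration of that kind of error, not OpenNMT-tf code. It assumes (I have not confirmed this) that the attention covers one more source position than the token list, for example because a start/end marker is counted in one but not in the other. The function name align_tokens and the toy data are hypothetical.

```python
# Hypothetical illustration, not OpenNMT-tf code.
# For each target step, pick the source token with the highest attention weight.
def align_tokens(tokens, attention):
    aligned = []
    for step_weights in attention:
        # Index of the largest attention weight for this target step.
        best = max(range(len(step_weights)), key=step_weights.__getitem__)
        if best >= len(tokens):
            # Same shape of message as the TensorFlow gather error.
            raise IndexError(f"index {best} is not in [0, {len(tokens)})")
        aligned.append(tokens[best])
    return aligned


# Three source tokens, but four attention columns per target step,
# as if an extra marker had been encoded without being added to `tokens`.
tokens = ["a", "b", "c"]
attention = [
    [0.7, 0.1, 0.1, 0.1],  # points at position 0: fine
    [0.1, 0.1, 0.1, 0.7],  # points at position 3: out of range
]

try:
    align_tokens(tokens, attention)
except IndexError as error:
    print("alignment failed:", error)
```

If that assumption holds, the error would appear only when a target step attends most strongly to the extra position, which would fit with it showing up at evaluation rather than on every batch.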
Thanks!
Yasmin