I have been trying to reproduce the German to English WMT19 experiment with OpenNMT-tf. Unfortunately, when using the online tokenizer, the training process fails with a tensor conversion error.
The error does not occur if I comment out the source_tokenization and target_tokenization lines. I have previously used the same installation with the online tokenizer on another language pair without any issue.
The error that I get is the following:
I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
I tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Filling up shuffle buffer (this may take a while): 1742745 of 7412378
...
I tensorflow/core/kernels/data/shuffle_dataset_op.cc:199] Shuffle buffer filled.
W tensorflow/core/framework/op_kernel.cc:1643] Invalid argument: ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor: shape=(0,), dtype=float32, numpy=array([], dtype=float32)>
Traceback (most recent call last):
  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 234, in __call__
    return func(device, token, args)
  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 141, in __call__
    self._convert(ret, dtype=self._out_dtypes[0]), device_name)
  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 112, in _convert
    return ops.convert_to_tensor(value, dtype=dtype)
  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1290, in convert_to_tensor
    (dtype.name, value.dtype.name, value))
ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor: shape=(0,), dtype=float32, numpy=array([], dtype=float32)>
The config file looks like this:
model_dir: models/de-en_basic-aggr

data:
  train_features_file: data/train/train.de
  train_labels_file: data/train/train.en
  eval_features_file: data/dev/newstest2018.de-en.de
  eval_labels_file: data/dev/newstest2018.de-en.en
  source_vocabulary: vocab/train.de.aggr.vocab
  target_vocabulary: vocab/train.en.aggr.vocab
  source_tokenization: config/tokenizer.aggressive.de-en.yml
  target_tokenization: config/tokenizer.aggressive.de-en.yml

train:
  save_checkpoints_steps: 1000

eval:
  external_evaluators: BLEU

infer:
  batch_size: 32
And the tokenizer's config file looks like this:
mode: aggressive
joiner_annotate: true
segment_numbers: true
segment_alphabet_change: true
I also tried various ways of preparing the vocabulary files. The vocabularies specified above were created with these commands:
onmt-build-vocab --size 32000 --tokenizer_config config/tokenizer.aggressive.de-en.yml --save_vocab vocab/train.de.aggr.vocab data/train/train.de
onmt-build-vocab --size 32000 --tokenizer_config config/tokenizer.aggressive.de-en.yml --save_vocab vocab/train.en.aggr.vocab data/train/train.en
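To rule out a malformed vocabulary, I also checked that the generated files are one token per line, with no empty or multi-token lines (check_vocab is a quick helper I wrote, not part of OpenNMT; it is shown here on a tiny in-memory example, but with the real files you would pass vocab/train.de.aggr.vocab and vocab/train.en.aggr.vocab instead):

```python
import tempfile

def check_vocab(path):
    """Return (line count, empty line numbers, lines containing spaces)."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    empty = [i + 1 for i, line in enumerate(lines) if not line]
    spaced = [i + 1 for i, line in enumerate(lines) if " " in line]
    return len(lines), empty, spaced

# Demo on a small throwaway vocabulary file.
with tempfile.NamedTemporaryFile("w", suffix=".vocab", delete=False,
                                 encoding="utf-8") as f:
    f.write("Hello\nWorld\n￭.\n")
    demo_path = f.name

size, empty_lines, spaced_lines = check_vocab(demo_path)
print(size, empty_lines, spaced_lines)  # prints: 3 [] []
```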
Apart from the aggressive tokenization with the native OpenNMT tokenizer module, I also tried SentencePiece, but it gave the same error. Can you spot the issue? I am stuck and cannot see where the problem is. Any help would be greatly appreciated.