Tensor conversion/ValueError when training with online tokenizer

I have been trying to reproduce the German-to-English WMT19 experiment with OpenNMT-tf. Unfortunately, when using the online tokenizer, training fails with a tensor conversion error.

The error doesn’t occur if I comment out the source_tokenization and target_tokenization lines. I have previously used the same installation with the online tokenizer on another language pair without any issue.

The error that I get is the following:

I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
I tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Filling up shuffle buffer (this may take a while): 1742745 of 7412378
...
I tensorflow/core/kernels/data/shuffle_dataset_op.cc:199] Shuffle buffer filled.
W tensorflow/core/framework/op_kernel.cc:1643] Invalid argument: ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor: shape=(0,), dtype=float32, numpy=array([], dtype=float32)>

Traceback (most recent call last):

  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 234, in __call__
    return func(device, token, args)

  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 141, in __call__
    self._convert(ret, dtype=self._out_dtypes[0]), device_name)

  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 112, in _convert
    return ops.convert_to_tensor(value, dtype=dtype)

  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1290, in convert_to_tensor
    (dtype.name, value.dtype.name, value))

ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor: shape=(0,), dtype=float32, numpy=array([], dtype=float32)>

The config file looks like this:

model_dir: models/de-en_basic-aggr

data:
  train_features_file: data/train/train.de
  train_labels_file: data/train/train.en
  eval_features_file: data/dev/newstest2018.de-en.de
  eval_labels_file: data/dev/newstest2018.de-en.en
  source_vocabulary: vocab/train.de.aggr.vocab
  target_vocabulary: vocab/train.en.aggr.vocab
  source_tokenization: config/tokenizer.aggressive.de-en.yml
  target_tokenization: config/tokenizer.aggressive.de-en.yml

train:
  save_checkpoints_steps: 1000

eval:
  external_evaluators: BLEU

infer:
  batch_size: 32

And the config file of the tokenizer looks like this:

mode: aggressive
joiner_annotate: true
segment_numbers: true
segment_alphabet_change: true

I also tried various ways of preparing the vocabulary files. The vocabularies specified above are created with these commands:

onmt-build-vocab --size 32000 --tokenizer_config config/tokenizer.aggressive.de-en.yml --save_vocab vocab/train.de.aggr.vocab data/train/train.de
onmt-build-vocab --size 32000 --tokenizer_config config/tokenizer.aggressive.de-en.yml --save_vocab vocab/train.en.aggr.vocab data/train/train.en

Apart from the aggressive tokenization with the native OpenNMT tokenizer module, I also tried SentencePiece, but it gave the same error. Could you maybe spot the issue? I am stuck and cannot see where the problem is. I would greatly appreciate any help.

Thanks for reporting. I reproduced the error: it occurs when the tokenizer is called on an empty line.

I pushed the fix to OpenNMT-tf 2.9.3. Could you update and try again?
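For context, the dtype mismatch can be illustrated outside TensorFlow: when a tokenizer returns an empty token list, NumPy (and hence the default conversion behind tf.py_function) infers a float dtype rather than a string one. A minimal sketch, with a whitespace split standing in for the real tokenizer:

```python
import numpy as np

def tokenize(line):
    # Stand-in for the online tokenizer: simple whitespace split.
    return line.split()

# A normal line converts to a string array.
print(np.array(tokenize("Hallo Welt")).dtype)  # a unicode string dtype

# An empty line yields an empty list, which NumPy infers as float64 --
# the "requested dtype string for Tensor with dtype float32" mismatch.
print(np.array(tokenize("")).dtype)  # float64
```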


Yes, this solved the issue. It passed the point where it used to crash. Thanks for your immediate response!
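For anyone who cannot update right away, filtering empty lines out of both sides of the parallel corpus before training should also avoid the crash. A minimal sketch (my own helper, not part of OpenNMT-tf; note that both sides of a pair must be dropped together to keep the corpus aligned):

```python
def drop_empty_pairs(src_lines, tgt_lines):
    """Keep only sentence pairs where both sides are non-empty."""
    return [(s, t) for s, t in zip(src_lines, tgt_lines)
            if s.strip() and t.strip()]

src = ["Hallo Welt\n", "\n", "Guten Morgen\n"]
tgt = ["Hello world\n", "\n", "Good morning\n"]
print(drop_empty_pairs(src, tgt))  # keeps the two non-empty pairs
```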

Just a small heads-up: after almost a day of training, progress seems quite poor and slow compared to the run I started yesterday with pre-tokenized corpora. I haven’t dug very deep into debugging it yet, though.

It is slower but the accuracy should be the same.

I don’t know, I am puzzled. At around 60k steps the BLEU score was only 16, whereas training on the pre-tokenized text reached a BLEU score of 42 after the same number of steps. Of course, measuring BLEU on word segments gives higher values, but this difference is huge.

Just to confirm, you used the same tokenization in both cases?

Yes, well, almost. In both cases I tokenized with SentencePiece. In the online scenario I followed the SentencePiece documentation: I generated a separate vocabulary for each language and constrained the vocabulary of the jointly trained tokenizer to the respective language. Maybe that was not a good idea. I will check again whether everything was OK.
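One way to double-check is to measure how much the vocabulary one run actually used overlaps with the other's. A minimal stdlib sketch (the tab-separated one-token-per-line vocab format and the helper names are my assumptions, not OpenNMT-tf API):

```python
def load_vocab(path):
    # Assumes one token per line, optionally followed by a tab and a count.
    with open(path, encoding="utf-8") as f:
        return {line.split("\t")[0].rstrip("\n") for line in f if line.strip()}

def vocab_overlap(vocab_a, vocab_b):
    """Fraction of vocab_a's tokens that also appear in vocab_b."""
    if not vocab_a:
        return 0.0
    return len(vocab_a & vocab_b) / len(vocab_a)

# Toy example with SentencePiece-style pieces:
a = {"▁Hallo", "▁Welt", "▁gut"}
b = {"▁Hallo", "▁Welt", "▁schlecht"}
print(vocab_overlap(a, b))  # two of a's three tokens appear in b
```

A low overlap between the vocabularies of the two runs would confirm the tokenizations really diverged.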