Tensor conversion/ValueError when training with online tokenizer

lefterav · May 5, 2020, 7:49pm

I have been trying to reproduce the German-to-English WMT 19 experiment with OpenNMT-tf. Unfortunately, when I use the online tokenizer, training fails with a tensor conversion error.

The error doesn’t occur if I comment out the source_tokenization and target_tokenization lines. I have previously used the same installation with the online tokenizer on another language pair without any issue.

The error that I get is the following:

I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
I tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Filling up shuffle buffer (this may take a while): 1742745 of 7412378
...
I tensorflow/core/kernels/data/shuffle_dataset_op.cc:199] Shuffle buffer filled.
W tensorflow/core/framework/op_kernel.cc:1643] Invalid argument: ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor: shape=(0,), dtype=float32, numpy=array([], dtype=float32)>

Traceback (most recent call last):

  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 234, in __call__
    return func(device, token, args)

  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 141, in __call__
    self._convert(ret, dtype=self._out_dtypes[0]), device_name)

  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/ops/script_ops.py", line 112, in _convert
    return ops.convert_to_tensor(value, dtype=dtype)

  File "software/env/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1290, in convert_to_tensor
    (dtype.name, value.dtype.name, value))

ValueError: Tensor conversion requested dtype string for Tensor with dtype float32: <tf.Tensor: shape=(0,), dtype=float32, numpy=array([], dtype=float32)>

The config file looks like this:

model_dir: models/de-en_basic-aggr

data:
  train_features_file: data/train/train.de
  train_labels_file: data/train/train.en
  eval_features_file: data/dev/newstest2018.de-en.de
  eval_labels_file: data/dev/newstest2018.de-en.en
  source_vocabulary: vocab/train.de.aggr.vocab
  target_vocabulary: vocab/train.en.aggr.vocab
  source_tokenization: config/tokenizer.aggressive.de-en.yml
  target_tokenization: config/tokenizer.aggressive.de-en.yml

train:
  save_checkpoints_steps: 1000

eval:
  external_evaluators: BLEU

infer:
  batch_size: 32

And the config file of the tokenizer looks like this:

mode: aggressive
joiner_annotate: true
segment_numbers: true
segment_alphabet_change: true

I also tried various ways of preparing the vocabulary files. The vocabularies specified above are created with these commands:

onmt-build-vocab --size 32000 --tokenizer_config config/tokenizer.aggressive.de-en.yml --save_vocab vocab/train.de.aggr.vocab data/train/train.de
onmt-build-vocab --size 32000 --tokenizer_config config/tokenizer.aggressive.de-en.yml --save_vocab vocab/train.en.aggr.vocab data/train/train.en

Apart from the aggressive tokenization of the native OpenNMT tokenizer module, I also tried SentencePiece, but it gave the same error. Could you maybe spot the issue? I am stuck and cannot see where the problem is. I would greatly appreciate any help.

guillaumekln · May 6, 2020, 8:44am

Thanks for reporting. I reproduced the error when the tokenizer is called on an empty line.

I pushed the fix to OpenNMT-tf 2.9.3. Could you update and try again?
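For anyone who cannot update right away, a quick way to check whether a corpus is affected is to look for empty lines in the training files. The following is only a minimal sketch: the helper name find_empty_lines is made up for illustration and is not part of OpenNMT-tf, and the file paths are the ones from the config above.

```python
from pathlib import Path
from typing import List


def find_empty_lines(path: str) -> List[int]:
    """Return the 1-based numbers of lines that are empty or whitespace-only.

    Hypothetical helper for illustration; not an OpenNMT-tf function.
    """
    empty = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if not line.strip():
                empty.append(number)
    return empty


if __name__ == "__main__":
    # Paths taken from the config in this thread; adjust to your setup.
    for path in ("data/train/train.de", "data/train/train.en"):
        if Path(path).exists():
            lines = find_empty_lines(path)
            print(f"{path}: {len(lines)} empty line(s), first few: {lines[:10]}")
```

Since the source and target files are parallel, an empty line should be removed from both files at the same line number, otherwise the sentence pairs go out of alignment.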

lefterav · May 7, 2020, 12:56am

Yes, this solved the issue. It passed the point where it used to crash. Thanks for your immediate response!

lefterav · May 7, 2020, 8:22pm

Just a small heads up: after almost one day of training, the run seems to be progressing poorly and slowly compared to the one I started yesterday with pre-tokenized corpora. I haven’t gone very deep with debugging, though.

guillaumekln · May 7, 2020, 8:28pm

It is slower but the accuracy should be the same.

lefterav · May 9, 2020, 4:00pm

Dunno, I am puzzled. At around 60k steps the BLEU score was only 16, whereas the training with the pre-tokenized text got a BLEU score of 42 after the same number of steps. Of course, measuring BLEU on word segments would give higher values, but this difference is huge.

guillaumekln · May 9, 2020, 4:15pm

Just to confirm, you used the same tokenization in both cases?

lefterav · May 11, 2020, 8:13pm

Yes, hm, almost. In both cases I tokenized with SentencePiece. In the online scenario I followed the SentencePiece documentation and generated a separate vocabulary for each language, constraining the vocabulary of the jointly trained tokenizer to the respective language. Maybe that was not a good idea. I will check again whether everything was OK.
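A rough way to check whether the tokenizer output and the vocabulary file agree is to measure how many tokens of an already tokenized sample are missing from the vocabulary. This is just a sketch under a few assumptions: the vocabulary file has one token per line, the sample is space-separated tokenizer output, and the helper names load_vocab and oov_rate are made up for illustration.

```python
from typing import Iterable, Set


def load_vocab(path: str) -> Set[str]:
    """Read a vocabulary file, assuming one token per line."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def oov_rate(tokenized_lines: Iterable[str], vocab: Set[str]) -> float:
    """Fraction of space-separated tokens that are not in the vocabulary."""
    total = 0
    unknown = 0
    for line in tokenized_lines:
        for token in line.split():
            total += 1
            if token not in vocab:
                unknown += 1
    # Avoid a division by zero on an empty sample.
    return unknown / total if total else 0.0
```

A high rate on the training data itself would suggest that the online tokenization and the vocabulary were not produced with the same settings, which could explain a gap like the one in BLEU above.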