Training crashes at evaluation due to UnicodeDecodeError

First time training with OpenNMT-tf. I’m using the transformer configuration, and training is happy enough, but when it’s time for eval, it pukes on a line that contains placeholder text (which is seen plenty during training): ᚘ22ᚆ

Traceback:

2018-03-29 15:32:19.109885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1052] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11432 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN X, pci bus id: 0000:03:00.0, compute capability: 5.2)
INFO:tensorflow:Restoring parameters from /data/exp04/models/model.ckpt-9278
INFO:tensorflow:Running local_init_op.
2018-03-29 15:32:19.911414: I tensorflow/core/kernels/lookup_util.cc:362] Table trying to initialize from file /data/exp04/models/sentpiece/dell_en-ja_spm_50k.vocab is already initialized.
INFO:tensorflow:Done running local_init_op.
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/OpenNMT-tf/bin/main.py", line 135, in <module>
    main()
  File "/root/OpenNMT-tf/bin/main.py", line 116, in main
    runner.train_and_evaluate()
  File "/root/OpenNMT-tf/opennmt/runner.py", line 138, in train_and_evaluate
    tf.estimator.train_and_evaluate(self._estimator, train_spec, eval_spec)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 439, in train_and_evaluate
    executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 518, in run
    self.run_local()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 657, in run_local
    eval_result = evaluator.evaluate_and_export()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/training.py", line 847, in evaluate_and_export
    hooks=self._eval_spec.hooks)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 418, in evaluate
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 965, in _evaluate_model
    config=self._session_config)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/evaluation.py", line 212, in _evaluate_once
    session.run(eval_ops, feed_dict)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1178, in run
    run_metadata=run_metadata))
  File "/root/OpenNMT-tf/opennmt/utils/hooks.py", line 134, in after_run
    self._model.print_prediction(prediction, stream=output_file)
  File "/root/OpenNMT-tf/opennmt/models/sequence_to_sequence.py", line 242, in print_prediction
    print_bytes(tf.compat.as_bytes(sentence), stream=stream)
  File "/root/OpenNMT-tf/opennmt/utils/misc.py", line 26, in print_bytes
    text = str_as_bytes.decode(encoding) if encoding != "ascii" else str_as_bytes
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 8: ordinal not in range(128)

Why do we see the UnicodeDecodeError during eval but not during training?

Evaluation saves the translations to a file, and that is where encoding issues usually happen.
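To make the failure concrete, here is a small sketch that decodes the placeholder from the original post with a few codecs. The helper name `can_decode` is made up for this illustration and is not part of OpenNMT-tf. The "ANSI_X3.4-1968" case is an assumption about the environment: it is the encoding name an unconfigured POSIX locale commonly reports, and since it is not literally the string "ascii", it would slip past the `encoding != "ascii"` check shown in the traceback while still using the ASCII codec.

```python
# -*- coding: utf-8 -*-
import codecs

# The placeholder from the post, encoded the way it is stored in a UTF-8 file.
placeholder_bytes = u"ᚘ22ᚆ".encode("utf-8")


def can_decode(data, encoding):
    """Return True when `data` decodes cleanly with `encoding` (hypothetical helper)."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Show which codec each name resolves to and whether it can decode the placeholder.
for name in ("utf-8", "ascii", "ANSI_X3.4-1968"):
    canonical = codecs.lookup(name).name
    print("{}: codec={}, decodes={}".format(
        name, canonical, can_decode(placeholder_bytes, name)))
```

Training most likely never hits this path because the data stays as bytes inside the TensorFlow graph; the decode only happens when predictions are printed from Python.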

I had a hard time reproducing it, but it looks like the locales are not configured properly on your system. Can you check the output of:

locale

and, if possible, configure them. What is your OS?
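In addition to the shell command, it can help to ask Python which encodings it derived from the environment. This is a minimal sketch using only the standard library; the function name `describe_encodings` is hypothetical.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the environment (hypothetical helper)."""
    return {
        # Encoding Python 3 typically uses for text files opened without `encoding=`.
        "preferred": locale.getpreferredencoding(False),
        # Encoding attached to standard output; may be None when stdout is replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names, shown for completeness.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in sorted(describe_encodings().items()):
        print("{}: {}".format(name, value))
```

If these report an ASCII variant inside the container, setting a UTF-8 locale through the environment (for example `LANG` or `LC_ALL`) before starting Python is the usual remedy.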

Ah, yeah, running in Docker and didn’t do any locale settings. Maybe we could use io.open(…, encoding='utf-8', mode='a') in utils/hooks.py so it doesn’t matter what the environment variables are.

Edit:
I took a quick look at the code, and that would only fix the cases where the stream is a file. We would still need the locale set for stdio.
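For the file case, the idea could look roughly like this. It is a sketch under assumptions, not the actual code in utils/hooks.py: the function name `append_prediction` and its arguments are invented for illustration.

```python
# -*- coding: utf-8 -*-
import io


def append_prediction(path, sentence):
    """Append one sentence to `path` as UTF-8, regardless of the locale (hypothetical helper)."""
    # Accept raw bytes (as produced by TensorFlow) as well as text.
    if isinstance(sentence, bytes):
        sentence = sentence.decode("utf-8")
    # io.open takes `encoding` on both Python 2.7 and Python 3.
    with io.open(path, mode="a", encoding="utf-8") as output_file:
        output_file.write(sentence + u"\n")
```

Because the encoding is passed explicitly, the file contents no longer depend on `LANG` or `LC_ALL`.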

Mmh yes, might need to revise this code. It’s a bit tricky as it should work on Python 2 and Python 3, for files and stdout.

Yes, in Python 3.x, open == io.open, so open is preferred. But for backwards compatibility with 2.7.x, io.open is the way to go. I think it’s ok to leave codecs.open in the 3rd party code, but ideally, the handling would be consistent throughout the code. I submitted a PR for changing the plain “open” to “io.open”, but I left out what to do about stdout, as this is a separate issue, I think.
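For the stdout side, one possible approach is to bypass the text layer and write the UTF-8 bytes directly. The sketch below is only an assumption about how this could be handled, not what the PR does; `write_utf8_line` is a hypothetical name.

```python
import sys


def write_utf8_line(str_as_bytes, stream=None):
    """Write UTF-8 bytes plus a newline to `stream` without consulting the locale."""
    if stream is None:
        stream = sys.stdout
    # Flush pending text first so output written through both layers stays in order.
    stream.flush()
    # Python 3 text streams expose the underlying byte stream as `.buffer`;
    # Python 2 file objects and binary streams accept bytes directly.
    target = getattr(stream, "buffer", stream)
    try:
        target.write(str_as_bytes + b"\n")
    except TypeError:
        # Text-only stream (for example io.StringIO): decode explicitly as UTF-8.
        stream.write(str_as_bytes.decode("utf-8") + u"\n")
    stream.flush()
```

The trade-off is that the terminal receives UTF-8 even if it is configured for something else, which seems acceptable for logging translations but is a design choice worth discussing.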

Update:
I see the PR is failing. Will investigate further…

Thanks for the PR! Linking to it below for future reference:
