First time training with OpenNMT-tf. I’m using the transformer configuration, and training is happy enough, but when it’s time for eval, it pukes on a line that contains placeholder text (which is seen plenty during training): ᚘ22ᚆ


Why do we see the UnicodeDecodeError during eval but not during training?

The evaluation saves the translations to a file, and that is where encoding issues usually happen.
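To illustrate the point, here is a minimal sketch of what happens when the placeholder passes through the ASCII codec, which is what Python may fall back to when no UTF-8 locale is configured. It is not taken from OpenNMT-tf; the helper name `fails_under_ascii` is made up for this example.

```python
# -*- coding: utf-8 -*-

# The placeholder from the failing line, as text and as UTF-8 bytes.
placeholder = u"ᚘ22ᚆ"
utf8_bytes = placeholder.encode("utf-8")


def fails_under_ascii(data):
    """Return the name of the error raised by the ASCII codec, or None.

    Hypothetical helper: bytes are decoded, text is encoded.
    """
    try:
        if isinstance(data, bytes):
            data.decode("ascii")   # bytes -> text, as when reading
        else:
            data.encode("ascii")   # text -> bytes, as when writing
    except UnicodeError as error:
        return type(error).__name__
    return None
```

Decoding the UTF-8 bytes as ASCII raises a `UnicodeDecodeError`, while encoding the text as ASCII raises a `UnicodeEncodeError`. Plain ASCII input such as `u"22"` goes through either way, which is why only lines containing the placeholder fail.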

I had a hard time trying to reproduce it, but it looks like the locales are not configured properly on your system. Can you check the output of:


and if possible configure them. What is your OS?
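One way to see which encodings Python picks up from the environment is sketched below. This is not necessarily the command requested above; it only uses standard-library calls, and the function name `describe_encodings` is made up for this example.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the current environment."""
    return {
        # Default encoding for open() in text mode when none is given.
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when printing to the terminal (may be None in Python 2).
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in sorted(describe_encodings().items()):
        print("{}: {}".format(name, value))
```

If any of these report an ASCII variant rather than UTF-8, the locale is probably not configured, which may well be the case in a bare Docker image.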

Ah, yeah, I'm running in Docker and didn't configure any locale settings. Maybe we could use…, encoding='utf-8', mode='a') in utils/ so it doesn't matter what the environment variables are.
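A minimal sketch of that idea, assuming the standard-library `io.open` (which may or may not be the call meant above) and a made-up helper name `append_line`:

```python
# -*- coding: utf-8 -*-
import io


def append_line(path, text):
    """Append one line of text to `path`, always encoded as UTF-8.

    Hypothetical helper: because the encoding is passed explicitly,
    the locale environment variables no longer affect the file.
    """
    with io.open(path, mode="a", encoding="utf-8") as output:
        output.write(text + u"\n")
```

`io.open` accepts the `encoding` argument on both Python 2.7 and Python 3, so the same call works on either version.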

I took a quick look at the code, and that would only fix the cases where the stream is a file. The locale still needs to be set for stdio.

Mmh yes, might need to revise this code. It’s a bit tricky as it should work on Python 2 and Python 3, for files and stdout.
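One possible shape for such a revision is sketched below. It is only an illustration, not the code in OpenNMT-tf: the helper name `utf8_writer` is made up, and it assumes the stream is either a byte stream or a Python 3 text stream that exposes its underlying byte stream as `.buffer`.

```python
# -*- coding: utf-8 -*-
import codecs


def utf8_writer(stream):
    """Wrap `stream` so that text written to it is encoded as UTF-8.

    Hypothetical helper. Python 3 text streams (sys.stdout, files opened
    in text mode) expose the underlying byte stream as `.buffer`; Python 2's
    sys.stdout and files opened in binary mode are byte streams already
    and are used directly.
    """
    raw = getattr(stream, "buffer", stream)
    return codecs.getwriter("utf-8")(raw)
```

Usage would be along the lines of `utf8_writer(sys.stdout).write(u"ᚘ22ᚆ\n")`. Because this bypasses the text layer, anything already written through the original text stream should be flushed first, otherwise the output order may be mixed up. A purely in-memory text stream such as `io.StringIO` has no byte layer and is not covered by this sketch.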

Yes, in Python 3.x, open ==, so open is preferred. But for backwards compatibility with 2.7.x, is the way to go. I think it’s ok to leave in the 3rd party code, but ideally, the handling would be consistent throughout the code. I submitted a PR for changing the plain “open” to “”, but I left out what to do about stdout, as this is a separate issue, I think.

I see the PR is failing. Will investigate further…

Thanks for the PR! Linking to it below for future reference:

