UnicodeEncodeError for build vocab

mayaKaplansky · November 23, 2020, 9:17pm

Hi
I created new source and target files with features, using the | separator.
To do that, I saved the source file with UTF-8 encoding.
Now when I run build vocab, it fails on a Unicode character:

[2020-11-23 20:57:37,067 INFO] Counter vocab from 1500000 samples.
[2020-11-23 20:57:37,067 INFO] Build vocab on 1500000 transformed examples/corpus.
[2020-11-23 20:57:53,395 INFO] Counters src:639
[2020-11-23 20:57:53,395 INFO] Counters tgt:45
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\onmt_build_vocab.exe\__main__.py", line 7, in <module>
  File "c:\programdata\anaconda3\lib\site-packages\onmt\bin\build_vocab.py", line 66, in main
    build_vocab_main(opts)
  File "c:\programdata\anaconda3\lib\site-packages\onmt\bin\build_vocab.py", line 53, in build_vocab_main
    save_counter(src_counter, opts.src_vocab)
  File "c:\programdata\anaconda3\lib\site-packages\onmt\bin\build_vocab.py", line 45, in save_counter
    fo.write(tok + "\t" + str(count) + "\n")
  File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uffe8' in position 1: character maps to <undefined>

What am I doing wrong? Thanks
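
A note on the traceback: the last frames show the write going through the cp1252 codec, which has no mapping for '\uffe8', so the failure happens when the vocab file is written, not when the source file is read. The sketch below only reproduces that mechanism in isolation. It is not OpenNMT-py's code, and `can_encode` and `write_counts` are hypothetical helper names made up for the illustration.

```python
import os
import tempfile

# The character named in the UnicodeEncodeError above.
SEPARATOR = "\uffe8"


def can_encode(text, encoding):
    """Return True if `text` can be represented in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def write_counts(counts, path, encoding):
    """Write token<TAB>count lines using an explicit file encoding."""
    with open(path, "w", encoding=encoding) as fo:
        for tok, count in counts.items():
            fo.write(tok + "\t" + str(count) + "\n")


print(can_encode(SEPARATOR, "cp1252"))  # False: cp1252 has no such character
print(can_encode(SEPARATOR, "utf-8"))   # True

# Passing encoding="utf-8" explicitly avoids depending on the platform default.
path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
write_counts({"word" + SEPARATOR + "N": 3}, path, "utf-8")
```

On Python 3.7 or later, setting the environment variable PYTHONUTF8=1 should make `open()` default to UTF-8 as well, although that would only get past this encoding error and would not add feature support.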

francoishernandez · November 23, 2020, 9:44pm

Source features are not supported yet in OpenNMT-py 2.0.
You can experiment with the legacy branch on GitHub, or with a version below 2.0 on PyPI.

mayaKaplansky · November 24, 2020, 6:30pm

Hi
I’m not sure how to use either option. Can you point me to a guide?
And does the TensorFlow version support features?

francoishernandez · November 24, 2020, 9:10pm

legacy branch: https://github.com/OpenNMT/OpenNMT-py/tree/legacy
1.2.0 version on PyPI: https://pypi.org/project/OpenNMT-py/1.2.0/
(these are basically the same thing)

The online docs are for v2, but you can find the legacy docs here: https://github.com/OpenNMT/OpenNMT-py/tree/legacy/docs/source

For the features, you just need to attach them to each token with the | character before preprocessing your data.
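
As a rough illustration of attaching features to tokens, here is a minimal sketch. The helper name `attach_features`, the example tokens, and the case-feature values are all made up for the example. The separator is assumed to be the '\uffe8' character reported in the traceback earlier in the thread, so check the legacy docs for the exact format expected.

```python
# Assumption: the separator is the character reported in the traceback.
SEPARATOR = "\uffe8"


def attach_features(tokens, features, sep=SEPARATOR):
    """Join each token with its feature and return one space-separated line."""
    if len(tokens) != len(features):
        raise ValueError("need exactly one feature per token")
    return " ".join(tok + sep + feat for tok, feat in zip(tokens, features))


# Hypothetical example: a case feature per token (C = capitalized, l = lowercase).
line = attach_features(["Hello", "world"], ["C", "l"])
print(line)
```

When saving lines like this to a file, opening it with an explicit encoding="utf-8" keeps the result independent of the platform default.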

mayaKaplansky · November 25, 2020, 6:59pm

Thank you!