Hi
I created new source and target files with features, using the | separator.
To do that, I saved the source file with UTF-8 encoding.
Now when I run build vocab, it fails with a UnicodeEncodeError while saving the vocab:
[2020-11-23 20:57:37,067 INFO] Counter vocab from 1500000 samples.
[2020-11-23 20:57:37,067 INFO] Build vocab on 1500000 transformed examples/corpus.
[2020-11-23 20:57:53,395 INFO] Counters src:639
[2020-11-23 20:57:53,395 INFO] Counters tgt:45
Traceback (most recent call last):
  File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\onmt_build_vocab.exe\__main__.py", line 7, in <module>
  File "c:\programdata\anaconda3\lib\site-packages\onmt\bin\build_vocab.py", line 66, in main
    build_vocab_main(opts)
  File "c:\programdata\anaconda3\lib\site-packages\onmt\bin\build_vocab.py", line 53, in build_vocab_main
    save_counter(src_counter, opts.src_vocab)
  File "c:\programdata\anaconda3\lib\site-packages\onmt\bin\build_vocab.py", line 45, in save_counter
    fo.write(tok + "\t" + str(count) + "\n")
  File "c:\programdata\anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\uffe8' in position 1: character maps to <undefined>
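The last two frames suggest the vocab file is being written with the Windows default encoding (cp1252) rather than UTF-8. Here is a minimal sketch that seems to reproduce the same error outside OpenNMT. The token, file name, and the `try_write` helper are made up for illustration, and I am assuming the output file is opened without an explicit encoding; I have not checked how `save_counter` actually opens it.

```python
import os
import tempfile

# U+FFE8 is the character named in the error message above.
SEP = "\uffe8"
# Hypothetical vocab line in the same "token<TAB>count" shape as save_counter writes.
line = "hello" + SEP + "N" + "\t" + "3" + "\n"


def try_write(path, encoding):
    """Return True if the line can be written with the given encoding."""
    try:
        with open(path, "w", encoding=encoding) as fo:
            fo.write(line)
        return True
    except UnicodeEncodeError:
        return False


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "vocab.src")  # hypothetical file name

cp1252_ok = try_write(path, "cp1252")  # what open() falls back to on my machine
utf8_ok = try_write(path, "utf-8")     # explicit UTF-8

print("cp1252:", cp1252_ok, "utf-8:", utf8_ok)
```

If that is really the cause, then passing `encoding="utf-8"` where the file is opened, or running Python in UTF-8 mode (setting the `PYTHONUTF8=1` environment variable, Python 3.7+), might avoid it, but I have not verified either with `onmt_build_vocab`.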
What am I doing wrong? Thanks