Problems encountered during word segmentation

wqt01 · May 25, 2023, 4:38am

Traceback (most recent call last):
File "d:\anaconda3\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "d:\anaconda3\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "D:\Anaconda3\Scripts\onmt-build-vocab.exe\__main__.py", line 7, in <module>
File "d:\anaconda3\lib\site-packages\opennmt\bin\build_vocab.py", line 153, in main
vocab.add_from_text(data_file, tokenizer=tokenizer)
File "d:\anaconda3\lib\site-packages\opennmt\data\vocab.py", line 87, in add_from_text
for line in text:
File "d:\anaconda3\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 203, in __next__
retval = self.readline()
File "d:\anaconda3\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 167, in readline
self._preread_check()
File "d:\anaconda3\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 76, in _preread_check
self._read_buf = _pywrap_file_io.BufferedInputStream(
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 92: invalid continuation byte

The corpus is an ancient Chinese text.

guillaumekln · May 25, 2023, 8:12am

See for example:

wqt01 · June 2, 2023, 7:11pm

Sorry to bother you again; I may just be slow to understand. The example you provided does not answer my question. I have been trying different changes for several days in a row without resolving the error. May I ask again: on Windows, would applying word segmentation or other annotation to the Chinese text fix the "invalid continuation byte" error?

guillaumekln · June 5, 2023, 8:15am

What’s the encoding of your training file?

wqt01 · June 5, 2023, 2:48pm

Thank you for your reply. I resolved the issue the day after I posted my question. Thank you for your help.
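
For readers who hit the same `UnicodeDecodeError`: the thread does not record the file's actual encoding or the fix that worked. The traceback shows the file being read as UTF-8, and a byte such as 0xd5 is consistent with a corpus saved in a legacy Chinese encoding, which is common on Windows. The sketch below assumes the file is GB18030 (a guess, not something stated above) and rewrites it as UTF-8. The function name `convert_to_utf8` and its `source_encoding` default are hypothetical and chosen only for illustration.

```python
from pathlib import Path


def convert_to_utf8(src_path, dst_path, source_encoding="gb18030"):
    """Rewrite a text file as UTF-8 and return the encoding it was read with.

    source_encoding is only a guess (hypothetical default); replace it with
    the real encoding of your corpus if you know it.
    """
    raw = Path(src_path).read_bytes()
    try:
        # File is already valid UTF-8 ("utf-8-sig" also drops a leading BOM).
        text = raw.decode("utf-8-sig")
        used = "utf-8"
    except UnicodeDecodeError:
        # Strict decoding raises on bytes that are invalid in the guessed
        # encoding. A wrong guess can still decode to garbled text, so
        # inspect the output before training on it.
        text = raw.decode(source_encoding)
        used = source_encoding
    # Write bytes directly so line endings are left untouched.
    Path(dst_path).write_bytes(text.encode("utf-8"))
    return used
```

After converting, open the output in an editor to confirm the characters look right, then point `onmt-build-vocab` at the converted file instead of the original.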