Do the training and validation data need word segmentation?

genbei · June 27, 2020, 6:57am

I’m a novice running a Chinese-English translation experiment. Does the Chinese data need word segmentation?
I ran into this error:
raise value
AssertionError

genbei · June 28, 2020, 1:41am

francoishernandez · June 29, 2020, 7:14am

Hi there,
The two are probably linked: the trace seems to point at an issue in the vocabulary building process, which can’t be done properly if the data is not segmented.
You could try a simple “character” tokenization, for instance, to validate the rest of your pipeline.
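
As a rough illustration of the “character” tokenization idea, here is a minimal sketch in plain Python. It only puts a space between every non-whitespace character, so each Chinese character becomes its own token. The function names and the file paths in the usage note are hypothetical, not part of any toolkit, and this is a sanity-check step rather than a replacement for a proper word segmenter.

```python
def char_tokenize(line: str) -> str:
    """Return the line with a single space between every non-whitespace character."""
    # Existing whitespace (including newlines and full-width spaces) is dropped,
    # so every remaining character becomes a separate token.
    return " ".join(ch for ch in line if not ch.isspace())


def char_tokenize_file(src_path: str, out_path: str) -> None:
    """Character-tokenize a UTF-8 text file line by line (paths are hypothetical)."""
    with open(src_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            out.write(char_tokenize(line) + "\n")
```

For example, char_tokenize_file("train.zh", "train.tok.zh") would write a space-separated copy of a hypothetical training file. The same step would need to be applied to the validation data. Note that this also splits any Latin words or numbers in the Chinese text into single characters, which is acceptable for checking the pipeline but not ideal for a final model.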