Do the training and validation set data need word segmentation?

I’m a novice, and I’m doing a Chinese-English translation experiment. Does the Chinese data need word segmentation?
I encountered this error:
raise value
AssertionError

Hi there,
This is probably related: the traceback seems to point at an issue in the vocabulary-building process, which can’t be done properly if the data is not segmented.
You could try simple character-level tokenization, for instance, to validate the rest of your pipeline.
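As an illustration, here is a minimal sketch of character-level tokenization for Chinese (not tied to any particular toolkit): it splits each CJK character into its own token while keeping runs of ASCII letters and digits together, so mixed Chinese-English lines stay readable.

```python
import re

def char_tokenize(line: str) -> str:
    """Space-separate every character, but keep ASCII words/numbers intact.

    Runs of ASCII letters/digits are matched as one token; any other
    non-whitespace character becomes its own token.
    """
    tokens = re.findall(r"[A-Za-z0-9]+|\S", line)
    return " ".join(tokens)

# Example: "我爱NLP" -> "我 爱 NLP"
print(char_tokenize("我爱NLP"))
```

You would apply this to both the training and validation source files before building the vocabulary; once the pipeline runs end to end, you can swap in a proper word segmenter (e.g. jieba) or a subword model.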