I’m a novice running a Chinese-English translation experiment. Does the Chinese data need word segmentation?
I ran into this error:
raise value
AssertionError
Hi there,
This is probably linked: the trace seems to point at an issue in the vocabulary building process, which can’t be done properly if the data is not segmented.
You could try a simple “character” tokenization, for instance, to validate the rest of your pipeline.
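In case it helps, here is a minimal sketch of what such a character tokenization could look like in plain Python. It is not part of any toolkit: the names `char_tokenize` and `tokenize_file`, and whatever file paths you pass in, are made up for illustration. It assumes UTF-8 text with one sentence per line, and it treats every non-whitespace character as its own token, so any Latin words or numbers in the data get split into characters as well.

```python
def char_tokenize(line: str) -> str:
    """Return the line with every non-whitespace character separated by a space."""
    return " ".join(ch for ch in line if not ch.isspace())


def tokenize_file(src_path: str, out_path: str) -> None:
    """Character-tokenize a UTF-8 file line by line (paths are hypothetical)."""
    with open(src_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            out.write(char_tokenize(line) + "\n")


if __name__ == "__main__":
    print(char_tokenize("我爱自然语言处理"))  # prints: 我 爱 自 然 语 言 处 理
```

The tokenized file would then replace the raw Chinese file as input to the vocabulary building step. This is only meant to check that the pipeline runs end to end; a proper word segmenter would normally be the next thing to try.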