Language model generation: supporting input that is not in BPE format

The input format for text generation seems to be very rigid. The data I am using cannot be converted to BPE format. Is input in this format supported? I am following the steps below, but in step 1 I cannot produce the BPE files for the datasets, such as train.bpe, test.bpe, and valid.bpe.
I want to know how to handle data in which every token is a single character, so that BPE splitting and merging cannot be performed, and whether language model generation supports this kind of input.
Specifically, my input consists of single Chinese characters. The language model expects BPE output as its input, but when every token is a single Chinese character, BPE cannot produce a vocabulary, so I never get subword.bpe. How should I handle the input in this case?
For example:
诺 氟 沙 星 遇 丙 二 酸 及 酸 酐 水 浴 加 热 后 显
找 朋 友 要 的 下 雪 的 他 们 那 边 今 天 下 了
我 爸 说 没 有 都 租 掉 了 的
谁 劳 动 去 了 自 己 摘 去 了
这 家 伙 这 俩 天 睡 不 醒 早 上 困 的 不 行 要 死

If your data is already tokenized (here you applied character-based tokenization), then it does not make sense to apply BPE to it.

You should skip the BPE part in this example.
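
To illustrate the point, here is a rough Python sketch (not part of any toolkit). It assumes one sentence per line with characters separated by spaces, as in the data above. The function names `count_bpe_pairs` and `build_char_vocab` and the special tokens are hypothetical, and the pair counting is a simplification of BPE learning that ignores end-of-word markers. It shows that there are no symbol pairs for BPE to merge, and that a vocabulary can be built directly from the characters instead.

```python
from collections import Counter


def count_bpe_pairs(lines):
    """Count adjacent symbol pairs inside each whitespace-separated token.

    BPE learns merges from these pairs. Simplified: no end-of-word marker.
    """
    pairs = Counter()
    for line in lines:
        for token in line.split():
            symbols = list(token)
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += 1
    return pairs


def build_char_vocab(lines, specials=("<unk>", "<s>", "</s>")):
    """Build a vocabulary straight from the tokens, most frequent first.

    The special tokens are placeholders; use whatever your toolkit expects.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    return list(specials) + [tok for tok, _ in counts.most_common()]


corpus = [
    "我 爸 说 没 有 都 租 掉 了 的",
    "谁 劳 动 去 了 自 己 摘 去 了",
]

pairs = count_bpe_pairs(corpus)
vocab = build_char_vocab(corpus)

print(len(pairs))  # 0: every token is one character, so nothing can be merged
print(len(vocab))  # special tokens plus one entry per distinct character
```

Under these assumptions, the character-tokenized train, valid and test files would be used wherever the example expects the .bpe files.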

Thank you very much for your answer. I will try what you suggested.