Language model generation with input that is not in BPE format

Andrewlesson · July 5, 2022, 7:23am

I find that the input format for text generation is quite rigid. The data I am using cannot be put into BPE format. Is input in this format supported? I followed the steps in the example below, but in step 1 I cannot produce the BPE files for my datasets, such as train.bpe, test.bpe, and valid.bpe.
https://opennmt.net/OpenNMT-py/examples/LanguageModelGeneration.html
I want to know how to handle data that consists of single words, where BPE splitting and merging cannot be performed, and whether language model generation supports this kind of single-word input.
Specifically, my input is individual Chinese characters. The language model example expects BPE output as its input, but when the input is single Chinese characters, BPE cannot produce a vocabulary, so I never get subword.bpe. How should I handle the input in this case?
For example:
诺氟沙星遇丙二酸及酸酐水浴加热后显
找朋友要的下雪的他们那边今天下了
我爸说没有都租掉了的
谁劳动去了自己摘去了
这家伙这俩天睡不醒早上困的不行要死

guillaumekln · July 5, 2022, 7:42am

If your data is already tokenized (here you applied a character-based tokenization), then it does not make sense to apply BPE to it.

You should skip the BPE part in this example.
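
If the raw lines are not yet split into characters, a character-based tokenization can be applied with a few lines of Python. This is only a minimal sketch: the helper names `char_tokenize` and `tokenize_file` and the file names in the usage note are hypothetical, not part of OpenNMT-py. It assumes that each token should be a single character separated by one space, and that whitespace in the raw text can be dropped.

```python
def char_tokenize(line: str) -> str:
    """Return the line with every non-whitespace character separated by one space."""
    return " ".join(ch for ch in line if not ch.isspace())


def tokenize_file(src_path: str, dst_path: str) -> None:
    """Write a character-tokenized copy of src_path to dst_path (hypothetical helper)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(char_tokenize(line) + "\n")
```

For example, `tokenize_file("train.txt", "train.tok")` would write one space-separated line per input line. The resulting files would then be used wherever the example expects the .bpe files, with the BPE-related steps and settings left out.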

Andrewlesson · July 6, 2022, 12:27pm

Thank you very much for your answer. I will try what you suggested.