Empty line in corpus #####

Bouncyknighter · February 25, 2023, 1:00am

Hello,

I have started training a new dataset, but whenever I begin training, a terminal message appears stating “Empty line in corpus_1#4332…” etc. I am training the dataset on my CPU because PyTorch is not utilizing my GPU, but I am unsure if this is the issue. Should I reformat the training datasets to remove any empty lines?

The dataset contains Cyrillic text, and the target data is in English. I have already formatted the .txt files to remove empty lines, but the terminal message still appears stating that there are empty lines in the corpus.

SamuelLacombe · February 25, 2023, 3:05pm

Hello,

You most likely have an empty line in your tokenized file. Instead of looking at your initial file, focus on your resulting tokenized file. Once you have identified the line number that is empty, go back to your initial file and check what could have caused the problem.
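
For illustration, here is a minimal Python sketch of that check. It is not part of any training tool; the function name `find_empty_lines` and the file name in the usage note are hypothetical, and it assumes the tokenized file is plain text that can be read as UTF-8.

```python
def find_empty_lines(path, encoding="utf-8"):
    """Return the 1-based numbers of lines that are empty or whitespace-only."""
    empty = []
    with open(path, encoding=encoding) as f:
        for number, line in enumerate(f, start=1):
            # A line containing only spaces or tabs counts as empty here.
            if not line.strip():
                empty.append(number)
    return empty
```

Calling `find_empty_lines("train.tok.txt")` (a placeholder path) returns the line numbers to look up in the initial file.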

Best regards,
Samuel

Bouncyknighter · February 25, 2023, 9:07pm

Thanks, I have solved the problem by using UTF-8 encoding. There was an encoding problem with the Cyrillic dataset.
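
For anyone with the same problem, a rough Python sketch of converting a text file to UTF-8 follows. The thread does not say what the original encoding was, so `cp1251` (a common Windows encoding for Cyrillic) is only an assumption, and the function name `reencode_to_utf8` is hypothetical.

```python
def reencode_to_utf8(src_path, dst_path, src_encoding="cp1251"):
    """Read a text file in src_encoding and write it back out as UTF-8."""
    # src_encoding defaults to cp1251 as an assumption; change it to match your file.
    # newline="" keeps the original line endings unchanged.
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

If the source encoding is guessed wrong, the read may fail with a `UnicodeDecodeError` or produce garbled text, so it is worth opening the converted file and checking it before training.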