Issues while training on a 180MB dataset using PyTorch

Hi,

I am currently training a model for ENG to GER translation.

The dataset has the following number of lines:
SOURCE DATA
(base) fujiadmin@offlinetranslation:~/pytorch/OpenNMT-py/data$ wc -l europarl-v7.de-en.en
1920209 europarl-v7.de-en.en

DESTINATION DATA
(base) fujiadmin@offlinetranslation:~/pytorch/OpenNMT-py/data$ wc -l europarl-v7.de-en.de
1920209 europarl-v7.de-en.de

Out of the above 1920209 lines, I have used 5000 lines as validation text for both source and destination.
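
For reference, a split like that can be made with a few lines of Python. This is only a minimal sketch (it takes the last 5000 lines as validation; the output file names train.en, train.de, valid.en, valid.de are placeholders, not necessarily the ones used here):

# Minimal sketch: hold out the last 5000 parallel lines as validation data.
# Output file names are placeholders.
VALID_SIZE = 5000

with open("europarl-v7.de-en.en") as src, open("europarl-v7.de-en.de") as tgt:
    src_lines = src.readlines()
    tgt_lines = tgt.readlines()

with open("train.en", "w") as f:
    f.writelines(src_lines[:-VALID_SIZE])
with open("train.de", "w") as f:
    f.writelines(tgt_lines[:-VALID_SIZE])
with open("valid.en", "w") as f:
    f.writelines(src_lines[-VALID_SIZE:])
with open("valid.de", "w") as f:
    f.writelines(tgt_lines[-VALID_SIZE:])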

I am getting the following error message after training 10000 of the total 100000 epochs.

The error messages are as follows:

number of examples: 924406
number of examples: 5000

Could someone please assist me in resolving this issue?

Thank You,
Kishor.

Hi,
Are there any empty lines in your validation set?
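
You can check quickly with a couple of lines of Python (the path below is a placeholder for your validation file):

# Count empty (whitespace-only) lines; the path is a placeholder.
with open("your_valid_source") as f:
    empty = sum(1 for line in f if line.strip() == "")
print(empty, "empty lines found")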

Hi,

Yeah, there are many empty lines in the validation set. How do we get rid of them?

Thank you,
Kishor.

A few Python lines that should do the trick:

with open("your_valid_source") as vs, open("your_valid_target") as vt,\
    open("clean_valid_source") as cvs, open("clean_valid_target") as cvt:
    for s, t in zip(vs, vt):
        if s.strip() == "" or t.strip() == "":
            cvs.write(s)
            cvt.write(t)

You could also achieve this with some combination of paste and awk in a Unix shell.