Model training: OpenNMT says that I have empty lines in the data, but I don't!

kargintima · November 18, 2019, 4:11pm

I am using OpenNMT-py.
Everything in the quickstart (http://opennmt.net/OpenNMT-py/quickstart.html) works well.
Now I want to train on my own data.
The files are clean; I even tried making simple 30-line files. I can see that there are no empty lines.
I displayed all non-printing characters, and there is nothing except ‘\n’.
So I am not sure what else to try.
OpenNMT says:

[2019-11-18 19:08:16,547 INFO] Step 10000/100000; acc:  64.88; ppl:  6.94; xent: 1.94; lr: 1.00000; 11576/11541 tok/s;    365 sec
[2019-11-18 19:08:16,547 INFO] Loading dataset from data/demo.valid.0.pt
[2019-11-18 19:08:21,239 INFO] number of examples: 487196
Traceback (most recent call last):
  File "/home/nmt/.pyenv/versions/main/bin/onmt_train", line 11, in <module>
    load_entry_point('OpenNMT-py==1.0.0rc2', 'console_scripts', 'onmt_train')()
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/bin/train.py", line 200, in main
    train(opt)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/bin/train.py", line 86, in train
    single_main(opt, 0)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/train_single.py", line 143, in main
    valid_steps=opt.valid_steps)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/trainer.py", line 258, in train
    valid_iter, moving_average=self.moving_average)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/trainer.py", line 314, in validate
    outputs, attns = valid_model(src, tgt, src_lengths)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/models/model.py", line 42, in forward
    enc_state, memory_bank, lengths = self.encoder(src, lengths)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/encoders/rnn_encoder.py", line 74, in forward
    packed_emb = pack(emb, lengths_list)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/utils/rnn.py", line 275, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0

francoishernandez · November 18, 2019, 7:48pm

I displayed all non-printing characters, and there is nothing except ‘\n’.

Not sure I follow here, but are you saying that when this error is raised, the line contains only \n?
If so, that is the very definition of an empty line.

kargintima · November 19, 2019, 2:26pm

Just a little misunderstanding.
I have ‘\n’ at the end of every line, and there are no other special characters in my files.
Example:

me@mypc:~/mydir$ cat -A data/train.clean.en
the hotel stands on a hill$
how are things going with tom$
they became nervous$
when did you tell tom$

And so on.
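
A visual check like the one above can miss whitespace-only lines or a source/target length mismatch, so a small script may help. This is only a minimal sketch using the Python standard library; the function names are hypothetical, and any target file path you pass (for example data/train.clean.ru) is an assumption, since only the .en file appears in this thread.

```python
def find_blank_lines(path):
    """Return 1-based numbers of lines that are empty or whitespace-only."""
    blank = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            # str.strip() also removes tabs and other Unicode whitespace,
            # so a line holding only such characters counts as blank here.
            if not line.strip():
                blank.append(number)
    return blank


def count_lines(path):
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_parallel(src_path, tgt_path):
    """Report blank lines on each side and whether the line counts match."""
    src_lines = count_lines(src_path)
    tgt_lines = count_lines(tgt_path)
    return {
        "src_blank": find_blank_lines(src_path),
        "tgt_blank": find_blank_lines(tgt_path),
        "src_lines": src_lines,
        "tgt_lines": tgt_lines,
        "same_length": src_lines == tgt_lines,
    }
```

Calling check_parallel on the training pair and again on the validation pair should show whether any raw text file contains a blank line, and whether both sides have the same number of lines.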

francoishernandez · November 19, 2019, 4:12pm

Oh, that’s strange indeed. It seems this happens on your validation set. Are you sure there are no empty lines there either? Could there be one at the end of the file, for instance?
You may try to load it and loop over it to identify the issue more easily. Something like this, for instance:

import torch
data = torch.load("your_valid_pt_file")
for i, item in enumerate(data.examples):
    if len(item.src) == 0:
        print("found an empty line", i)

kargintima · November 20, 2019, 9:06am

I checked again with different text editors: no empty lines.
About your script:
it fails with “AttributeError: ‘dict’ object has no attribute ‘examples’”.
I am new to PyTorch, so I am not sure how it works. Maybe I need to tell the script that “data” isn’t just a plain Python dict?

francoishernandez · November 20, 2019, 9:24am

This is not supposed to happen. Looks like you tried to load a checkpoint, not the validation set shard. It should be something like “<your_dataset_name>.valid.0.pt”.
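
To make that kind of mix-up easier to spot, the loaded object can be inspected before looping over it. The following is only a rough sketch: describe_loaded is a hypothetical helper, and it assumes nothing beyond what this thread shows, namely that a dataset shard exposes an examples attribute while the wrong file loaded as a plain dict.

```python
def describe_loaded(obj):
    """Guess what kind of object torch.load returned (heuristic only)."""
    if hasattr(obj, "examples"):
        # Dataset shards expose their examples as a list-like attribute.
        return "dataset shard with {} examples".format(len(obj.examples))
    if isinstance(obj, dict):
        # A plain dict is not a dataset shard; list its keys as a hint.
        return "dict with keys {}".format(sorted(str(key) for key in obj))
    return "unexpected type {}".format(type(obj).__name__)
```

With PyTorch installed, print(describe_loaded(torch.load("data/demo.valid.0.pt"))) should report a dataset shard; a dict result suggests a different file was loaded.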

kargintima · November 20, 2019, 9:32am

Yes, you are right, wrong file.
I loaded my demo.valid.0.pt instead, but the script prints nothing. So the shard seems fine.

francoishernandez · November 20, 2019, 1:01pm

Hmmm, there is not much more we can do from here.
Would you mind sharing this sample validation shard so that I can take a look?

kargintima · November 20, 2019, 1:14pm

Sure:

francoishernandez · November 20, 2019, 1:42pm

My bad, the example script I gave you missed one level of indexing for src (not sure why it’s stored in a nested list but that’s another matter).
So, with the proper indexing:

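import torch
data = torch.load("your_valid_pt_file")
for i, item in enumerate(data.examples):
    # src is stored as a nested list, so index into it before taking the length
    if len(item.src[0]) == 0:
        print(item.__dict__)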

there are a lot of empty sources:

{'src': [[]], 'tgt': [['может', 'для', 'него', 'это', 'будет', 'то', 'же', 'самое']], 'indices': 167}
{'src': [[]], 'tgt': [['я', 'люблю', 'тебя']], 'indices': 256}
{'src': [[]], 'tgt': [['я', 'вас', 'люблю']], 'indices': 258}
{'src': [[]], 'tgt': [['ты', 'обращаешься', 'ко', 'мне']], 'indices': 345}
{'src': [[]], 'tgt': [['это', 'то', 'что', 'я', 'искал', 'воскликнул', 'он']], 'indices': 365}

and so on.
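
Printing every empty example gets noisy on a large shard, so a summary count may be more practical. Here is a minimal sketch; count_empty is a hypothetical helper, and it assumes every example stores src and tgt as nested lists like the ones printed above.

```python
def count_empty(examples):
    """Count examples whose src or tgt token list is empty.

    Assumes each example exposes src and tgt as nested lists, e.g. [[tok, ...]].
    """
    counts = {"total": 0, "empty_src": 0, "empty_tgt": 0}
    for example in examples:
        counts["total"] += 1
        if len(example.src[0]) == 0:
            counts["empty_src"] += 1
        if len(example.tgt[0]) == 0:
            counts["empty_tgt"] += 1
    return counts
```

Passing data.examples from the loaded shard to count_empty gives the total number of examples next to the number of empty sources and targets.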

kargintima · November 20, 2019, 1:55pm

Ok, but how?

kargintima · November 21, 2019, 7:05am

Also, I have 487166 lines of training data and 3000 lines of validation data, and this script says that 484250 of the sources are empty. That is a huge portion.

francoishernandez · November 21, 2019, 9:28am

This very much looks like there was some mix-up with your datasets.
The shard you sent is called “valid” and contains 487196 targets, but only 2946 non-empty sources.
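
The numbers in this thread can be cross-checked with a few lines of arithmetic. This is just an illustrative sketch; summarize_shard is a hypothetical helper, and the three values are the ones quoted in the posts and the training log above.

```python
def summarize_shard(total_examples, non_empty_sources, expected_lines):
    """Compare a shard's contents with the size of the text it was built from."""
    return {
        "empty_sources": total_examples - non_empty_sources,
        "size_matches_text": total_examples == expected_lines,
    }


# 487196 examples in the shard, 2946 non-empty sources,
# and 3000 lines in the validation text files.
summary = summarize_shard(487196, 2946, 3000)
print(summary)  # {'empty_sources': 484250, 'size_matches_text': False}
```

The 484250 empty sources match the count reported earlier, and the shard is far larger than the 3000-line validation files, which is consistent with the shard not having been built from those files alone.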

kargintima · November 21, 2019, 11:12am

Yes, everything works now. I still don’t know what happened, but I removed the whole “data” directory, created it again, and preprocessed the data again (the same way as before), and now everything works.