OpenNMT Forum

Model training: OpenNMT says that I have empty lines in the data, but I don't!

I am using OpenNMT-py.
Everything in the http://opennmt.net/OpenNMT-py/quickstart.html works well.
Now I want to train on my data.
The files are clean; I even tried simple 30-line files. I can see that there are no empty lines.
I tried displaying every type of character, including invisible ones, but there is nothing except ‘\n’.
So I am not sure what to try next.
OpenNMT says:

[2019-11-18 19:08:16,547 INFO] Step 10000/100000; acc:  64.88; ppl:  6.94; xent: 1.94; lr: 1.00000; 11576/11541 tok/s;    365 sec
[2019-11-18 19:08:16,547 INFO] Loading dataset from data/demo.valid.0.pt
[2019-11-18 19:08:21,239 INFO] number of examples: 487196
Traceback (most recent call last):
  File "/home/nmt/.pyenv/versions/main/bin/onmt_train", line 11, in <module>
    load_entry_point('OpenNMT-py==1.0.0rc2', 'console_scripts', 'onmt_train')()
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/bin/train.py", line 200, in main
    train(opt)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/bin/train.py", line 86, in train
    single_main(opt, 0)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/train_single.py", line 143, in main
    valid_steps=opt.valid_steps)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/trainer.py", line 258, in train
    valid_iter, moving_average=self.moving_average)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/trainer.py", line 314, in validate
    outputs, attns = valid_model(src, tgt, src_lengths)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/models/model.py", line 42, in forward
    enc_state, memory_bank, lengths = self.encoder(src, lengths)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/OpenNMT_py-1.0.0rc2-py3.7.egg/onmt/encoders/rnn_encoder.py", line 74, in forward
    packed_emb = pack(emb, lengths_list)
  File "/home/nmt/.pyenv/versions/main/lib/python3.7/site-packages/torch/nn/utils/rnn.py", line 275, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0

I tried displaying every type of character, including invisible ones, but there is nothing except ‘\n’.

I'm not sure I follow here. Are you saying that when this error is raised, the line contains only \n?
If so, that is the very definition of an empty line.

Just a little misunderstanding.
I have ‘\n’ at the end of every line, and there are no other special characters in my files.
Example:

me@mypc:~/mydir$ cat -A data/train.clean.en
the hotel stands on a hill$
how are things going with tom$
they became nervous$
when did you tell tom$

And so on.
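
For anyone checking their own files, here is a minimal sketch of a programmatic check instead of eyeballing the `cat -A` output. It flags lines that are empty or contain only whitespace, on the assumption that preprocessing splits on whitespace, so such a line would end up with zero tokens. The function name `find_blank_lines` and the file path in the comment are placeholders, not part of OpenNMT-py.

```python
def find_blank_lines(lines):
    """Return 1-based numbers of lines that are empty or whitespace-only."""
    blank = []
    for number, line in enumerate(lines, start=1):
        # str.strip() with no arguments removes all leading and trailing
        # whitespace, so a line holding only spaces or tabs counts as blank.
        if not line.strip():
            blank.append(number)
    return blank


# In-memory sample standing in for a real file.
sample = ["the hotel stands on a hill\n", "\n", "they became nervous\n", " \t\n"]
print(find_blank_lines(sample))

# For a real file (the path is a placeholder):
# with open("data/valid.clean.en", encoding="utf-8") as f:
#     print(find_blank_lines(f))
```

It is worth running this on the validation files as well as the training files, on both the source and the target side.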

Oh, that’s strange indeed. It seems this happens on your valid set. Are you sure there are no empty lines there either? Could there be one at the end of the file, for instance?
You may try to load it and loop over it to identify the issue more easily. Something like this, for instance:

import torch
data = torch.load("your_valid_pt_file")
for i, item in enumerate(data.examples):
    if len(item.src) == 0:
        print("found an empty line", i)

I checked again with different text editors: no empty lines.
About your script:
it says “AttributeError: ‘dict’ object has no attribute ‘examples’”.
I am new to PyTorch, so I am not sure how it works. Maybe I should tell the script that “data” isn’t just a Python dict?

This is not supposed to happen. Looks like you tried to load a checkpoint, not the validation set shard. It should be something like “<your_dataset_name>.valid.0.pt”.
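
A quick way to tell the two kinds of file apart is to look at what `torch.load` returned before looping over it. The sketch below is only an illustration: `describe_loaded` is a made-up helper, and the stand-in objects (including the dict keys) are invented so the snippet runs without any real .pt file.

```python
from types import SimpleNamespace


def describe_loaded(obj):
    """Summarise a loaded object: plain dict, dataset-like, or something else."""
    if isinstance(obj, dict):
        # A plain dict has no .examples attribute, hence the AttributeError.
        return "dict with keys: " + ", ".join(sorted(str(k) for k in obj))
    if hasattr(obj, "examples"):
        return "dataset with %d examples" % len(obj.examples)
    return "unexpected type: " + type(obj).__name__


# Stand-ins for illustration only; the key names are assumptions.
fake_checkpoint = {"model": {}, "opt": None}
fake_shard = SimpleNamespace(examples=[1, 2, 3])

print(describe_loaded(fake_checkpoint))
print(describe_loaded(fake_shard))
```

With a real file you would pass in the result of `torch.load(...)` instead of the stand-ins.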

Yes, you are right, wrong file.
I loaded my demo.valid.0.pt this time, but the script outputs nothing. So everything looks fine.

Hmmm, there is not much more we can do from here.
Would you mind sharing this sample validation shard so that I can take a look?

Sure:

My bad, the example script I gave you missed one level of indexing for src (not sure why it’s stored in a nested list but that’s another matter).
So, with the proper indexing:

for i, item in enumerate(data.examples):
    if len(item.src[0]) == 0:
        print(item.__dict__)

there are a lot of empty sources:

{'src': [[]], 'tgt': [['может', 'для', 'него', 'это', 'будет', 'то', 'же', 'самое']], 'indices': 167}
{'src': [[]], 'tgt': [['я', 'люблю', 'тебя']], 'indices': 256}
{'src': [[]], 'tgt': [['я', 'вас', 'люблю']], 'indices': 258}
{'src': [[]], 'tgt': [['ты', 'обращаешься', 'ко', 'мне']], 'indices': 345}
{'src': [[]], 'tgt': [['это', 'то', 'что', 'я', 'искал', 'воскликнул', 'он']], 'indices': 365}

and so on.

Ok, but how?

Also, I have 487166 lines of training data and 3000 lines of validation data, and this script says that 484250 of the sources are empty. That is a huge portion.

This very much looks like there was some mix up with your datasets.
The shard you sent is called “valid”, and contains 487196 targets, but only 2946 non-empty sources.
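
To get counts like these directly, the loop can be turned into a small counter. This is a rough sketch that assumes each example stores its tokens in a nested list (`src[0]`, `tgt[0]`), as in the printed examples in this thread. `count_empty` is a hypothetical helper, and the fake examples are only there so the snippet runs without a real shard.

```python
from types import SimpleNamespace


def count_empty(examples, field="src"):
    """Return (empty, total) for the given field across all examples."""
    empty = 0
    total = 0
    for ex in examples:
        total += 1
        # One extra level of nesting, matching the {'src': [[]]} output.
        tokens = getattr(ex, field)[0]
        if len(tokens) == 0:
            empty += 1
    return empty, total


# Stand-in examples for illustration only.
fake_examples = [
    SimpleNamespace(src=[[]], tgt=[["я", "люблю", "тебя"]]),
    SimpleNamespace(src=[["i", "love", "you"]], tgt=[["я", "вас", "люблю"]]),
    SimpleNamespace(src=[[]], tgt=[["ты", "обращаешься", "ко", "мне"]]),
]

print(count_empty(fake_examples, "src"))
print(count_empty(fake_examples, "tgt"))
```

On a real shard you would call `count_empty(data.examples, "src")` and `count_empty(data.examples, "tgt")`. A large gap between the two sides, or a total that does not match the line count of the text files, suggests the shard was built from the wrong or mismatched files.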

Yes, everything works now. I still don’t know what happened, but I removed the whole “data” directory, created it again, and prepared the data again (the same way as before), and now everything works.