OpenNMT

RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0

Please help me out and suggest a better option.

You might need to be more specific. Tagging every project is not the way. This error has to happen in a specific project.
Anyways, did you check your data? It seems there might be some empty lines.

It looks like this issue:


Thanks for the suggestion. I will keep that in mind for my next question.

I’m getting the same error (details outlined below). My setup uses OpenNMT 2.0 and a SentencePiece subword model.

When I use this setup to build a transformer model, it works perfectly. I’ve built 10+ transformer models with my transformer config. However, when I try to build a vanilla RNN, it fails. My vanilla config is outlined below.

Having read through the forum, I have tried all the suggested steps:

  1. Ensuring there are no blank lines in the training and validation sets, for both source and target
  2. Ensuring there are no blank lines in vocab.src and vocab.tgt (initially these files contained blanks; removing them allowed training to start, but it now fails with the error below)
--- start error ---

[2021-05-06 09:45:45,900 INFO] corpus_1's transforms: TransformPipe(SentencePieceTransform(share_vocab=False, src_subword_model=data/spm.model, tgt_subword_model=data/spm.model, src_subword_alpha=0, tgt_subword_alpha=0, src_subword_vocab=, tgt_subword_vocab=, src_vocab_threshold=0, tgt_vocab_threshold=0, src_subword_nbest=1, tgt_subword_nbest=1))
[2021-05-06 09:45:45,900 INFO] Loading ParallelCorpus(data/src-train.txt, data/tgt-train.txt, align=None)…
Traceback (most recent call last):
  File "/home/seamus/.local/bin/onmt_train", line 11, in
    load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_train')()
  File "/home/seamus/OpenNMT-py/onmt/bin/train.py", line 169, in main
    train(opt)
  File "/home/seamus/OpenNMT-py/onmt/bin/train.py", line 154, in train
    train_process(opt, device_id=0)
  File "/home/seamus/OpenNMT-py/onmt/train_single.py", line 107, in main
    trainer.train(
  File "/home/seamus/OpenNMT-py/onmt/trainer.py", line 242, in train
    self._gradient_accumulation(
  File "/home/seamus/OpenNMT-py/onmt/trainer.py", line 366, in _gradient_accumulation
    outputs, attns = self.model(
  File "/home/seamus/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/seamus/OpenNMT-py/onmt/models/model.py", line 63, in forward
    enc_state, memory_bank, lengths = self.encoder(src, lengths)
  File "/home/seamus/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/seamus/OpenNMT-py/onmt/encoders/rnn_encoder.py", line 74, in forward
    packed_emb = pack(emb, lengths_list)
  File "/home/seamus/.local/lib/python3.8/site-packages/torch/nn/utils/rnn.py", line 244, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0

--- end error ---

CONFIG.YAML

# Where the vocab(s) will be written
src_vocab: data/vocab.src
tgt_vocab: data/vocab.tgt

# Tokenisation options
src_subword_type: sentencepiece
src_subword_model: data/spm.model
tgt_subword_type: sentencepiece
tgt_subword_model: data/spm.model

# Number of candidates for SentencePiece sampling
subword_nbest: 20

# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1

# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"

# Filter
src_seq_length: 300
tgt_seq_length: 300

# Corpus opts:
data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [sentencepiece, filtertoolong]
        weight: 1
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
        transforms: [sentencepiece]

# Model Hyperparameters
save_data: data/
save_model: models/model
save_checkpoint_steps: 5000
valid_steps: 2500
train_steps: 200000
seed: 5151
report_every: 2000
keep_checkpoint: 5
rnn_size: 512

early_stopping: 4
early_stopping_criteria: accuracy

# Logging
tensorboard: true
log_file: data/training_log.txt

# Train on a single GPU
world_size: 1
gpu_ranks: [0]

Not sure where that could come from.
You might want to

  1. double-check both sides of your data for potential empty lines (for instance, lines containing only unusual whitespace that is removed during tokenization);
  2. make a reproducible example with a dataset that you can share.
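
For the first check, something along these lines might help. It is only a minimal sketch: the function names are made up for illustration and are not part of OpenNMT-py. It assumes that a line which is empty after `str.strip()` would also come out of tokenization with zero tokens; zero-width characters such as U+200B are not treated as whitespace by Python, so they are not covered here.

```python
from typing import Iterable, List


def find_blank_lines(lines: Iterable[str]) -> List[int]:
    """Return the 1-based numbers of lines that are empty or whitespace-only."""
    # str.strip() with no arguments removes every character Python considers
    # whitespace, including U+00A0 (no-break space) and U+3000 (ideographic space).
    return [number for number, line in enumerate(lines, start=1) if not line.strip()]


def find_blank_lines_in_file(path: str) -> List[int]:
    """Hypothetical convenience wrapper: scan a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return find_blank_lines(handle)
```

Running `find_blank_lines_in_file` on each of the four files in the config (data/src-train.txt, data/tgt-train.txt, data/src-val.txt, data/tgt-val.txt) should list any offending line numbers.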

You’re correct. There was a “blank line” in my source data that my script didn’t pick up. It looks like the line contained only a blank space; I thought my script would have caught that. Strangely, the same data worked with onmt_tokenize, and with the SentencePiece tokenizer for the transformer, but not with the SentencePiece tokenizer for the vanilla RNN. Anyway, it was the data and it’s sorted now. Thanks!
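
A likely explanation for the RNN-only failure is visible in the traceback: the RNN encoder calls `pack_padded_sequence`, which rejects any sequence of length zero, so a source line that tokenizes to nothing is fatal there. For anyone hitting the same problem, here is a rough sketch of how the data could be cleaned while keeping source and target aligned. The function names and the output paths in the usage note are hypothetical, and it assumes both files have the same number of lines (`zip` silently stops at the shorter one).

```python
from typing import Iterable, Iterator, Tuple


def drop_blank_pairs(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield only the (source, target) pairs where neither side is blank."""
    for src, tgt in zip(src_lines, tgt_lines):
        # A side counts as blank if nothing is left after stripping whitespace.
        if src.strip() and tgt.strip():
            yield src, tgt


def clean_parallel_files(src_in: str, tgt_in: str, src_out: str, tgt_out: str) -> int:
    """Write the filtered corpus to new files and return the number of pairs kept."""
    kept = 0
    with open(src_in, encoding="utf-8") as fs, \
            open(tgt_in, encoding="utf-8") as ft, \
            open(src_out, "w", encoding="utf-8") as out_s, \
            open(tgt_out, "w", encoding="utf-8") as out_t:
        for src, tgt in drop_blank_pairs(fs, ft):
            out_s.write(src)
            out_t.write(tgt)
            kept += 1
    return kept
```

For example, `clean_parallel_files("data/src-train.txt", "data/tgt-train.txt", "data/src-train.clean.txt", "data/tgt-train.clean.txt")` would write cleaned copies, which `path_src` and `path_tgt` in the config could then point to.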