Please help me out and suggest a better option?
You might need to be more specific. Tagging every project is not the way to go; this error has to be happening in one specific project.
Anyway, did you check your data? It seems there might be some empty lines.
It looks like this issue:
Thanks for the suggestion; I'll remember that for my next question.
I'm getting the same error (details below). My setup uses OpenNMT-py 2.0 and a SentencePiece subword model.
When I use this setup to build a transformer model, it works perfectly; I've built 10+ transformer models with my transformer config. However, when I try to build a vanilla RNN, it fails. My vanilla config is below.
Having read through the forum, I have tried all the suggested steps:
- Ensuring there are no blank lines in the training and validation sets, for both source and target (checked with a script along the lines of the sketch below)
- Ensuring there are no blank lines in vocab.src and vocab.tgt (initially there were blanks in these files; removing them let training start, but training now fails as shown below)
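For reference, this is roughly the kind of check I ran. It is only a sketch: the paths are the ones from my config, and `find_blank_lines` is just an illustrative helper, not an OpenNMT utility.

```python
# Report empty or whitespace-only lines in each data file.
def find_blank_lines(path):
    with open(path, encoding="utf-8") as f:
        return [i for i, line in enumerate(f, 1) if not line.strip()]

for path in ["data/src-train.txt", "data/tgt-train.txt",
             "data/src-val.txt", "data/tgt-val.txt"]:
    blanks = find_blank_lines(path)
    if blanks:
        print(f"{path}: blank/whitespace-only line(s) at {blanks}")
```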
— start error —
[2021-05-06 09:45:45,900 INFO] corpus_1's transforms: TransformPipe(SentencePieceTransform(share_vocab=False, src_subword_model=data/spm.model, tgt_subword_model=data/spm.model, src_subword_alpha=0, tgt_subword_alpha=0, src_subword_vocab=, tgt_subword_vocab=, src_vocab_threshold=0, tgt_vocab_threshold=0, src_subword_nbest=1, tgt_subword_nbest=1))
[2021-05-06 09:45:45,900 INFO] Loading ParallelCorpus(data/src-train.txt, data/tgt-train.txt, align=None)...
Traceback (most recent call last):
  File "/home/seamus/.local/bin/onmt_train", line 11, in <module>
    load_entry_point('OpenNMT-py', 'console_scripts', 'onmt_train')()
  File "/home/seamus/OpenNMT-py/onmt/bin/train.py", line 169, in main
    train(opt)
  File "/home/seamus/OpenNMT-py/onmt/bin/train.py", line 154, in train
    train_process(opt, device_id=0)
  File "/home/seamus/OpenNMT-py/onmt/train_single.py", line 107, in main
    trainer.train(
  File "/home/seamus/OpenNMT-py/onmt/trainer.py", line 242, in train
    self._gradient_accumulation(
  File "/home/seamus/OpenNMT-py/onmt/trainer.py", line 366, in _gradient_accumulation
    outputs, attns = self.model(
  File "/home/seamus/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/seamus/OpenNMT-py/onmt/models/model.py", line 63, in forward
    enc_state, memory_bank, lengths = self.encoder(src, lengths)
  File "/home/seamus/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/seamus/OpenNMT-py/onmt/encoders/rnn_encoder.py", line 74, in forward
    packed_emb = pack(emb, lengths_list)
  File "/home/seamus/.local/lib/python3.8/site-packages/torch/nn/utils/rnn.py", line 244, in pack_padded_sequence
    _VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: Length of all samples has to be greater than 0, but found an element in 'lengths' that is <= 0
— end error —
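From the traceback, the failure is in PyTorch's pack_padded_sequence, which the RNN encoder uses to skip padding, and which rejects any zero-length sample. A minimal standalone reproduction of the same error (not OpenNMT code, just my reading of the traceback):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# (seq_len, batch, emb_dim) batch where the second sample is empty,
# e.g. a blank line that tokenized to 0 tokens.
emb = torch.zeros(5, 2, 8)
lengths = torch.tensor([5, 0])

# Raises: RuntimeError: Length of all samples has to be greater than 0,
# but found an element in 'lengths' that is <= 0
packed = pack_padded_sequence(emb, lengths)
```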
CONFIG.YAML

# Where the vocab(s) will be written
src_vocab: data/vocab.src
tgt_vocab: data/vocab.tgt

# Tokenisation options
src_subword_type: sentencepiece
src_subword_model: data/spm.model
tgt_subword_type: sentencepiece
tgt_subword_model: data/spm.model

# Number of candidates for SentencePiece sampling
subword_nbest: 20
# Smoothing parameter for SentencePiece sampling
subword_alpha: 0.1

# Specific arguments for pyonmttok
src_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"
tgt_onmttok_kwargs: "{'mode': 'none', 'spacer_annotate': True}"

# Filter
src_seq_length: 300
tgt_seq_length: 300

# Corpus opts
data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [sentencepiece, filtertoolong]
        weight: 1
    valid:
        path_src: data/src-val.txt
        path_tgt: data/tgt-val.txt
        transforms: [sentencepiece]

# Model hyperparameters
save_data: data/
save_model: models/model
save_checkpoint_steps: 5000
valid_steps: 2500
train_steps: 200000
seed: 5151
report_every: 2000
keep_checkpoint: 5
rnn_size: 512
early_stopping: 4
early_stopping_criteria: accuracy

# Logging
tensorboard: true
log_file: data/training_log.txt

# Train on a single GPU
world_size: 1
gpu_ranks: [0]
Not sure where that could come from. You might want to:
- double-check both sides of your data for potential empty lines (e.g. unusual whitespace that would be removed when tokenizing);
- put together a reproducible example with a dataset that you can share.
You're correct. There was a "blank line" in my source data that my script didn't pick up: the line contained a single blank space, which I thought my script would have caught. It makes sense in hindsight that it worked with onmt_tokenize and with the SentencePiece tokenizer for the transformer but failed for the vanilla RNN, presumably because only the RNN encoder packs sequences by length (the pack call in the traceback above). Anyway, it was the data, and it's sorted now. Thanks!
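For future readers, a tiny illustration of why my naive blank-line check missed it:

```python
line = " \n"             # looks blank, but actually contains a space

print(line == "\n")      # False -> a naive empty-line check misses it
print(not line.strip())  # True  -> stripping whitespace catches it
```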