Runtime Error When Model Tries to Validate

BaruchG · November 6, 2019, 9:00pm

My model crashes when it runs validation during training. I get the following error (I can paste more of the output above it if that helps, but this is all that seems relevant):

[2019-11-06 20:28:37,784 INFO] Step 10000/200000; acc:  55.44; ppl:  8.81; xent: 2.18; lr: 0.00088; 55214/63742 tok/s;   4501 sec
[2019-11-06 20:28:37,786 INFO] Loading dataset from fa2en/fa2en.valid.0.pt
[2019-11-06 20:28:37,896 INFO] number of examples: 5000
Traceback (most recent call last):
  File "/opt/conda/bin/onmt_train", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 200, in main
    train(opt)
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 82, in train
    p.join()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 184, in signal_handler
    raise Exception(msg)
Exception: 

-- Tracebacks above this line can probably
                 be ignored --

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 142, in run
    single_main(opt, device_id, batch_queue, semaphore)
  File "/opt/conda/lib/python3.6/site-packages/onmt/train_single.py", line 143, in main
    valid_steps=opt.valid_steps)
  File "/opt/conda/lib/python3.6/site-packages/onmt/trainer.py", line 258, in train
    valid_iter, moving_average=self.moving_average)
  File "/opt/conda/lib/python3.6/site-packages/onmt/trainer.py", line 314, in validate
    outputs, attns = valid_model(src, tgt, src_lengths)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/models/model.py", line 42, in forward
    enc_state, memory_bank, lengths = self.encoder(src, lengths)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/encoders/transformer.py", line 121, in forward
    emb = self.embeddings(src)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/modules/embeddings.py", line 273, in forward
    source = module(source, step=step)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/modules/embeddings.py", line 50, in forward
    emb = emb + self.pe[:emb.size(0)]
RuntimeError: The size of tensor a (6376) must match the size of tensor b (5000) at non-singleton dimension 0

Both of my validation files are 5000 lines long (I double-checked after I got this error), and I assume that is what “tensor b” refers to. I’m using version 1.0.0rc2 of OpenNMT-py, and this is the first time that I’ve used the pip-installed version; it has worked perfectly in the past with a git-cloned version. I used sentencepiece for tokenizing/encoding. The command that I used for preprocessing is:
onmt_preprocess -train_src full/src-train.txt.token -train_tgt full/tgt-train.txt.token -valid_src full/src-validation.txt.token -valid_tgt full/tgt-validation.txt.token -save_data fa2en/fa2en .
The command that I used for training is:
export CUDA_VISIBLE_DEVICES=0,1,2,3
onmt_train -data fa2en/fa2en -save_model models/fa-en \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 4 -gpu_ranks 0 1 2 3 2>> output.txt

Thanks for your help!

EDIT: I just tried running again, and now I’m getting a memory error. I’ve run this many times in the past without issues, and the machine has four 32 GB GPUs available. Here’s the error:

[2019-11-07 18:10:08,103 INFO] Loading dataset from fa2en3/fa2en.train.3.pt
[2019-11-07 18:10:14,380 INFO] Step 950/200000; acc:  31.78; ppl: 48.07; xent: 3.87; lr: 0.00012; 52493/67860 tok/s;    443 sec
[2019-11-07 18:10:24,312 INFO] number of examples: 999636
[2019-11-07 18:10:35,701 INFO] Step 1000/200000; acc:  33.13; ppl: 42.99; xent: 3.76; lr: 0.00012; 51367/68638 tok/s;    464 sec
[2019-11-07 18:10:35,704 INFO] Loading dataset from fa2en3/fa2en.valid.0.pt
[2019-11-07 18:10:35,803 INFO] number of examples: 5000
Traceback (most recent call last):
  File "/opt/conda/bin/onmt_train", line 10, in <module>
Process SpawnProcess-2:
    sys.exit(main())
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 200, in main
    train(opt)
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 82, in train
    p.join()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
Traceback (most recent call last):
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 142, in run
    single_main(opt, device_id, batch_queue, semaphore)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
  File "/opt/conda/lib/python3.6/site-packages/onmt/train_single.py", line 143, in main
    valid_steps=opt.valid_steps)
    pid, sts = os.waitpid(self.pid, flag)
  File "/opt/conda/lib/python3.6/site-packages/onmt/trainer.py", line 258, in train
    valid_iter, moving_average=self.moving_average)
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 184, in signal_handler
  File "/opt/conda/lib/python3.6/site-packages/onmt/trainer.py", line 314, in validate
    outputs, attns = valid_model(src, tgt, src_lengths)
    raise Exception(msg)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/models/model.py", line 47, in forward
    memory_lengths=lengths)
Exception: 

-- Tracebacks above this line can probably
                 be ignored --

Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 142, in run
    single_main(opt, device_id, batch_queue, semaphore)
  File "/opt/conda/lib/python3.6/site-packages/onmt/train_single.py", line 143, in main
    valid_steps=opt.valid_steps)
  File "/opt/conda/lib/python3.6/site-packages/onmt/trainer.py", line 258, in train
    valid_iter, moving_average=self.moving_average)
  File "/opt/conda/lib/python3.6/site-packages/onmt/trainer.py", line 314, in validate
    outputs, attns = valid_model(src, tgt, src_lengths)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/models/model.py", line 47, in forward
    memory_lengths=lengths)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/decoders/transformer.py", line 226, in forward
    step=step)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/decoders/transformer.py", line 95, in forward
    attn_type="context")
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/modules/multi_headed_attn.py", line 202, in forward
    attn = self.softmax(scores).to(query.dtype)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/activation.py", line 982, in forward
    return F.softmax(input, self.dim, _stacklevel=5)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1230, in softmax
    ret = input.softmax(dim)
RuntimeError: CUDA out of memory. Tried to allocate 6.20 GiB (GPU 2; 31.72 GiB total capacity; 16.38 GiB already allocated; 2.92 GiB free; 10.34 GiB cached)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)

  File "/opt/conda/lib/python3.6/site-packages/onmt/decoders/transformer.py", line 226, in forward
    step=step)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/decoders/transformer.py", line 95, in forward
    attn_type="context")
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)

Edit2: I’m not sure what was going on with that second error, but now I’m consistently getting the one about the tensor sizes not matching. I’m going to try running without any validation and see how that goes.

guillaumekln · November 7, 2019, 8:23pm

Most likely you have a very long sentence in your validation set. The position encoding module does not support sequences longer than 5000 tokens.
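
That would fit the numbers in the error: slicing a 5000-row table with `[:6376]` still returns only 5000 rows, so adding it to a 6376-step embedding fails on dimension 0. One way to check is to scan the tokenized files for over-long lines. The following is a minimal sketch, assuming the files are whitespace-separated tokens (as sentencepiece output usually is) and using the 5000 limit mentioned above. `find_long_lines` and `check_file` are hypothetical helpers, not part of OpenNMT-py.

```python
from typing import Iterable, List, Tuple

# Limit quoted in the post above; treated as an assumption in this sketch.
MAX_POSITIONS = 5000


def find_long_lines(
    lines: Iterable[str], limit: int = MAX_POSITIONS
) -> List[Tuple[int, int]]:
    """Return (line_number, token_count) for each line with more than `limit` tokens."""
    too_long = []
    for number, line in enumerate(lines, start=1):
        count = len(line.split())  # whitespace tokenization
        if count > limit:
            too_long.append((number, count))
    return too_long


def check_file(path: str, limit: int = MAX_POSITIONS) -> List[Tuple[int, int]]:
    """Hypothetical convenience wrapper: scan one tokenized text file."""
    with open(path, encoding="utf-8") as handle:
        return find_long_lines(handle, limit)
```

For example, `check_file("full/src-validation.txt.token")` would list any offending lines in the validation source file from the preprocessing command above.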

BaruchG · November 7, 2019, 8:39pm

That’s got to be it. I noticed that one of the datasets contained some multi-sentence lines, longer than any I had used before, which I just left in. Will setting the src_seq_length parameter in the preprocessing.py file take care of that?

vince62s · November 7, 2019, 8:44pm

Use this: https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/opts.py#L278

BaruchG · November 7, 2019, 9:03pm

So I just set it to true, and it will take care of everything else automatically, without my having to set proper values for sentence length?

vince62s · November 7, 2019, 9:19pm

If you don’t want the default values of 50, you still need to set them.
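
For anyone who prefers to trim the validation files by hand before preprocessing, length filtering of a parallel corpus can be sketched roughly as follows. This is an illustration only, not OpenNMT-py's implementation. The function name `filter_pairs`, the whitespace tokenization, and the choice to drop empty lines are all assumptions.

```python
from typing import Iterable, Iterator, Tuple


def filter_pairs(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    max_src_len: int,
    max_tgt_len: int,
) -> Iterator[Tuple[str, str]]:
    """Yield (src, tgt) pairs where both sides are non-empty and within the limits."""
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.rstrip("\n"), tgt.rstrip("\n")
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Keep the pair only if both sides pass, so the files stay aligned.
        if 0 < n_src <= max_src_len and 0 < n_tgt <= max_tgt_len:
            yield src, tgt
```

Filtering both sides together matters: dropping a long line from only one file would shift every later sentence pair out of alignment.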

BaruchG · November 7, 2019, 9:34pm

Sorry, I’m probably misunderstanding, but that link points at the --filter_valid flag, which according to the documentation is a boolean. Did you mean to point at --src_seq_length, which has a default length of 50? Or did you mean that I could set both flags, but if I set --filter_valid to true without setting --src_seq_length, it will default to a length of 50? Thanks