My model crashes when it runs validation during training. I get the following error (I can paste more of the log from above this point if it helps, but this is all that seems relevant):
[2019-11-06 20:28:37,784 INFO] Step 10000/200000; acc: 55.44; ppl: 8.81; xent: 2.18; lr: 0.00088; 55214/63742 tok/s; 4501 sec
[2019-11-06 20:28:37,786 INFO] Loading dataset from fa2en/fa2en.valid.0.pt
[2019-11-06 20:28:37,896 INFO] number of examples: 5000
Traceback (most recent call last):
  File "/opt/conda/bin/onmt_train", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 200, in main
    train(opt)
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 82, in train
    p.join()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 184, in signal_handler
    raise Exception(msg)
Exception:
-- Tracebacks above this line can probably be ignored --
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 142, in run
    single_main(opt, device_id, batch_queue, semaphore)
  File "/opt/conda/lib/python3.6/site-packages/onmt/train_single.py", line 143, in main
    valid_steps=opt.valid_steps)
  File "/opt/conda/lib/python3.6/site-packages/onmt/trainer.py", line 258, in train
    valid_iter, moving_average=self.moving_average)
  File "/opt/conda/lib/python3.6/site-packages/onmt/trainer.py", line 314, in validate
    outputs, attns = valid_model(src, tgt, src_lengths)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/models/model.py", line 42, in forward
    enc_state, memory_bank, lengths = self.encoder(src, lengths)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/encoders/transformer.py", line 121, in forward
    emb = self.embeddings(src)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/modules/embeddings.py", line 273, in forward
    source = module(source, step=step)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/modules/embeddings.py", line 50, in forward
    emb = emb + self.pe[:emb.size(0)]
RuntimeError: The size of tensor a (6376) must match the size of tensor b (5000) at non-singleton dimension 0
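From the last frame, it looks like a fixed positional-encoding table is being added to the embeddings and the sequence is longer than the table. Here is a minimal sketch of that pattern as I understand it; the class/variable names and the max_len=5000 default are my own assumptions, and the only line taken from the traceback is the emb + self.pe[:emb.size(0)] addition:

import math
import torch

# Minimal sketch of a sinusoidal positional-encoding buffer (my assumptions,
# not the actual OpenNMT-py source). If a sequence is longer than max_len,
# the addition in forward() fails with the size-mismatch error above.
class PositionalEncodingSketch(torch.nn.Module):
    def __init__(self, dim, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, dim)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float)
                             * -(math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(1))  # shape: (max_len, 1, dim)

    def forward(self, emb):
        # emb: (seq_len, batch, dim). When seq_len > max_len, the broadcast
        # below raises "The size of tensor a ... must match ... tensor b".
        return emb + self.pe[:emb.size(0)]

enc = PositionalEncodingSketch(dim=512)
enc(torch.zeros(100, 2, 512))    # fine: 100 <= 5000
enc(torch.zeros(6376, 2, 512))   # RuntimeError: sizes 6376 vs 5000 at dim 0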
Both of my validation files are 5000 lines long (I double-checked after I got this error), and I assume that is what "tensor b" refers to. I'm using version 1.0.0rc2 of OpenNMT-py, and this is the first time I've used the pip-installed version; a git-cloned version has worked perfectly for me in the past. I used SentencePiece for tokenization/encoding.
The command that I used for preprocessing is:
onmt_preprocess -train_src full/src-train.txt.token -train_tgt full/tgt-train.txt.token -valid_src full/src-validation.txt.token -valid_tgt full/tgt-validation.txt.token -save_data fa2en/fa2en
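For reference, a quick check along these lines (plain Python, with the paths from the preprocessing command above) is how I'd verify both the line counts and the longest tokenized sentence in the validation files; the longest-sentence count is just something I'd add while debugging:

# Sanity check on the tokenized validation files.
for path in ["full/src-validation.txt.token", "full/tgt-validation.txt.token"]:
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    longest = max(len(line.split()) for line in lines)
    print(f"{path}: {len(lines)} lines, longest sentence = {longest} tokens")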
The command that I used for training is:
export CUDA_VISIBLE_DEVICES=0,1,2,3
onmt_train -data fa2en/fa2en -save_model models/fa-en \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot \
    -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 \
    -world_size 4 -gpu_ranks 0 1 2 3 2>> output.txt
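As an aside, this is my own back-of-the-envelope reading of the effective batch size with these flags (I may be misreading how accumulation and multi-GPU interact, so treat it as an assumption):

# Rough tokens per optimizer update, ASSUMING each of the 4 GPUs builds its
# own token batch and gradients are accumulated accum_count times.
batch_size = 4096    # -batch_size with -batch_type tokens
accum_count = 2      # -accum_count
world_size = 4       # -world_size / number of GPUs
print(batch_size * accum_count * world_size)  # 32768 tokens per update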
Thanks for your help!
EDIT: I just tried running again, and now I'm getting a memory issue. I've run this setup many times in the past without problems, and it has four 32 GB GPUs available. Here's the error:
[2019-11-07 18:10:08,103 INFO] Loading dataset from fa2en3/fa2en.train.3.pt
[2019-11-07 18:10:14,380 INFO] Step 950/200000; acc: 31.78; ppl: 48.07; xent: 3.87; lr: 0.00012; 52493/67860 tok/s; 443 sec
[2019-11-07 18:10:24,312 INFO] number of examples: 999636
[2019-11-07 18:10:35,701 INFO] Step 1000/200000; acc: 33.13; ppl: 42.99; xent: 3.76; lr: 0.00012; 51367/68638 tok/s; 464 sec
[2019-11-07 18:10:35,704 INFO] Loading dataset from fa2en3/fa2en.valid.0.pt
[2019-11-07 18:10:35,803 INFO] number of examples: 5000
Traceback (most recent call last):
  File "/opt/conda/bin/onmt_train", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 200, in main
    train(opt)
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 82, in train
    p.join()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 184, in signal_handler
    raise Exception(msg)
Exception:
-- Tracebacks above this line can probably be ignored --
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/onmt/bin/train.py", line 142, in run
    single_main(opt, device_id, batch_queue, semaphore)
  File "/opt/conda/lib/python3.6/site-packages/onmt/train_single.py", line 143, in main
    valid_steps=opt.valid_steps)
  File "/opt/conda/lib/python3.6/site-packages/onmt/trainer.py", line 258, in train
    valid_iter, moving_average=self.moving_average)
  File "/opt/conda/lib/python3.6/site-packages/onmt/trainer.py", line 314, in validate
    outputs, attns = valid_model(src, tgt, src_lengths)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/models/model.py", line 47, in forward
    memory_lengths=lengths)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/decoders/transformer.py", line 226, in forward
    step=step)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/decoders/transformer.py", line 95, in forward
    attn_type="context")
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/onmt/modules/multi_headed_attn.py", line 202, in forward
    attn = self.softmax(scores).to(query.dtype)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/activation.py", line 982, in forward
    return F.softmax(input, self.dim, _stacklevel=5)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py", line 1230, in softmax
    ret = input.softmax(dim)
RuntimeError: CUDA out of memory. Tried to allocate 6.20 GiB (GPU 2; 31.72 GiB total capacity; 16.38 GiB already allocated; 2.92 GiB free; 10.34 GiB cached)
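In case it's useful for the memory issue, this is roughly how I plan to log per-GPU memory right before the validation step (just standard torch.cuda calls, nothing OpenNMT-specific):

import torch

# Quick per-GPU memory report; not part of my training command, just a
# debugging helper I intend to run alongside it.
for i in range(torch.cuda.device_count()):
    total = torch.cuda.get_device_properties(i).total_memory
    allocated = torch.cuda.memory_allocated(i)
    cached = torch.cuda.memory_cached(i)  # renamed to memory_reserved in newer torch
    print(f"GPU {i}: {allocated / 1e9:.2f} GB allocated, "
          f"{cached / 1e9:.2f} GB cached, {total / 1e9:.2f} GB total")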
EDIT 2: I'm not sure what was going on with that second error, but now I'm consistently getting the one about the tensor sizes not matching. I'm going to try running without any validation and see how that goes.