OpenNMT Forum

OpenNMT-py RuntimeError: CUDA error: unknown error

When I train a Transformer with OpenNMT-py (PyTorch), the following error occurs:

error

Process SpawnProcess-9:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/data/home/chanjun_park/OpenNMT-py/train.py", line 127, in batch_producer
    q.put(b, False)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 83, in put
    raise Full
queue.Full
Traceback (most recent call last):
  File "…/…/train.py", line 196, in <module>
    main(opt)
  File "…/…/train.py", line 78, in main
    p.join()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "…/…/train.py", line 180, in signal_handler
    raise Exception(msg)
Exception:

-- Tracebacks above this line can probably
   be ignored --

Traceback (most recent call last):
  File "/data/home/chanjun_park/OpenNMT-py/train.py", line 138, in run
    single_main(opt, device_id, batch_queue, semaphore)
  File "/data/home/chanjun_park/OpenNMT-py/onmt/train_single.py", line 139, in main
    valid_steps=opt.valid_steps)
  File "/data/home/chanjun_park/OpenNMT-py/onmt/trainer.py", line 224, in train
    self._accum_batches(train_iter)):
  File "/data/home/chanjun_park/OpenNMT-py/onmt/trainer.py", line 162, in _accum_batches
    for batch in iterator:
  File "/data/home/chanjun_park/OpenNMT-py/onmt/train_single.py", line 116, in _train_iter
    batch = batch_queue.get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/data/home/chanjun_park/.local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 109, in rebuild_cuda_tensor
    event_sync_required)
RuntimeError: CUDA error: unknown error

Does anyone have a solution?

Please provide your full config and post your command line.

command

nohup python3 …/…/train.py -data ./data -save_model ./model/model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 500000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -world_size 8 -gpu_rank 0 1 2 3 4 5 6 7 -log_file ./log &

I set the GPU compute mode as follows:

sudo nvidia-smi -c 0

Can you try setting batch_size to 2048, just to make sure it is not a memory issue?
Are you using CUDA 10, NCCL 2.4.2, and PyTorch 1.1?

I changed my batch_size to 2048, but the same error occurs.

pytorch => 1.1.0
cuda => 10.1

I think the PyTorch binary is compiled with CUDA 10, not 10.1.
What did you do? Did you compile PyTorch yourself?
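
A quick way to see which CUDA version the installed PyTorch binary was built against is to ask PyTorch itself. The snippet below is only a minimal sketch and is not part of OpenNMT-py; the helper name torch_build_info is made up for illustration. It reads torch.__version__, torch.version.cuda and torch.cuda.is_available(), tries to read the NCCL version, and returns None when PyTorch cannot be imported.

```python
def torch_build_info():
    """Return version details of the installed PyTorch build, or None if absent.

    Hypothetical helper for illustration; it only reads public attributes.
    """
    try:
        import torch
    except ImportError:
        return None

    info = {
        "torch": torch.__version__,
        # CUDA version the binary was compiled with (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    try:
        # The NCCL bindings may be missing on some builds, so failures are tolerated.
        from torch.cuda import nccl
        info["nccl"] = nccl.version()
    except Exception:
        info["nccl"] = None
    return info


if __name__ == "__main__":
    print(torch_build_info())
```

If built_with_cuda differs from the toolkit version installed on the machine, that is consistent with the mismatch suspected above, although it does not prove that the mismatch causes the crash.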

I did not compile PyTorch myself.

Also, when I set

nvidia-smi -c 3

this error occurs:

Process SpawnProcess-9:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/data/home/chanjun_park/OpenNMT-py/train.py", line 110, in batch_producer
    b = next_batch(0)
  File "/data/home/chanjun_park/OpenNMT-py/train.py", line 106, in next_batch
    new_batch = next(generator_to_serve)
  File "/data/home/chanjun_park/OpenNMT-py/onmt/inputters/inputter.py", line 728, in __iter__
    for batch in self._iter_dataset(path):
  File "/data/home/chanjun_park/OpenNMT-py/onmt/inputters/inputter.py", line 710, in _iter_dataset
    for batch in cur_iter:
  File "/data/home/chanjun_park/OpenNMT-py/onmt/inputters/inputter.py", line 601, in __iter__
    self.device)
  File "/data/home/chanjun_park/.local/lib/python3.6/site-packages/torchtext/data/batch.py", line 34, in __init__
    setattr(self, name, field.process(batch, device=device))
  File "/data/home/chanjun_park/OpenNMT-py/onmt/inputters/text_dataset.py", line 121, in process
    base_data = self.base_field.process(batch_by_feat[0], device=device)
  File "/data/home/chanjun_park/.local/lib/python3.6/site-packages/torchtext/data/field.py", line 237, in process
    tensor = self.numericalize(padded, device=device)
  File "/data/home/chanjun_park/.local/lib/python3.6/site-packages/torchtext/data/field.py", line 332, in numericalize
    lengths = torch.tensor(lengths, dtype=self.dtype, device=device)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable

Do not use -c 3, because the new architecture requires non-exclusive mode.
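
For reference, nvidia-smi compute mode 0 is DEFAULT (several processes may share a GPU) and 3 is EXCLUSIVE_PROCESS (one process per GPU). The traceback above shows the batch_producer process creating tensors on the GPU, separately from the training process, which would explain why exclusive mode fails here. The sketch below checks the current mode of each GPU. It assumes the compute_mode field of nvidia-smi --query-gpu; the function names and the sample output are hypothetical and only there for illustration.

```python
import subprocess


def non_default_gpus(csv_text):
    """Return (index, mode) pairs for GPUs whose compute mode is not Default.

    Expects lines such as "0, Default", as printed by nvidia-smi with
    --format=csv,noheader.
    """
    flagged = []
    for line in csv_text.strip().splitlines():
        index, mode = [part.strip() for part in line.split(",", 1)]
        if mode.lower() != "default":
            flagged.append((int(index), mode))
    return flagged


def query_compute_modes():
    """Run nvidia-smi and return its CSV output (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,compute_mode", "--format=csv,noheader"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.stdout


# Hypothetical sample output, not taken from the machine in this thread.
sample = "0, Default\n1, Exclusive_Process\n"
print(non_default_gpus(sample))  # [(1, 'Exclusive_Process')]
```

On a machine with GPUs, non_default_gpus(query_compute_modes()) should return an empty list once every GPU has been switched back with sudo nvidia-smi -c 0.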

But you need to install CUDA 10.0 instead of 10.1.
This is likely the cause of your error.

Okay, I will try it!

But when I ran the training last week, there were no problems.
Now the training dies at around step 500.

[2019-06-10 10:15:34,371 INFO] encoder: 35300352
[2019-06-10 10:15:34,371 INFO] decoder: 58029316
[2019-06-10 10:15:34,371 INFO] * number of parameters: 93329668
[2019-06-10 10:15:34,374 INFO] Starting training on GPU: [0, 1, 2, 3, 4, 5, 6, 7]
[2019-06-10 10:15:34,374 INFO] Start training loop and validate every 10000 steps…
[2019-06-10 10:15:36,343 INFO] number of examples: 997817
[2019-06-10 10:16:45,061 INFO] Step 50/500000; acc: 2.23; ppl: 7732.87; xent: 8.95; lr: 0.00001; 18639/21044 tok/s; 71 sec
[2019-06-10 10:17:28,881 INFO] Step 100/500000; acc: 8.63; ppl: 5661.52; xent: 8.64; lr: 0.00001; 30298/34007 tok/s; 115 sec
[2019-06-10 10:18:12,817 INFO] Step 150/500000; acc: 8.99; ppl: 3393.42; xent: 8.13; lr: 0.00002; 29779/33780 tok/s; 158 sec
[2019-06-10 10:18:57,171 INFO] Step 200/500000; acc: 8.53; ppl: 1548.03; xent: 7.34; lr: 0.00002; 30061/33553 tok/s; 203 sec
Process SpawnProcess-9:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/data/home/chanjun_park/OpenNMT-py/train.py", line 128, in batch_producer
    q.put(b, False)
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 83, in put
    raise Full
queue.Full
[2019-06-10 10:19:41,300 INFO] Step 250/500000; acc: 8.81; ppl: 684.52; xent: 6.53; lr: 0.00003; 29745/33710 tok/s; 247 sec
  File "/data/home/chanjun_park/OpenNMT-py/onmt/trainer.py", line 162, in _accum_batches
    for batch in iterator:
  File "/data/home/chanjun_park/OpenNMT-py/onmt/train_single.py", line 116, in _train_iter
    batch = batch_queue.get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/data/home/chanjun_park/.local/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 109, in rebuild_cuda_tensor
    event_sync_required)
RuntimeError: CUDA error: unknown error

Then do a git pull. You are most likely hitting the same error as in this issue: https://github.com/OpenNMT/OpenNMT-py/issues/1454
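
As background on the queue.Full lines in the logs above: the traceback shows batch_producer calling q.put(b, False), a non-blocking put. A non-blocking put raises queue.Full immediately when the queue has no free slot, and multiprocessing queues raise the same exception. The sketch below reproduces that behaviour with the standard library only. It uses a plain queue.Queue and made-up batch names for simplicity, and it makes no claim about how OpenNMT-py handles the case after the update.

```python
import queue

# A bounded queue with a single slot, standing in for the batch queue.
q = queue.Queue(maxsize=1)

q.put("batch-0", False)  # succeeds: the only slot is free

try:
    # Non-blocking put on a full queue raises queue.Full right away.
    q.put("batch-1", False)
    outcome = "queued"
except queue.Full:
    outcome = "full"

print(outcome)  # full

# A blocking put with a timeout waits for a consumer instead of failing at once.
q.get()                          # a consumer frees the slot
q.put("batch-1", True, 1.0)      # now succeeds within the timeout
print(q.qsize())  # 1
```

Whether a producer should block, retry, or drop a batch when the queue is full is a design choice; the point here is only that an unhandled queue.Full ends the producer process, as in the SpawnProcess-9 traceback.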

Thanks, I solved the problem.
Thanks a lot for your help!