PyTorch version: 1.6.0+cu101
CUDA version: 10.1
NVIDIA driver: 418.56
Ubuntu 18.04
Here is my training configuration:
-data ~~
-save_model ~~
-layers 6
-rnn_size 512
-word_vec_size 512
-transformer_ff 2048
-heads 8
-encoder_type transformer
-decoder_type transformer
-position_encoding
-train_steps 100000
-max_generator_batches 2
-dropout 0.1
-batch_size 1024
-batch_type tokens
-normalization tokens
-accum_count 2
-optim adam
-adam_beta2 0.998
-decay_method noam
-warmup_steps 8000
-learning_rate 2
-max_grad_norm 0
-param_init 0
-param_init_glorot
-label_smoothing 0.1
-valid_steps 1000
-save_checkpoint_steps 1000
-log_file data/${domain}/log/trn.$(date +%y%m%d_%H%M%S).log
-log_file_level INFO
-exp data/${domain}/exp.txt
-early_stopping 6
-early_stopping_criteria ppl
-tensorboard
-tensorboard_log_dir runs/onmt
-world_size 4
-gpu_ranks 0 1 2 3
The point is a GPU problem.
With -world_size 1 and -gpu_ranks 0 the train command works, i.e. training only works on a single GPU (GPU 0).
However, with -world_size 4 and -gpu_ranks 0 1 2 3 (multi-GPU) it does not work, and I get the following error message:
File "/data/home/asr/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 605, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
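The failing line in the traceback is the generic t.to(device) call made while the model is moved onto a GPU. To see whether each device can be claimed from its own process (roughly how multi-GPU training is launched, one process per rank), I can run a small probe like the sketch below. This is just my own test script, not OpenNMT code, and it assumes the four devices cuda:0 to cuda:3 from -gpu_ranks.

import torch
import torch.multiprocessing as mp

def probe(rank):
    # Mimic what each training process does: claim its own GPU and move a tensor onto it.
    device = torch.device(f"cuda:{rank}")
    x = torch.ones(1).to(device)  # same kind of .to(device) call as in the traceback
    print(f"rank {rank}: OK on {device}, value = {x.item()}")

if __name__ == "__main__":
    world_size = 4  # matches -world_size 4 / -gpu_ranks 0 1 2 3
    mp.spawn(probe, nprocs=world_size)

If only rank 0 succeeded here, that would suggest the problem is on the driver/device side rather than in the training configuration.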
I also saw that all GPUs are available to torch using simple code:

import torch
torch.cuda.is_available()  # => True
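Going one step beyond is_available(), a per-device check in the same process (again assuming 4 visible GPUs) would look like this sketch:

import torch

print("device_count:", torch.cuda.device_count())  # expect 4
for i in range(torch.cuda.device_count()):
    # Allocating a tensor on each device triggers the same error if that device is unavailable.
    x = torch.ones(1, device=f"cuda:{i}")
    print(i, torch.cuda.get_device_name(i), x.item())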
Could I get some hints about this problem?