Hi there,
I am trying to train an Image2Text v1.0.0.rc1 model on my own dataset of size 10M. With a shard size of 500, training fails with “RuntimeError: CUDA error: an illegal memory access was encountered”. I reduced the shard size to 100 and tried again, but hit the same error. After some searching, I found that torch.cuda.empty_cache() could be used. Is there anything else I should take care of, and what could be the plausible cause of this error? The full traceback is:
Traceback (most recent call last):
  File "/opt/automates_venv/bin/onmt_train", line 33, in <module>
    sys.exit(load_entry_point('OpenNMT-py==1.2.0', 'console_scripts', 'onmt_train')())
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/bin/train.py", line 197, in main
    train(opt)
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/bin/train.py", line 95, in train
    single_main(opt, 0)
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/train_single.py", line 145, in main
    trainer.train(
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/trainer.py", line 279, in train
    valid_stats = self.validate(
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/trainer.py", line 339, in validate
    outputs, attns = valid_model(src, tgt, src_lengths,
  File "/opt/automates_venv/lib/python3.8/site-packages/torch-1.8.1-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/models/model.py", line 45, in forward
    enc_state, memory_bank, lengths = self.encoder(src, lengths)
  File "/opt/automates_venv/lib/python3.8/site-packages/torch-1.8.1-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/encoders/image_encoder.py", line 98, in forward
    src = F.relu(self.layer4(src), True)
  File "/opt/automates_venv/lib/python3.8/site-packages/torch-1.8.1-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/automates_venv/lib/python3.8/site-packages/torch-1.8.1-py3.8-linux-x86_64.egg/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/automates_venv/lib/python3.8/site-packages/torch-1.8.1-py3.8-linux-x86_64.egg/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: an illegal memory access was encountered
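
One debugging step that may help narrow this down: CUDA kernels are launched asynchronously, so the Python line shown in a traceback like the one above is not necessarily the operation that actually failed. Setting the CUDA_LAUNCH_BLOCKING=1 environment variable forces synchronous launches, so the error is reported at the failing call. Below is a minimal sketch of rerunning training that way. The helper names blocking_cuda_env and rerun_training are hypothetical (not part of OpenNMT-py), and the sketch assumes onmt_train is on the PATH and is invoked with the same arguments as the failing run.

```python
import os
import subprocess


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    Hypothetical helper for illustration; it only sets CUDA_LAUNCH_BLOCKING=1.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def rerun_training(train_args):
    """Rerun onmt_train with the debugging environment.

    train_args is the list of command-line arguments used in the failing run
    (assumption: onmt_train is the entry point, as in the traceback above).
    Returns the process exit code.
    """
    completed = subprocess.run(
        ["onmt_train", *train_args],
        env=blocking_cuda_env(),
        check=False,  # do not raise; the point is to read the new traceback
    )
    return completed.returncode


# Example (not executed here): rerun_training(original_args), where
# original_args holds the arguments of the failing onmt_train command.
```

Training will run more slowly with synchronous launches, so this is only meant for reproducing the crash. Note that torch.cuda.empty_cache() only releases cached, unused GPU memory, so it is unlikely to resolve an illegal memory access on its own.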