RuntimeError: CUDA error: an illegal memory access was encountered

Hi there,

I am trying to train the Image2Text v1.0.0.rc1 model on my own dataset of size 10M. With a shard size of 500, training fails with "RuntimeError: CUDA error: an illegal memory access was encountered". I reduced the shard size to 100 and tried again, but I still get the same error. After some googling, I found that torch.cuda.empty_cache() might help. Could you tell me whether there is anything else I should take care of, or what the likely cause of this error is? The full traceback is:

Traceback (most recent call last):
  File "/opt/automates_venv/bin/onmt_train", line 33, in <module>
    sys.exit(load_entry_point('OpenNMT-py==1.2.0', 'console_scripts', 'onmt_train')())
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/bin/train.py", line 197, in main
    train(opt)
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/bin/train.py", line 95, in train
    single_main(opt, 0)
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/train_single.py", line 145, in main
    trainer.train(
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/trainer.py", line 279, in train
    valid_stats = self.validate(
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/trainer.py", line 339, in validate
    outputs, attns = valid_model(src, tgt, src_lengths,
  File "/opt/automates_venv/lib/python3.8/site-packages/torch-1.8.1-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/models/model.py", line 45, in forward
    enc_state, memory_bank, lengths = self.encoder(src, lengths)
  File "/opt/automates_venv/lib/python3.8/site-packages/torch-1.8.1-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/automates_venv/lib/python3.8/site-packages/OpenNMT_py-1.2.0-py3.8.egg/onmt/encoders/image_encoder.py", line 98, in forward
    src = F.relu(self.layer4(src), True)
  File "/opt/automates_venv/lib/python3.8/site-packages/torch-1.8.1-py3.8-linux-x86_64.egg/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/automates_venv/lib/python3.8/site-packages/torch-1.8.1-py3.8-linux-x86_64.egg/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/automates_venv/lib/python3.8/site-packages/torch-1.8.1-py3.8-linux-x86_64.egg/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: an illegal memory access was encountered
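
One thing that may help narrow this down: CUDA kernels are launched asynchronously, so the frame reported in the traceback above (F.conv2d) is not necessarily the operation that actually faulted. Below is a minimal sketch, assuming it is acceptable to rerun the job with the CUDA_LAUNCH_BLOCKING environment variable set to 1 so that errors surface at the call that caused them. The helper name cuda_debug_env is hypothetical and only for illustration; the onmt_train arguments are left out because they depend on the setup.

```python
import os


def cuda_debug_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    cuda_debug_env is a hypothetical helper, not part of OpenNMT-py or PyTorch.
    The original mapping is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes each kernel launch block until it finishes,
    # so a failing kernel is reported at its own call site (slower, debug only).
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Small demonstration on a toy environment.
base = {"PATH": "/usr/bin"}
env = cuda_debug_env(base)
print(env["CUDA_LAUNCH_BLOCKING"])

# To rerun training with this environment (arguments omitted on purpose):
# import subprocess
# subprocess.run(["onmt_train", ...], env=cuda_debug_env())
```

The same effect can be had by exporting CUDA_LAUNCH_BLOCKING=1 in the shell before calling onmt_train. This only makes the traceback more trustworthy; it does not fix the underlying fault.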