OpenNMT Forum

Several errors after hours of training - OpenNMT-py

Newbie here, trying to set up the default OpenNMT-py EN-DE model on my PC.
I fixed several errors/problems on my own with some research and finally got a run going on my GPU (which was supposed to finish tomorrow by ~11:00 AM).

However, a few minutes ago I got several errors that I can’t seem to fix:

[2020-02-02 23:58:11,951 INFO] number of examples: 403
        Traceback (most recent call last):
          File "C:\Users\?\OpenNMT-py\onmt\trainer.py", line 377, in _gradient_accumulation
            trunc_size=trunc_size)
          File "C:\Users\?\OpenNMT-py\onmt\utils\loss.py", line 165, in __call__
            for shard in shards(shard_state, shard_size):
          File "C:\Users\?\OpenNMT-py\onmt\utils\loss.py", line 381, in shards
            torch.autograd.backward(inputs, grads)
          File "C:\Users\?\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\autograd\__init__.py", line 99, in backward
            allow_unreachable=True)  # allow_unreachable flag
        RuntimeError: CUDA error: unspecified launch failure
        [2020-02-02 23:58:14,929 INFO] At step 32541, we removed a batch - accum 0
        Traceback (most recent call last):
          File "train.py", line 6, in <module>
            main()
          File "C:\Users\?\OpenNMT-py\onmt\bin\train.py", line 204, in main
            train(opt)
          File "C:\Users\?\OpenNMT-py\onmt\bin\train.py", line 88, in train
            single_main(opt, 0)
          File "C:\Users\?\OpenNMT-py\onmt\train_single.py", line 143, in main
            valid_steps=opt.valid_steps)
          File "C:\Users\?\OpenNMT-py\onmt\trainer.py", line 244, in train
            report_stats)
          File "C:\Users\?\OpenNMT-py\onmt\trainer.py", line 399, in _gradient_accumulation
            self.optim.step()
          File "C:\Users\?\OpenNMT-py\onmt\utils\optimizers.py", line 359, in step
            clip_grad_norm_(group['params'], self._max_grad_norm)
          File "C:\Users\?\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\nn\utils\clip_grad.py", line 32, in clip_grad_norm_
            param_norm = p.grad.data.norm(norm_type)
          File "C:\Users\?\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\tensor.py", line 339, in norm
            return torch.norm(self, p, dim, keepdim, dtype=dtype)
          File "C:\Users\?\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\functional.py", line 747, in norm
            return torch._C._VariableFunctions.norm(input, p)
        RuntimeError: CUDA error: unspecified launch failure
         
            C:\Users\?\OpenNMT-py>.git
            '.git' is not recognized as an internal or external command,
            operable program or batch file.

I trained the model after deleting some data from the files, because they were too large and I just wanted to test it with fewer sentences (to see if I could get it to work, and later on use other data with an EN-ES language pair). I trained it using:

onmt_train -data data/demo -save_model demo-model

I then ran the translation step and it worked. Obviously the results are really bad, but it did work.
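For reference, the kind of trimming I mean (cutting the data files down to fewer sentences) can be sketched roughly as below. This is only an illustration, not the exact steps I ran: the function name `trim_parallel` and the file names in the usage comment are made up. The point is that the source and target files are cut to the same number of lines, so sentence pairs stay aligned.

```python
from itertools import islice


def trim_parallel(src_in, tgt_in, src_out, tgt_out, max_lines):
    """Copy at most max_lines aligned sentence pairs into new files.

    Hypothetical helper for illustration. zip() stops at the shorter
    input, so both outputs always end up with the same number of lines.
    Returns the number of pairs written.
    """
    kept = 0
    with open(src_in, encoding="utf-8") as fs, \
         open(tgt_in, encoding="utf-8") as ft, \
         open(src_out, "w", encoding="utf-8") as out_s, \
         open(tgt_out, "w", encoding="utf-8") as out_t:
        for src_line, tgt_line in islice(zip(fs, ft), max_lines):
            out_s.write(src_line)
            out_t.write(tgt_line)
            kept += 1
    return kept


# Example usage (file names are placeholders, not the real demo files):
# trim_parallel("train.en", "train.de", "train-small.en", "train-small.de", 10000)
```

The trimmed files would then have to go through preprocessing again before training, since the training command reads the preprocessed data rather than the raw text.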

Can anyone help? Thanks in advance. :smiley: