OpenNMT Forum

Translation crash on GPU after thousands of successful translations

I’m translating a large input on GPU, one job per GPU, with a Transformer model trained on WMT14 en-de.
After e.g. 5,000 or, on another file, 100,000 successful translations, I get this lovely error and crash:

tail logs/w14ende_2s1.01.log
    emb = self.embeddings(src)
  File "/home/paul/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/paul/.local/lib/python3.5/site-packages/onmt/modules/embeddings.py", line 273, in forward
    source = module(source, step=step)
  File "/home/paul/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/paul/.local/lib/python3.5/site-packages/onmt/modules/embeddings.py", line 50, in forward
    emb = emb + self.pe[:emb.size(0)]
RuntimeError: The size of tensor a (11450) must match the size of tensor b (5000) at non-singleton dimension 0
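If I read the traceback right (this is my reconstruction of the shape arithmetic, not the actual onmt code), the positional-encoding table `self.pe` is precomputed for at most some `max_len` positions, which the error suggests is 5000 here. The slice `self.pe[:emb.size(0)]` can therefore never return more than 5000 rows, so a source of 11450 tokens fails to broadcast. A minimal NumPy sketch of that failure mode:

```python
import numpy as np

def add_positional_encoding(emb, max_len=5000):
    """Sketch of a PositionalEncoding forward pass (my reconstruction,
    not the real onmt code): add a precomputed table of at most
    max_len position vectors to (seq_len, batch, d_model) embeddings."""
    d_model = emb.shape[-1]
    pe = np.zeros((max_len, 1, d_model))  # stand-in for the sin/cos table
    # This slice is capped at max_len rows, so a source longer than
    # max_len tokens fails to broadcast -- the same "(11450) must
    # match (5000)" mismatch as the RuntimeError in the log.
    return emb + pe[:emb.shape[0]]

add_positional_encoding(np.zeros((4999, 1, 512)))   # fine: 4999 <= 5000
try:
    add_positional_encoding(np.zeros((11450, 1, 512)))
except ValueError as e:
    print("same shape mismatch as in the log:", e)
```

So the 11450 would be the token length of whatever sequence reached the encoder at that point, not GPU memory.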

What’s going on?
It fails on input line 5371, which seems fine:

5370 ▁Play ▁at ▁Cra ps . com ▁and ▁experience ▁a ▁whole ▁new ▁world ▁of ▁exciting ▁online ▁gaming , ▁with ▁revolutionary ▁and ▁striking ▁graphics , ▁like ▁no ▁other ▁casino .
5371 ▁This ▁sexy ▁online ▁casino ▁features ▁over ▁100 ▁cutting ▁edge ▁games , ▁including ▁19 ▁Progressive ▁Games ▁and ▁11 ▁Bonus ▁Games .
5372 ▁FAQ : ▁What ▁if ▁my ▁internet ▁goes ▁down ▁while ▁I ▁am ▁playing ▁at ▁Europa ▁Casino ?
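For what it’s worth, a quick way to rule out an over-long source line somewhere in the file (the helper and the 5000 cutoff below are my assumptions, the limit taken from the size reported in the error):

```python
def overlong_lines(path, limit=5000):
    """Hypothetical helper: report (line_number, token_count) for
    source lines whose whitespace-token count exceeds limit."""
    hits = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            n = len(line.split())
            if n > limit:
                hits.append((i, n))
    return hits

# e.g. overlong_lines("w14ende_2s1.01.src")  # file name is a guess
```

If no single line is anywhere near 5000 tokens, the 11450 presumably comes from somewhere other than one raw input line.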

Maybe it’s too bashful…


GPU #1 is the one that will fail at line 5371:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.64       Driver Version: 430.64       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000A30:00:00.0 Off |                    0 |
| N/A   32C    P0    41W / 250W |  12836MiB / 16160MiB |     27%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  Off  | 00009C41:00:00.0 Off |                    0 |
| N/A   32C    P0    62W / 250W |   5342MiB / 16160MiB |     40%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  Off  | 0000C2D2:00:00.0 Off |                    0 |
| N/A   34C    P0    92W / 250W |   5854MiB / 16160MiB |     57%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  Off  | 0000E7CE:00:00.0 Off |                    0 |
| N/A   33C    P0    84W / 250W |  12526MiB / 16160MiB |     43%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     91829      C   /usr/bin/python3                           12825MiB |
|    1     95067      C   /usr/bin/python3                            5331MiB |
|    2     93568      C   /usr/bin/python3                            5843MiB |
|    3     93933      C   /usr/bin/python3                           12515MiB |
+-----------------------------------------------------------------------------+

Just before the (reproducible) crash, GPU #1’s memory use goes up slightly, but not as high as GPUs #0 or #3, which don’t crash on the data segments I gave them:

| 1 95544 C /usr/bin/python3 7287MiB |
