THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6777/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/dev8/torch/install/bin/luajit: ./onmt/translate/Beam.lua:128: cuda runtime error (2) : out of memory at
/tmp/luarocks_cutorch-scm-1-6777/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'gather'
./onmt/translate/Beam.lua:128: in function 'func'
./onmt/utils/Tensor.lua:12: in function 'recursiveApply'
./onmt/utils/Tensor.lua:7: in function 'selectBeam'
./onmt/translate/Beam.lua:407: in function '_nextState'
./onmt/translate/Beam.lua:395: in function '_nextBeam'
./onmt/translate/BeamSearcher.lua:116: in function '_findKBest'
./onmt/translate/BeamSearcher.lua:68: in function 'search'
./onmt/translate/Translator.lua:251: in function 'translateBatch'
./onmt/translate/Translator.lua:337: in function 'translate'
translate.lua:101: in function 'main'
translate.lua:182: in main chunk
[C]: in function 'dofile'
...dev8/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
I never got this error before. The only real difference with many previous experiments is the language pair, ZH to EN. This model is a 2x1000 with embedding size 200. It was a fast 5 epochs built over 7M MultiUN sentences. It's a 64 GB RAM machine with a GTX 1080 Ti 11 GB. I don't see a reason why this out-of-memory error occurs.
The error seems to be in the beam code. I can live with a lower quality result, so I'm now trying with beam_size=1 (no special parameter was used when the error occurred). Of course, it's faster. Hope this will run till the end…
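For reference, roughly the command I'm running now, with only the beam size changed from the defaults (model and file names are just placeholders here, and -gpuid depends on your setup):

    th translate.lua -model zhen_2x1000_epoch5.t7 \
        -src test.zh.tok -output pred.en.txt \
        -gpuid 1 -beam_size 1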
Hi Etienne, this is the first time I see that too. Which version are you running? Would it be possible to put the model and a minimal test set to reproduce it somewhere?
Thanks!
Hi Etienne,
What was the rationale for choosing 2x1000? I'm interested in correlating different model configurations with translation quality.
Terence
As soon as my GPU is free again, I will try to reproduce this on an isolated sentence.
Hi Terence,
When I started with ONMT a few months ago, I did some tests with different kinds of configurations. Two layers with 1000 cells gave me good results on 2 or 3 M sentences, and I have been using this configuration ever since. But it is not a strongly motivated choice. Now, with a bit more experience, I should test some other setups again. I hope I will be able to get more GPU time in the near future…
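In terms of train.lua options, that configuration corresponds to something like this (the data and save paths below are placeholders, and the exact flag names should be checked against your ONMT version):

    th train.lua -data zhen-train-train.t7 -save_model zhen_2x1000 \
        -layers 2 -rnn_size 1000 -word_vec_size 200 -gpuid 1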
The sentence at line 33 in your file is gigantic though. We don’t enforce a limit on the source side as it would be very arbitrary, but 5000+ words is certainly too much.
It's from the MultiUN data set. I already filtered it with several rules. I should also add a constraint on sentence length…
PS: of course, the long line could have been filtered out. But crashing is not good behaviour for the ONMT translator. The whole file was translated properly with beam_size=1. Suggestion: have a length threshold, like 200 words, and for each line longer than this threshold fall back to beam_size=1 to translate safely. A minimal sketch of the idea is below.
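As a stopgap outside the translator, the same idea can be done with a small plain Lua script (the 200-word threshold and the file names are arbitrary): split the source file by length, then translate short.txt with the normal beam and long.txt with -beam_size 1.

    -- split_by_length.lua: write lines with <= 200 tokens to short.txt,
    -- longer ones to long.txt, so each part can be translated with a
    -- different beam size. Usage: th split_by_length.lua test.zh.tok
    local threshold = 200                     -- arbitrary word-count threshold
    local short = io.open('short.txt', 'w')
    local long  = io.open('long.txt', 'w')
    for line in io.lines(arg[1]) do           -- arg[1] = tokenized source file
      local count = 0
      for _ in line:gmatch('%S+') do count = count + 1 end
      local out = (count <= threshold) and short or long
      out:write(line, '\n')
    end
    short:close()
    long:close()

(The lines end up in two files, so the outputs would need to be merged back in the original order afterwards, but it illustrates the idea; doing the fallback inside the translator itself would avoid that.)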