We recently switched from OpenNMT 0.3 to OpenNMT 0.7. Initial tests were very promising, meaning that everything ran smoothly. Those initial tests, of course, were small models that could be built quite quickly. However, when we started building models with bigger datasets there were problems: training would crash with 'out-of-memory' errors.
The machine we used for the builds that failed has a K520 GPU with 4 GB of GPU RAM. I understand that this may be a small amount for NMT, but my major concern is that before switching to OpenNMT 0.7 (that is, still with 0.3) we were able to build models that later failed.
The batch size is the default; the maximum sentence length is 175; Adam is the optimizer; no intermediate (checkpoint) models are saved, only one after each epoch finishes.
Does anyone have any idea what the problem is?
Thanks in advance.
Do you have the caching CUDA memory allocator enabled?
That is, do you see this warning when starting your training?
[06/20/17 16:26:01 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0
Switching this off can help you optimize your memory usage. It can make a big difference.
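For instance, a minimal sketch of how to disable it, using the environment variable named in the warning message above (the `th train.lua` invocation and its arguments are illustrative, not your exact command):

```shell
# Disable the caching CUDA memory allocator for this shell session.
# Trades some speed for lower GPU memory usage.
export THC_CACHING_ALLOCATOR=0

# Then launch training as usual, e.g.:
# th train.lua -data demo-train.t7 -save_model demo-model -gpuid 1
```

You can also prefix a single run with `THC_CACHING_ALLOCATOR=0` instead of exporting it, so other trainings keep the faster allocator.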
I have also noticed that OpenNMT needs more memory when using Adam than when using SGD (although I don't know why…). Have you tried changing the optimizer algorithm?
It sometimes worked for me.
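As a sketch, switching the optimizer is done with the `-optim` flag of `train.lua`; the data and model paths below are placeholders, and the learning rate value is only an example (SGD typically needs a much larger one than Adam):

```shell
# Train with plain SGD instead of Adam to reduce optimizer state in GPU memory.
th train.lua -data demo-train.t7 -save_model demo-model \
    -optim sgd -learning_rate 1 -gpuid 1
```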
Just for your information, our trainings on big datasets (>2,000,000 sentences) use ~2.3 GB of GPU RAM with the default parameter values.
when you say:
what do you mean? Do they fail when translating, or when continuing a training?
Sometimes just updating your Torch installation solves everything (in particular, update nn, cutorch, dpnn, cudnn, …).
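A sketch of how that update usually looks with luarocks (Torch's package manager); the exact package list depends on what your setup uses:

```shell
# Reinstall the Torch packages mentioned above at their latest versions.
luarocks install nn
luarocks install cutorch
luarocks install dpnn
luarocks install cudnn
```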
I know this is not a specific solution, but I hope it helps.
Thanks for the reply. I was considering updating nn, torch, etc.
we were able to build models that later failed.
I meant that with 0.3 we were able to build models with the same data that fail to build with 0.7 (due to memory).
Thanks once again. I will also check the caching allocator.
Hi @dimitarsh1, normally memory usage should be about the same between 0.3 and 0.7 if you have not updated Torch in between. I will run a benchmark to see if I can reproduce. As a workaround, I think you can safely reduce the maximum sequence length for training: it probably has a marginal impact on training quality, while long sequences greatly increase memory usage.
I will reduce the maximum sequence length further (initially, with 0.3, it was 200, then I reduced it to 175) and will try again.
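For reference, a sketch of where that limit is set: in OpenNMT the sequence-length cut-off is applied at preprocessing time via the (assumed) `-src_seq_length`/`-tgt_seq_length` flags of `preprocess.lua`; file names below are placeholders:

```shell
# Sentences longer than the limits are dropped from the training data,
# which caps per-batch GPU memory during training.
th preprocess.lua \
    -train_src src-train.txt -train_tgt tgt-train.txt \
    -valid_src src-val.txt -valid_tgt tgt-val.txt \
    -src_seq_length 150 -tgt_seq_length 150 \
    -save_data demo
```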