LuaJIT: not enough memory with long sequences

When training a model on Ubuntu 16.04 on x86 I get the following:
ari@DeepTinker01:~/OpenNMT$ th train.lua -data ari-data/5300-train.t7 -save_model cv/500x5 -layers 5 -gpuid 1 -report_every 5 -save_every 1000 -max_batch_size 12
Loading data from 'ari-data/5300-train.t7'...

  • vocabulary size: source = 43208; target = 40674
  • additional features: source = 0; target = 0
  • maximum sequence length: source = 417; target = 793
  • number of training sentences: 5348
  • maximum batch size: 12
    Building model…
  • using input feeding
    Initializing parameters…
  • number of parameters: 84088674
    Preparing memory optimization…
  • sharing 71% of output/gradInput tensors memory between clones
    Start training…

Epoch 1 ; Iteration 5/498 ; Learning rate 1.0000 ; Source tokens/s 21 ; Perplexity 752172.03
Epoch 1 ; Iteration 10/498 ; Learning rate 1.0000 ; Source tokens/s 39 ; Perplexity 1084547.11
Epoch 1 ; Iteration 15/498 ; Learning rate 1.0000 ; Source tokens/s 63 ; Perplexity 5954681.21
Epoch 1 ; Iteration 20/498 ; Learning rate 1.0000 ; Source tokens/s 85 ; Perplexity 14294982.96
Epoch 1 ; Iteration 25/498 ; Learning rate 1.0000 ; Source tokens/s 106 ; Perplexity 6903565.57
Epoch 1 ; Iteration 30/498 ; Learning rate 1.0000 ; Source tokens/s 121 ; Perplexity 8179746.68
Epoch 1 ; Iteration 35/498 ; Learning rate 1.0000 ; Source tokens/s 137 ; Perplexity 7248653.64
Epoch 1 ; Iteration 40/498 ; Learning rate 1.0000 ; Source tokens/s 149 ; Perplexity 6493739.47
Epoch 1 ; Iteration 45/498 ; Learning rate 1.0000 ; Source tokens/s 160 ; Perplexity 4000300.08
/home/ari/torch7/install/bin/luajit: not enough memory

I faced similar issues with the same dataset on seq2seq, but was able to get past them by adding the following code, evaluated at each iteration:
if i % 20 == 0 then
  collectgarbage()
end
With OpenNMT it just gives me a few more iterations, and the error message changes to:
PANIC: unprotected error in call to Lua API (not enough memory)
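For reference, how full the Lua heap actually is can be checked with collectgarbage('count'), which returns the heap size in KB. A rough sketch dropped into the training loop (the reporting interval and the iteration counter i are just placeholders):

if i % 100 == 0 then
  print(string.format('lua heap: %.1f MB', collectgarbage('count') / 1024))
end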

I’m running the same dataset on an IBM POWER8-based box that has 4 P100s connected to the CPUs via NVLink, and it runs happily (at the moment on just one GPU). It seems that in the IBM PowerAI build, LuaJIT has significantly more available memory than in the x86 64-bit build (I have seen this with other projects as well).

As I’m doing most of the development work on my Ubuntu workstation, it would be nice to have some guidance/ideas on how to overcome or work with these LuaJIT memory limitations. I have tried using plain Lua instead of LuaJIT and it exhibited some quite weird behaviour with this dataset on seq2seq (not tried with OpenNMT yet): after eating all 64 GB of RAM it consumed all of swap and gave me an out-of-memory error again. Any ideas?

I remember this error being quite common with long sequence lengths on seq2seq-attn.

Maybe this could be avoided (I will look into it), but for now the best and easiest workaround is to use Lua 5.2 to get past the memory limit, and to reduce the maximum sequence length to ease memory pressure.
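If Torch was installed with the standard torch/distro script, switching to Lua 5.2 is normally a matter of rebuilding with the corresponding variable (a sketch, assuming a stock install in ~/torch; adjust the path for your setup):

cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA52 ./install.sh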

Also make sure to use the latest version of OpenNMT.

Keep us updated!

Maybe we should switch to tds in the core data code as well? We are also pretty interested in really long inputs/outputs.
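For context, tds containers keep their storage in C-allocated memory rather than on the Lua heap, so they do not count towards the LuaJIT limit. A minimal illustrative sketch (not the actual OpenNMT data code):

local tds = require('tds')

-- corpus and vocabulary held in off-heap containers instead of plain Lua tables
local sentences = tds.Vec()
sentences:insert('a first tokenized sentence')
sentences:insert('a second tokenized sentence')

local vocab = tds.Hash()
vocab['hello'] = 1
vocab['world'] = 2

print(#sentences, vocab['hello'])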

Apparently max_batch_size has a direct impact on the LuaJIT memory footprint. With a batch size of 6 this works happily on x86 as well. It seems to be quite a bit slower though…

Interesting. Batch size should only impact tensor sizes, which do not count towards the LuaJIT memory limit.
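That is because torch tensor storages are allocated in C memory. A quick hedged check (assuming a working torch install) shows the Lua heap barely moving when a large tensor is created:

require('torch')
collectgarbage()
print(('lua heap before: %.1f MB'):format(collectgarbage('count') / 1024))
local t = torch.FloatTensor(64, 1000, 1000)  -- ~256 MB of tensor storage, held in C memory
print(('lua heap after:  %.1f MB'):format(collectgarbage('count') / 1024))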

Hopefully I will manage to reproduce the error consistently and work on that.

As of now, what is the maximum length on the input side?

In summarization, if we don’t want to be in a “1 sentence to 1 sentence” situation, we may need very long input segments (1 paragraph made of several sentences, separated by periods).

Is there any limitation for this right now?

There are no hard coded limits. With two caveats:

  1. You should probably use Lua 5.2, not LuaJIT, until we fix this issue.
  2. GPU memory currently scales linearly with batch size and input length, so you will have to tune down the batch size for longer examples (see the example command below).

The first is a fixable technical issue, the second is a research issue. We are waiting for the paper that solves it.
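As a concrete illustration of the second point, with the Lua version the relevant knobs are -src_seq_length / -tgt_seq_length at preprocessing time and -max_batch_size at training time; the values below are only illustrative and the other options are elided:

th preprocess.lua [other options] -src_seq_length 200 -tgt_seq_length 200
th train.lua -data ari-data/5300-train.t7 -save_model cv/500x5 -layers 5 -gpuid 1 -max_batch_size 6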

Some notes about this issue:

The fundamental issue is still the same: we need to clone the graph up to the maximum sequence length, which is obviously very costly memory-wise. According to my tests, each clone introduces a fixed memory overhead that counts towards the LuaJIT memory limit (I think this is mostly tables from the nngraph implementation). So for long sequences, using Lua 5.2 is advised.
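To make the per-clone cost concrete, here is a minimal sketch (not OpenNMT’s actual code) of the usual clone-per-timestep pattern: the parameter tensors are shared, but every clone keeps its own Lua-side tables (module fields, nngraph nodes), and those sit on the heap that LuaJIT manages:

local nn = require('nn')

-- sketch of a clone-per-timestep helper
local function cloneManyTimes(net, T)
  local clones = {}
  local params, gradParams = net:parameters()
  for t = 1, T do
    local clone = net:clone()
    -- point the clone's parameters at the original storages so the weights
    -- themselves are not duplicated...
    local cloneParams, cloneGradParams = clone:parameters()
    for i = 1, #params do
      cloneParams[i]:set(params[i])
      cloneGradParams[i]:set(gradParams[i])
    end
    -- ...but each clone still carries its own bookkeeping tables, and that
    -- fixed overhead accumulates on the Lua heap
    clones[t] = clone
  end
  return clones
end

-- e.g. the dataset above, with a maximum target length of 793, needs 793 decoder clones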

This avoids the “not enough memory” crash, but it does not hide the fact that long sequences consume a lot of GPU memory even with our memory optimizations. On an 8GB GPU (which is an average size nowadays) we are able to train a large model (BRNN, 4 layers, 800 RNN dim., batch size 64) with a sequence length of 100. It uses around 7GB (with THC_CACHING_ALLOCATOR=0).

For longer sequences, reducing the batch size as @srush suggests is a direct and efficient way to reduce the memory usage.