Out of memory: training 2M sentences with a length cap of 500

I’m training on a dataset of 2 million sentences with a sentence length cap of 500, using Lua 5.2 as suggested elsewhere.

I’d like to know the limits on the size of models that can be trained. For my current problem, should I reduce the batch size from 64 until I no longer run out of memory?

[03/01/17 16:10:18 INFO] Using 8 threads on 8 GPUs
[03/01/17 16:10:19 WARNING] For improved efficiency in nparallel mode - do install nccl
[03/01/17 16:10:19 INFO] Loading data from 'filename masked out'...
[03/01/17 16:13:22 INFO]  * vocabulary size: source = 50004; target = 6
[03/01/17 16:13:22 INFO]  * additional features: source = 0; target = 0
[03/01/17 16:13:22 INFO]  * maximum sequence length: source = 457; target = 458
[03/01/17 16:13:22 INFO]  * number of training sentences: 2269951
[03/01/17 16:13:22 INFO]  * maximum batch size: 64
[03/01/17 16:13:22 INFO] Building model...
[03/01/17 16:13:25 INFO]  * using input feeding
[03/01/17 16:14:25 INFO] Initializing parameters...
[03/01/17 16:14:32 INFO]  * number of parameters: 39364558
[03/01/17 16:14:32 INFO] Preparing memory optimization...
[03/01/17 16:14:33 INFO]  * sharing 70% of output/gradInput tensors memory between clones
[03/01/17 16:15:00 INFO] Start training...
[03/01/17 16:15:00 INFO]
THCudaCheck FAIL file=/distro/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/distro/install/bin/lua: /distro/install/share/lua/5.2/threads/threads.lua:183: [thread 4 callback] /distro/install/share/lua/5.2/nngraph/nesting.lua:34: cuda
 runtime error (2) : out of memory at /distro/extra/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
        [C]: in function 'resizeAs'
        /distro/install/share/lua/5.2/nngraph/nesting.lua:34: in function 'resizeNestedAs'
        /distro/install/share/lua/5.2/nngraph/gmodule.lua:37: in function 'getTotalGradOutput'
        /distro/install/share/lua/5.2/nngraph/gmodule.lua:404: in function 'neteval'
        /distro/install/share/lua/5.2/nngraph/gmodule.lua:454: in function 'updateGradInput'
        ./onmt/modules/Network.lua:16: in function 'updateGradInput'
        /distro/install/share/lua/5.2/nngraph/gmodule.lua:420: in function 'neteval'
        /distro/install/share/lua/5.2/nngraph/gmodule.lua:454: in function 'updateGradInput'
        /distro/install/share/lua/5.2/nn/Module.lua:31: in function 'backward'
        ./onmt/modules/Decoder.lua:348: in function 'backward'
        train.lua:252: in function 'trainNetwork'
        train.lua:284: in function <train.lua:275>
        (...tail calls...)
        [C]: in function 'xpcall'
        /distro/install/share/lua/5.2/threads/threads.lua:234: in function 'callback'

What options are you using?

A sequence length of 500 is indeed very demanding memory-wise, and the limit is your GPU memory.

You should try a smaller batch size, such as 16. Also, prefix your command line with THC_CACHING_ALLOCATOR=0; it should help a bit.
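For example, from a shell (the train.lua flags below are illustrative placeholders, not the exact options used in this thread):

```shell
# Prefixing the command sets the variable for this invocation only.
# THC_CACHING_ALLOCATOR=0 disables cutorch's caching allocator, so freed
# GPU memory is returned to the driver immediately (slower, but lowers
# peak memory usage).
THC_CACHING_ALLOCATOR=0 th train.lua -data demo-train.t7 -max_batch_size 16
```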

This is the call I’m making to subprocess.Popen() in Python; it includes the options.

[u'th', u'train.lua', u'-layers', u'2', u'-rnn_size', u'512', u'-max_batch_size', u'32', u'-optim', u'sgd', u'-brnn', u'-learning_rate', u'0.6', u'-word_vec_size', u'500', u'-gpuid', u'1', u'-end_epoch', u'20', u'-dropout', u'0.3', u'-nparallel', u'8', u'-learning_rate_decay', u'0.5']

Since you’re using subprocess to run Torch, you’d want to put the following line before the subprocess.Popen() call, assuming you’ve imported os:
os.environ['THC_CACHING_ALLOCATOR'] = '0'
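A minimal, self-contained sketch of the two usual ways to pass the variable to the child process (the train.lua arguments are abbreviated here, not the full list from above):

```python
import os
import subprocess

# Option 1: set the variable in the parent environment; a child
# launched by subprocess.Popen inherits os.environ by default.
os.environ['THC_CACHING_ALLOCATOR'] = '0'

# Option 2: build an explicit environment for the child only,
# leaving the parent process's environment untouched.
env = dict(os.environ, THC_CACHING_ALLOCATOR='0')

# The actual training call would then look like this (commented out
# here since it needs a Torch installation):
# subprocess.Popen(['th', 'train.lua', '-max_batch_size', '16'], env=env)
```

Option 2 is slightly cleaner if the Python process launches other subprocesses that shouldn't see the variable.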

500 tokens is indeed a very long sentence. :astonished:

Also, it’s definitely worth installing nccl. I saw a considerable performance increase on a 3-GPU machine. First, build and install the code from https://github.com/NVIDIA/nccl, then run luarocks install nccl.


Thanks @dbl. Will try out nccl :thumbsup: