I’m training on a dataset of about 2 million sentences with a sequence length cap of 500, using Lua 5.2 as suggested elsewhere.
I’d like to know what the practical limits are on the size of models that can be trained with this setup. For my current problem, should I simply step the batch size down from 64 until I no longer run out of memory?
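Before lowering the batch size blindly, I wanted to see how much headroom each card actually has. This is a minimal sketch, assuming the cutorch bindings in this Torch install expose getDeviceCount and getMemoryUsage (standard in recent cutorch):

-- print free vs. total memory on every visible GPU before launching train.lua
require 'cutorch'

for dev = 1, cutorch.getDeviceCount() do
  local freeBytes, totalBytes = cutorch.getMemoryUsage(dev)
  print(string.format('GPU %d: %.2f GB free of %.2f GB',
                      dev, freeBytes / 2^30, totalBytes / 2^30))
end

The log from the failing run is below.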
[03/01/17 16:10:18 INFO] Using 8 threads on 8 GPUs
[03/01/17 16:10:19 WARNING] For improved efficiency in nparallel mode - do install nccl
[03/01/17 16:10:19 INFO] Loading data from 'filename masked out'...
[03/01/17 16:13:22 INFO] * vocabulary size: source = 50004; target = 6
[03/01/17 16:13:22 INFO] * additional features: source = 0; target = 0
[03/01/17 16:13:22 INFO] * maximum sequence length: source = 457; target = 458
[03/01/17 16:13:22 INFO] * number of training sentences: 2269951
[03/01/17 16:13:22 INFO] * maximum batch size: 64
[03/01/17 16:13:22 INFO] Building model...
[03/01/17 16:13:25 INFO] * using input feeding
[03/01/17 16:14:25 INFO] Initializing parameters...
[03/01/17 16:14:32 INFO] * number of parameters: 39364558
[03/01/17 16:14:32 INFO] Preparing memory optimization...
[03/01/17 16:14:33 INFO] * sharing 70% of output/gradInput tensors memory between clones
[03/01/17 16:15:00 INFO] Start training...
[03/01/17 16:15:00 INFO]
THCudaCheck FAIL file=/distro/extra/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/distro/install/bin/lua: /distro/install/share/lua/5.2/threads/threads.lua:183: [thread 4 callback] /distro/install/share/lua/5.2/nngraph/nesting.lua:34: cuda runtime error (2) : out of memory at /distro/extra/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'resizeAs'
/distro/install/share/lua/5.2/nngraph/nesting.lua:34: in function 'resizeNestedAs'
/distro/install/share/lua/5.2/nngraph/gmodule.lua:37: in function 'getTotalGradOutput'
/distro/install/share/lua/5.2/nngraph/gmodule.lua:404: in function 'neteval'
/distro/install/share/lua/5.2/nngraph/gmodule.lua:454: in function 'updateGradInput'
./onmt/modules/Network.lua:16: in function 'updateGradInput'
/distro/install/share/lua/5.2/nngraph/gmodule.lua:420: in function 'neteval'
/distro/install/share/lua/5.2/nngraph/gmodule.lua:454: in function 'updateGradInput'
/distro/install/share/lua/5.2/nn/Module.lua:31: in function 'backward'
./onmt/modules/Decoder.lua:348: in function 'backward'
train.lua:252: in function 'trainNetwork'
train.lua:284: in function <train.lua:275>
(...tail calls...)
[C]: in function 'xpcall'
/distro/install/share/lua/5.2/threads/threads.lua:234: in function 'callback'
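For what it's worth, this is the back-of-envelope I use to reason about why batch 64 at sequence lengths near 458 is heavy: with input feeding, the unrolled encoder/decoder activations have to be kept for the backward pass, so memory grows roughly with batch size × sequence length × rnn_size × layers. This is a very rough sketch only, assuming 4-byte floats and the default 2-layer, 500-unit setup, and it ignores attention/softmax buffers and the 70% tensor sharing reported above:

-- crude activation-memory estimate for one unrolled forward/backward pass;
-- assumes 4-byte floats and the default 2-layer, 500-unit LSTMs; a
-- back-of-envelope guess, not OpenNMT's actual memory accounting
local function roughActivationGB(batchSize, seqLen, rnnSize, layers)
  -- per time step: hidden + cell states plus gate activations, roughly 8 x rnnSize per layer
  local floatsPerStep = batchSize * rnnSize * layers * 8
  -- encoder and decoder are both unrolled over the full sequence and kept for backprop
  local totalFloats = floatsPerStep * seqLen * 2
  return totalFloats * 4 / 2^30
end

print(roughActivationGB(64, 458, 500, 2))  -- current settings
print(roughActivationGB(32, 458, 500, 2))  -- halving the batch roughly halves it

If that rough scaling holds, the batch size and the sequence length cap are the two knobs that matter most here, which is why I'm asking whether stepping the batch size down (or tightening the length cap at preprocessing time) is the intended fix.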