When training a model on Ubuntu 16.04 on x86, I get the following:
ari@DeepTinker01:~/OpenNMT$ th train.lua -data ari-data/5300-train.t7 -save_model cv/500x5 -layers 5 -gpuid 1 -report_every 5 -save_every 1000 -max_batch_size 12
Loading data from ‘ari-data/5300-train.t7’…
- vocabulary size: source = 43208; target = 40674
- additional features: source = 0; target = 0
- maximum sequence length: source = 417; target = 793
- number of training sentences: 5348
- maximum batch size: 12
Building model… - using input feeding
Initializing parameters… - number of parameters: 84088674
Preparing memory optimization… - sharing 71% of output/gradInput tensors memory between clones
Start training…
Epoch 1 ; Iteration 5/498 ; Learning rate 1.0000 ; Source tokens/s 21 ; Perplexity 752172.03
Epoch 1 ; Iteration 10/498 ; Learning rate 1.0000 ; Source tokens/s 39 ; Perplexity 1084547.11
Epoch 1 ; Iteration 15/498 ; Learning rate 1.0000 ; Source tokens/s 63 ; Perplexity 5954681.21
Epoch 1 ; Iteration 20/498 ; Learning rate 1.0000 ; Source tokens/s 85 ; Perplexity 14294982.96
Epoch 1 ; Iteration 25/498 ; Learning rate 1.0000 ; Source tokens/s 106 ; Perplexity 6903565.57
Epoch 1 ; Iteration 30/498 ; Learning rate 1.0000 ; Source tokens/s 121 ; Perplexity 8179746.68
Epoch 1 ; Iteration 35/498 ; Learning rate 1.0000 ; Source tokens/s 137 ; Perplexity 7248653.64
Epoch 1 ; Iteration 40/498 ; Learning rate 1.0000 ; Source tokens/s 149 ; Perplexity 6493739.47
Epoch 1 ; Iteration 45/498 ; Learning rate 1.0000 ; Source tokens/s 160 ; Perplexity 4000300.08
/home/ari/torch7/install/bin/luajit: not enough memory
I faced similar issues with the same dataset in seq2seq, but was able to get past them by adding the following code, evaluated at each iteration:
-- force a full garbage collection cycle every 20 iterations
if i % 20 == 0 then
  collectgarbage()
end
With OpenNMT this just buys me a few more iterations, and the error message changes to:
PANIC: unprotected error in call to Lua API (not enough memory)
I’m running the same dataset on an IBM POWER8-based box with 4 P100s connected to the CPUs via NVLink, and it runs happily (at the moment on just one GPU). It seems that in the IBM PowerAI build LuaJIT has significantly more memory available than in the x86 64-bit build (I have seen this with other projects as well).
As I’m doing most of the development work on my Ubuntu workstation, it would be nice to have some guidance/ideas on how to overcome or work around these LuaJIT memory limitations. I have tried using plain Lua instead of LuaJIT, and with this dataset on seq2seq it showed some quite weird behaviour (not tried with OpenNMT yet): after eating all 64 GB of RAM it consumed all of the swap as well and then gave me an out-of-memory error again. Any ideas?
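For reference, when I tried plain Lua I rebuilt Torch roughly like this (a sketch assuming the standard torch/distro install script in ~/torch; the exact steps may differ on other setups):

cd ~/torch
./clean.sh
TORCH_LUA_VERSION=LUA52 ./install.sh

That is, a clean rebuild of the Torch distro against Lua 5.2 instead of LuaJIT, which sidesteps LuaJIT's per-state memory limit at the cost of slower interpretation. As described above, with seq2seq that route just moved the problem to system RAM/swap, so I'm not sure it's the right answer for OpenNMT either.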