Training fails on Multi GPU with Torch

ktoetotam · November 14, 2018, 3:59pm

I am trying to run the training as:

th train.lua -data data/demo-train.t7 -save_model model \
        -layers 6 -rnn_size 512 -word_vec_size 512 \
        -dropout 0.1 \
        -max_batch_size 28672  \
        -optim adam  -learning_rate 0.0002 \
        -max_grad_norm 0  \
        -attention global \
        -async_parallel \
        -end_epoch 50 -gpuid 3 5

But I keep getting this exception:

[11/14/18 16:59:05 INFO] Using GPU(s): 3, 5
[11/14/18 16:59:05 WARNING] The caching CUDA memory allocator is enabled. This allocator improves performance at the cost of a higher GPU memory usage. To optimize for memory, consider disabling it by setting the environment variable: THC_CACHING_ALLOCATOR=0
FATAL THREAD PANIC: (read) /opt/torch/share/lua/5.1/torch/File.lua:343: unknown Torch class <Logger>

Training on one GPU works.
Do you know what the problem might be?

Thanks in advance!

ktoetotam · November 15, 2018, 1:35pm

We found the solution: we had to install tds and bit32 system-wide. OpenNMT seems to have problems with locally installed Lua modules.
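
For anyone hitting the same panic, here is a minimal sketch of that fix as a shell snippet. It assumes the `luarocks` that belongs to your Torch install is first on the PATH (the log above suggests Torch lives under /opt/torch, but where the binary sits is a guess), and the two function names are made up for illustration. Leaving out `--local` is what makes LuaRocks install into its default tree rather than the per-user one.

```shell
# luarocks binary to use; override LUAROCKS if the Torch one is not on PATH
# (its location under /opt/torch is an assumption based on the log above).
LUAROCKS="${LUAROCKS:-luarocks}"

# Install tds and bit32 into the default (non --local) rocks tree.
# May need root privileges, depending on who owns the Torch directory.
install_system_rocks() {
    for rock in tds bit32; do
        "$LUAROCKS" install "$rock" || return 1
    done
}

# Confirm that both rocks are visible to that luarocks afterwards.
check_system_rocks() {
    for rock in tds bit32; do
        "$LUAROCKS" show "$rock" > /dev/null || return 1
    done
}
```

After sourcing it, run `install_system_rocks && check_system_rocks` and then retry the multi-GPU command.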

guillaumekln · November 19, 2018, 8:45am

Whenever possible, we recommend using the Docker image, as it contains everything needed and is well optimized for both size and speed.