First I have to say that what you guys have done is amazing. I’ve been playing with OpenNMT for a few weeks now and it works great with small datasets. Great job!
I’ve experienced a few issues with bigger datasets and multi-GPU training. If these are not known issues, I can also create more detailed issue reports later when I have some time (unfortunately I forgot to copy my logs before terminating my instance). Most of these are probably not directly OpenNMT issues, but maybe it makes sense to see whether there are workarounds for them.
Running train.lua with LuaJIT, I got fatal out-of-memory errors from multiple threads. I guess this is the 2 GB memory limit in LuaJIT, but I’m far from a Lua expert so this is just a guess. I’ve used an AWS p2.8xlarge (plenty of memory and 8 GPUs, so hardware shouldn’t be the issue here). I read somewhere that instead of tables you could use Torch tensors, which supposedly don’t have this limit.
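If I understood that idea correctly, it would be something like this (just my own toy sketch, not OpenNMT code, and I may well be wrong):

local torch = require('torch')

-- Numbers kept in a plain Lua table go through the Lua allocator,
-- which LuaJIT caps at roughly 2 GB in total:
local ids_table = {}
for i = 1, 1000 do ids_table[i] = i end

-- The same data in a torch.IntTensor is stored in C memory outside
-- the LuaJIT heap, so a large corpus would not count against that limit:
local ids_tensor = torch.IntTensor(1000)
for i = 1, 1000 do ids_tensor[i] = i end

print(#ids_table, ids_tensor:size(1))   -- both print 1000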
Loading the model with Lua 5.2 is slow. LuaJIT is a tad faster, but not by enough. I’m not sure whether this is an OpenNMT issue, but it takes almost an hour to load my training model. I’m not familiar with how Torch / OpenNMT loads these, so I have no idea whether this could be optimized. Is it possible to e.g. use multiple threads here?
Preprocessing also takes a long time. I wonder whether this could benefit from concurrency too?
NCCL seems to require LuaJIT (I experienced this: https://github.com/ngimel/nccl.torch/issues/6), so when running with Lua 5.2, NCCL cannot be used. So I guess getting LuaJIT to work is important because of this too.
Unfortunately I have only a little experience with Lua, so I can’t provide more details. Anyway, I’m able to train with Lua 5.2 and -no_nccl. I just wonder how much -no_nccl slows things down?
How large is your dataset in terms of sentence count?
Some comments on the points you listed:
LuaJIT’s out-of-memory error is a recurring issue. While we put effort into mitigating it for large datasets by using alternative data structures that do not rely on the Lua memory allocator, the limit can still be reached under some circumstances. For example, long sequences are hard to mitigate because they involve a lot of cloning operations. Usually, switching to Lua 5.2 is the easiest workaround, and the slowdown during training is only around 5%.
Did you change the maximum sequence length by any chance?
When does the error occur during training?
Are you positive it is a Lua memory error and not a CUDA memory error?
For model saving and loading, OpenNMT relies on Torch’s built-in serialization. It is not the fastest serializer/deserializer, but I have never seen a model that needs an hour to load. It should take a few seconds, even with Lua 5.2. Maybe something is going on with your Torch installation or your hardware. You should try loading the model on another machine.
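If you want to narrow it down, a quick timing sketch like the following isolates Torch’s deserializer from the rest of the training script (the checkpoint file name below is just a placeholder):

local torch = require('torch')

local path  = 'model_checkpoint.t7'   -- placeholder: replace with your model file
local timer = torch.Timer()
local checkpoint = torch.load(path)
print(torch.type(checkpoint))
print(string.format('torch.load took %.1f seconds', timer:time().real))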
The preprocessing is mostly a disk-intensive process. I’m not sure it would benefit from parallelism, but we can give it a shot, since we already added it to the tokenization.
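As a very rough illustration of what that could look like (just a sketch built on the 'threads' package Torch already ships with, not the actual preprocess.lua code):

local threads = require('threads')

local lines = { 'Hello world .', 'A second sentence .', 'And a third one .' }
local pool  = threads.Threads(2)   -- two worker threads

local tokenized = {}
for i, line in ipairs(lines) do
  pool:addjob(
    function()
      -- stand-in for the real tokenization / numberization work
      local tokens = {}
      for word in line:gmatch('%S+') do table.insert(tokens, word) end
      return i, tokens
    end,
    function(idx, tokens)
      -- runs back on the main thread; collect the results
      tokenized[idx] = tokens
    end
  )
end
pool:synchronize()
print(#tokenized, #tokenized[1])   -- 3 sentences, 3 tokens in the first one

Whether this helps in practice depends on how much of the time is spent reading from and writing to disk rather than tokenizing.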
I accidentally said “loading the model is slow” when I meant that loading the training data is slow.
Also, some additional notes about training with 8 GPUs. It looks like the GPUs are sitting idle quite often:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57 Driver Version: 367.57 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:17.0 Off | 0 |
| N/A 77C P0 138W / 149W | 6606MiB / 11439MiB | 94% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 0000:00:18.0 Off | 0 |
| N/A 53C P0 68W / 149W | 8172MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 0000:00:19.0 Off | 0 |
| N/A 76C P0 71W / 149W | 6825MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 0000:00:1A.0 Off | 0 |
| N/A 59C P0 82W / 149W | 7922MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla K80 Off | 0000:00:1B.0 Off | 0 |
| N/A 78C P0 71W / 149W | 7242MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla K80 Off | 0000:00:1C.0 Off | 0 |
| N/A 60C P0 83W / 149W | 8653MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla K80 Off | 0000:00:1D.0 Off | 0 |
| N/A 79C P0 78W / 149W | 8723MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla K80 Off | 0000:00:1E.0 Off | 0 |
| N/A 63C P0 81W / 149W | 7356MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
I’m not sure whether this is because of -no_nccl.
Also, based on training speed, I’m not totally convinced that the GPUs are fully used. Training a single epoch with a single GPU takes around four days, and with 8 GPUs it looks like it will take two days, i.e. only about a 2x speedup (roughly 25% scaling efficiency) from 8 GPUs.
Apparently I was using synchronous mode. I tried async mode now and got the following exception:
[02/04/17 14:41:27 INFO] Start training...
[02/04/17 14:41:27 INFO]
/home/ubuntu/torch/install/bin/lua: ...e/ubuntu/torch/install/share/lua/5.2/threads/threads.lua:183: [thread 2 callback] ./onmt/modules/Encoder.lua:131: attempt to index local 'batch' (a nil value)
stack traceback:
./onmt/modules/Encoder.lua:131: in function 'forward'
train.lua:251: in function 'trainNetwork'
train.lua:369: in function <train.lua:332>
(...tail calls...)
[C]: in function 'xpcall'
...e/ubuntu/torch/install/share/lua/5.2/threads/threads.lua:234: in function 'callback'
/home/ubuntu/torch/install/share/lua/5.2/threads/queue.lua:65: in function </home/ubuntu/torch/install/share/lua/5.2/threads/queue.lua:41>
[C]: in function 'pcall'
/home/ubuntu/torch/install/share/lua/5.2/threads/queue.lua:40: in function 'dojob'
[string " local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
[C]: in function 'error'
...e/ubuntu/torch/install/share/lua/5.2/threads/threads.lua:183: in function 'dojob'
...e/ubuntu/torch/install/share/lua/5.2/threads/threads.lua:264: in function 'synchronize'
./onmt/utils/Parallel.lua:87: in function 'launch'
train.lua:332: in function 'trainEpoch'
train.lua:421: in function 'trainModel'
train.lua:566: in function 'main'
train.lua:571: in main chunk
[C]: in function 'dofile'
...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
THCudaCheck FAIL file=/home/ubuntu/torch/extra/cutorch/lib/THC/generic/THCStorage.c line=55 error=29 : driver shutting down
Hello @pttr! For speedups with parallelism, the main constraints are generally the hardware and the synchronization time between GPUs. Whatever you do, you need to make sure that the time to transfer data between GPUs is small compared to the compute time. Your knobs are the network size and the batch size, but the former is not always interesting to change, while the latter has its own limitations.
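As a rough back-of-the-envelope sketch of that trade-off (the numbers below are assumptions for illustration, not measurements):

-- All numbers here are assumptions, only meant to show the trade-off.
local params       = 50e6        -- assumed model size in parameters
local bytes        = params * 4  -- float32 parameters/gradients
local bandwidth    = 10e9        -- assumed ~10 GB/s effective GPU-to-GPU transfer rate
local compute_time = 0.5         -- assumed seconds to compute one batch on one GPU

local transfer_time = bytes / bandwidth
print(string.format('transfer ~ %.2f s per synchronization, compute ~ %.2f s per batch',
                    transfer_time, compute_time))
-- If the transfer time is not much smaller than the compute time, extra GPUs mostly
-- add synchronization overhead; larger batches (more compute per synchronization)
-- or faster reductions (NCCL) help hide it.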
In sync mode, NCCL is there to reduce the latency of the replication on each GPU; since you are using K80s, it should work quite well. I don’t know why NCCL does not work with LuaJIT, I will have a look. Can you please open an issue on GitHub for that?
In async mode, the problem is reduced, but 1/ you have to dedicate one GPU as a master (which is not a problem when you have 8 GPUs), and 2/ async does not work at the very beginning, so you have to start from a pretrained network (for instance, after 1 epoch).
In general, you cannot push the parallelism too far. In sync mode, it is equivalent to increasing the batch size (for example, 4 replicas each computing on a batch of 64 behave like a single update with an effective batch of 256), which has limitations for small networks. In async mode, you can generally use more parallel workers, but the problem is the beginning of the training.
On my side, with 8 K80 GPUs, I manage to get an average GPU usage of about 80% across all the GPUs, with a global speed-up of about x6 but an effective speed-up (looking at perplexity) of about x3-4.
I will publish some numbers on parallel trainings soon.
I would be grateful if you could share some numbers on parallel trainings.
I am running a training on 5 GPUs using async mode with async_parallel_minbatch set to 6000 and a batch size of 120, but it looks like the training time is not reduced as much as it should be.
I remember that a previous version could balance the load across multiple GPUs. Four months ago I used 8 GPUs and could get 95% utilization on each card. But now, after updating, I see the same problem as mentioned above.
Hello, what matters in multi-GPU training is the communication speed between the GPUs. It is unlikely that you get optimal throughput with such a setup; see here for an overview of the problem.
The tests I did with 5-8 GPUs used K80 clusters or (expensive) DGX-1 machines, which are both optimized for this.
So, for mere mortals who can only afford consumer products, it seems multi-GPU setups are currently a waste of money. It’s a pity SLI is only used for graphics…
Please let me know if my understanding is correct: hardware-wise, we could probably see some benefit from using at most 2 GPUs in PCIe x16 slots, but in sync mode the benefit would be minimal, and in async mode one of the GPUs is used as a master, so…
I guess the best option with consumer products is to use only 1 GPU, the fastest possible (1080 Ti).
Also, next year we should expect consumer motherboards and CPUs with PCIe 4.0 and the new NVIDIA Volta. Do you think these would allow considerable speed improvements for consumer setups with 3 or maybe 4 GPUs?
@panosk, yes, the 2-GPU setup is affordable and effective, and it is what we generally use.
@vince62s - I don’t believe it. For the same network, whatever the framework, the lower-level layers use CUDA/NCCL and we should not see a difference. Besides, when you look at the number of GPUs used for training GNMT and divide by what you would expect for 1 GPU, you can see the huge waste.
@vince62s - Is that because of the way TF can split layers of a network across different cards? I imagine the 3.6 factor was using NVLink?
@panosk - Amazon just announced their EC2 P3 instances, which use V100 cards. The cheapest rates I’ve seen are a little over $3/hour, so it’s a reasonably affordable way to dip your toe into the water of industry-grade hardware. And it’s quick and easy to get OpenNMT up and running with nvidia-docker on AWS.