Hello @pttr! For speeding up with parallelism, the main constraints are generally the hardware and the synchronization time between GPUs. Whatever you do, you need to make sure that the time to transfer data between GPUs is small compared to the compute time. Your parameters are the network size and the batch size, but the former is not always interesting to change, while the latter has some limitations.
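To give a rough idea of that constraint, here is a back-of-envelope sketch (all numbers are hypothetical, not measurements from my setup): parallelism only pays off when the per-step gradient transfer time is much smaller than the per-step compute time.

```python
# Illustrative transfer-vs-compute estimate; the model size, step time and
# bandwidth below are made-up example values, not benchmarks.
def transfer_time_s(num_params, bytes_per_param=4, bandwidth_gbs=10.0):
    """Time to move one full copy of the gradients over the interconnect."""
    return num_params * bytes_per_param / (bandwidth_gbs * 1e9)

def step_ratio(num_params, compute_time_s, bandwidth_gbs=10.0):
    """Transfer/compute ratio; parallelism pays off when this is << 1."""
    return transfer_time_s(num_params, bandwidth_gbs=bandwidth_gbs) / compute_time_s

# e.g. a 50M-parameter model, 0.5 s per step, 10 GB/s effective bandwidth:
print(step_ratio(50e6, 0.5))  # 0.04 -> transfer is only 4% of compute
```

A bigger batch raises the compute time per step without changing the transfer size, which is why the batch size is the easiest knob to turn.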
In sync mode, nccl is there to reduce the latency during the replication on each GPU. Since you are using K80s, it should work quite well. I don't know why nccl does not work with LuaJIT, I will have a look. Can you please open an issue on GitHub for that?
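For clarity, here is a host-side sketch of what one synchronous step does. In practice nccl performs the all-reduce on-device; this just simulates the math with numpy to show why sync mode behaves like one big batch:

```python
import numpy as np

def sync_step(params, replica_grads, lr=0.1):
    """One synchronous data-parallel step: average the gradients from all
    replicas (the all-reduce), then apply a single update. Afterwards every
    replica holds identical weights again."""
    avg_grad = np.mean(replica_grads, axis=0)  # all-reduce, then divide by N
    return params - lr * avg_grad

params = np.zeros(3)
grads = [np.array([1.0, 2.0, 3.0]),   # gradient from replica 1
         np.array([3.0, 2.0, 1.0])]   # gradient from replica 2
print(sync_step(params, grads))  # [-0.2 -0.2 -0.2]
```

The averaged gradient is exactly what a single GPU would compute on the concatenated batches, so N sync workers are equivalent to one worker with an N-times-larger batch.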
In async mode, the problem is reduced, but 1/ you have to dedicate one GPU as a master (which is not a problem when you have 8 GPUs), and 2/ async does not work at the very beginning, so you have to start from a pretrained network (for instance after 1 epoch).
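The master/worker scheme can be sketched like this (a deliberate simplification, with hypothetical names): the master holds the parameters, and workers push gradients whenever they finish, with no barrier between them. Because updates can land on weights that have already moved, the staleness hurts most when the model is still random, which is why a short pretraining warm-up helps.

```python
import numpy as np

class Master:
    """Hypothetical sketch of the master GPU in async mode: it owns the
    parameters and applies worker gradients as they arrive."""

    def __init__(self, params, lr=0.1):
        self.params = params.copy()
        self.lr = lr

    def push_pull(self, grad):
        """A worker pushes its gradient; the master applies it immediately
        and hands back the current (possibly already updated) parameters."""
        self.params -= self.lr * grad
        return self.params.copy()

master = Master(np.zeros(2))
# Two workers pushing at different times, with no synchronization between them;
# the second worker's gradient was computed on weights that are already stale.
w1_view = master.push_pull(np.array([1.0, 0.0]))
w2_view = master.push_pull(np.array([0.0, 2.0]))
print(w2_view)  # [-0.1 -0.2]
```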
In general, you cannot push parallelism too far: in sync mode, it is equivalent to increasing the batch size, which has limitations for small networks; in async mode, you can generally use more parallel workers, but the problem is the beginning of the training.
On my side, with 8 K80 GPUs, I manage to get an average GPU usage of about 80% on all the GPUs, and a raw speed-up of about x6, but an effective speed-up (looking at perplexity) of about x3-4.
I will publish some numbers on parallel training soon.