Some experience when training with large datasets


First I have to say that what you guys have done is amazing. I’ve been playing with OpenNMT for a few weeks now and it works great with small datasets. Great job!

I’ve experienced a few issues with bigger datasets and multi-GPU training. If these are not known issues, I can create more detailed issues later when I have some time (unfortunately I forgot to copy my logs before terminating my instance). Most of these are probably not directly OpenNMT issues, but it may make sense to see whether there is a workaround for them.

  • Running train.lua with LuaJIT, I got fatal out-of-memory errors from multiple threads. I guess this is the 2 GB table limit in LuaJIT, but I’m far from a Lua expert, so this is just a guess. I used an AWS p2.8xlarge (a lot of memory and 8 GPUs, so hardware shouldn’t be the issue here). I read somewhere that instead of tables you could use Torch tensors, which supposedly don’t have this limit.

  • Loading the model with Lua 5.2 is slow. LuaJIT is a tad faster, but not enough. I’m not sure whether this is an OpenNMT issue, but it takes almost an hour to load my training model. I’m not familiar with how Torch / OpenNMT loads these, so I have no idea whether this could be optimized. Is it possible to, e.g., use multiple threads here?

  • Preprocessing also takes a long time. I wonder whether it could benefit from concurrency too?

  • NCCL seems to require LuaJIT (I ran into this myself), so when running with Lua 5.2, NCCL cannot be used. So I guess it’s important to get LuaJIT to work because of this as well.

Unfortunately I have only a little experience with Lua, so I can’t provide more details. Anyway, I’m able to train with Lua 5.2 and -no_nccl. I just wonder how much -no_nccl slows things down?

Hi and welcome to the OpenNMT community! :slight_smile:

How large is your dataset in terms of sentence count?

Some comments on the points you listed:

  • LuaJIT’s out-of-memory error is a recurring issue. While we have made an effort to mitigate it for large datasets by using alternative data structures that do not rely on the Lua memory allocator, it may still reach the limit under some circumstances. For example, long sequences are hard to mitigate as there are a lot of cloning operations. Usually, going for Lua 5.2 is the easiest workaround, and the slowdown during training is only around 5%.

    • Did you change the maximum sequence length by any chance?
    • When does the error occur during training?
    • Are you positive it is a Lua memory error and not a CUDA memory error?
  • For model saving and loading, OpenNMT relies on Torch’s built-in serialization. It is not the fastest serializer/deserializer, but I have never seen a model that takes an hour to load. It should take a few seconds, even with Lua 5.2. Maybe there is something going on with your Torch installation or your hardware. You should try loading the model on another machine.

  • The preprocessing is mostly a disk-intensive process. I’m not sure it would benefit from parallelism, but we can give it a shot, as we already added parallelism to the tokenization.

  • On the NCCL and LuaJIT point, maybe @jean.senellart knows something about this.


Dataset is around 20M segments.

I accidentally said “loading model is slow” when I meant loading the training data is slow.

Also, some additional notes about training with 8 GPUs. It looks like the GPUs are sitting idle quite often:

| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 0000:00:17.0     Off |                    0 |
| N/A   77C    P0   138W / 149W |   6606MiB / 11439MiB |     94%      Default |
|   1  Tesla K80           Off  | 0000:00:18.0     Off |                    0 |
| N/A   53C    P0    68W / 149W |   8172MiB / 11439MiB |      0%      Default |
|   2  Tesla K80           Off  | 0000:00:19.0     Off |                    0 |
| N/A   76C    P0    71W / 149W |   6825MiB / 11439MiB |      0%      Default |
|   3  Tesla K80           Off  | 0000:00:1A.0     Off |                    0 |
| N/A   59C    P0    82W / 149W |   7922MiB / 11439MiB |      0%      Default |
|   4  Tesla K80           Off  | 0000:00:1B.0     Off |                    0 |
| N/A   78C    P0    71W / 149W |   7242MiB / 11439MiB |      0%      Default |
|   5  Tesla K80           Off  | 0000:00:1C.0     Off |                    0 |
| N/A   60C    P0    83W / 149W |   8653MiB / 11439MiB |      0%      Default |
|   6  Tesla K80           Off  | 0000:00:1D.0     Off |                    0 |
| N/A   79C    P0    78W / 149W |   8723MiB / 11439MiB |      0%      Default |
|   7  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   63C    P0    81W / 149W |   7356MiB / 11439MiB |      0%      Default |

I’m not sure whether this is because of -no_nccl.
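To put a number on the idleness, here is a rough sketch that reads the GPU-Util column out of the rows pasted above. The function name and the regular expression are my own, written against this exact table layout, so treat them as assumptions rather than a robust parser. It is also a single snapshot; to see how often the GPUs are really idle you would have to sample repeatedly (for example with nvidia-smi's `--query-gpu=utilization.gpu --format=csv` option and a loop interval).

```python
import re

# Utilization rows copied verbatim from the nvidia-smi snapshot above.
SMI_ROWS = """\
| N/A   77C    P0   138W / 149W |   6606MiB / 11439MiB |     94%      Default |
| N/A   53C    P0    68W / 149W |   8172MiB / 11439MiB |      0%      Default |
| N/A   76C    P0    71W / 149W |   6825MiB / 11439MiB |      0%      Default |
| N/A   59C    P0    82W / 149W |   7922MiB / 11439MiB |      0%      Default |
| N/A   78C    P0    71W / 149W |   7242MiB / 11439MiB |      0%      Default |
| N/A   60C    P0    83W / 149W |   8653MiB / 11439MiB |      0%      Default |
| N/A   79C    P0    78W / 149W |   8723MiB / 11439MiB |      0%      Default |
| N/A   63C    P0    81W / 149W |   7356MiB / 11439MiB |      0%      Default |
"""


def gpu_utilization(smi_text):
    """Return the GPU-Util percentage of each GPU, in table order.

    Assumes the classic table layout, where the last cell of a row holds
    "<util>%" followed by the compute mode (e.g. "Default").
    """
    return [int(u) for u in re.findall(r"\|\s+(\d+)%\s+\w+\s+\|", smi_text)]


utils = gpu_utilization(SMI_ROWS)
idle_gpus = [index for index, util in enumerate(utils) if util == 0]
mean_util = sum(utils) / len(utils)

print("per-GPU utilization:", utils)
print("idle GPUs:", idle_gpus)
print("mean utilization: %.2f%%" % mean_util)
```

In this snapshot only GPU 0 is busy, so the average over the 8 cards is well below what a single fully loaded GPU would show.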

Also, based on training speed, I’m not totally convinced that the GPUs are fully used. Training a single epoch on a single GPU takes around four days, and with 8 GPUs it looks like it will take two days.
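As a quick sanity check on those two figures, the arithmetic looks like this. The function name is just for illustration, and the two-day figure is my estimate, not a finished run.

```python
def parallel_efficiency(single_gpu_days, multi_gpu_days, n_gpus):
    """Return (speedup, efficiency) for a multi-GPU run.

    speedup    = single-GPU time / multi-GPU time
    efficiency = speedup / number of GPUs (1.0 would be perfect scaling)
    """
    speedup = single_gpu_days / multi_gpu_days
    return speedup, speedup / n_gpus


# Epoch times from the post: about 4 days on 1 GPU, about 2 days on 8 GPUs.
speedup, efficiency = parallel_efficiency(4.0, 2.0, 8)
print("speedup: x%.1f, efficiency: %.0f%%" % (speedup, efficiency * 100))
```

So eight GPUs are buying roughly a x2 speedup, i.e. about a quarter of ideal linear scaling.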

We might consider an alternative data format to speed up this loading. I will look into that.

As for the multi GPU training, are you using asynchronous or synchronous mode?

Apparently I was using synchronous mode. I tried async mode now and got the following exception:

[02/04/17 14:41:27 INFO] Start training...
[02/04/17 14:41:27 INFO]
/home/ubuntu/torch/install/bin/lua: ...e/ubuntu/torch/install/share/lua/5.2/threads/threads.lua:183: [thread 2 callback] ./onmt/modules/Encoder.lua:131: attempt to index local 'batch' (a nil value)
stack traceback:
        ./onmt/modules/Encoder.lua:131: in function 'forward'
        train.lua:251: in function 'trainNetwork'
        train.lua:369: in function <train.lua:332>
        (...tail calls...)
        [C]: in function 'xpcall'
        ...e/ubuntu/torch/install/share/lua/5.2/threads/threads.lua:234: in function 'callback'
        /home/ubuntu/torch/install/share/lua/5.2/threads/queue.lua:65: in function </home/ubuntu/torch/install/share/lua/5.2/threads/queue.lua:41>
        [C]: in function 'pcall'
        /home/ubuntu/torch/install/share/lua/5.2/threads/queue.lua:40: in function 'dojob'
        [string "  local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
        [C]: in function 'error'
        ...e/ubuntu/torch/install/share/lua/5.2/threads/threads.lua:183: in function 'dojob'
        ...e/ubuntu/torch/install/share/lua/5.2/threads/threads.lua:264: in function 'synchronize'
        ./onmt/utils/Parallel.lua:87: in function 'launch'
        train.lua:332: in function 'trainEpoch'
        train.lua:421: in function 'trainModel'
        train.lua:566: in function 'main'
        train.lua:571: in main chunk
        [C]: in function 'dofile'
        ...untu/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
        [C]: in ?
THCudaCheck FAIL file=/home/ubuntu/torch/extra/cutorch/lib/THC/generic/THCStorage.c line=55 error=29 : driver shutting down

Yes, there was a recent bug… Can you update to the latest version?

Hello @pttr! For speedup with parallelism, the main constraints are generally the hardware and the synchronization time between GPUs. Whatever you do, you need to make sure that the time to transfer data between GPUs is small compared to the compute time. Your parameters are the network size and your batch size, but the former is not always interesting to change, while the latter has some limitations.
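To make the trade-off concrete, here is a toy model, not a measurement and not how OpenNMT schedules work internally. It assumes that every GPU processes one batch per step, that compute time per batch is the same as on a single GPU, and that synchronization is not overlapped with compute. The function name and all timings are hypothetical.

```python
def sync_speedup(n_gpus, compute_ms, sync_ms):
    """Throughput gain over one GPU under a simple synchronous model.

    One GPU:  1 batch every compute_ms.
    n GPUs:   n batches every (compute_ms + sync_ms).
    """
    if n_gpus == 1:
        return 1.0  # no synchronization needed with a single GPU
    return n_gpus * compute_ms / (compute_ms + sync_ms)


# Hypothetical timings: 200 ms of compute per batch, varying sync cost.
for sync_ms in (0, 50, 200, 600):
    print("sync %3d ms -> x%.1f on 8 GPUs" % (sync_ms, sync_speedup(8, 200, sync_ms)))
```

Under these assumptions, a sync cost equal to the compute time already halves the ideal speedup, which is why a larger batch (more compute per sync) or a faster interconnect (less sync time) both help.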

In sync mode, NCCL is there to reduce the latency of the replication to each GPU. Since you are using K80s, it should work quite well. I don’t know why NCCL does not work with LuaJIT, I will have a look. Can you please open an issue on GitHub for that?

In async mode, the problem is reduced, but 1/ you have to dedicate one GPU as a master (which is not a problem when you have 8 GPUs), and 2/ async does not work at the very beginning, so you have to start from a pretrained network (for instance, after 1 epoch).

In general, you cannot parallelize too much. In sync mode, it is equivalent to increasing the batch size, which has some limitations for small networks. In async mode, you can generally use more parallel workers, but the problem is the beginning of the training.

On my side, with 8 K80 GPUs, I manage to get an average GPU usage of about 80% on all the GPUs and a global speedup of about x6, but an effective speedup (looking at perplexity) of about x3-4.

I will soon publish some numbers on parallel training.

Hello Jean,

I would be grateful if you could share some numbers on parallel training.

I am running a training on 5 GPUs in async mode, with async_parallel_minbatch set to 6000 and a batch size of 120, but it looks like the training time is not reduced as much as it should be.

Do you have any advice?


Hello @Massinissa - what is your hardware?

I remember that a previous version could balance the load across multiple GPUs. Four months ago, I used 8 GPUs and got 95% on each card. But now, after updating, I see the same problem as mentioned above.

Hello Jean,

I have 5 GPUs (two 1070s, two 1080s and a 1080 Ti).
The CPU is an Intel® Xeon® CPU E3-1220 v3 @ 3.10GHz.
The motherboard is an MSI Z97 Gaming 5, with 32 GB of RAM.


Hello, what is important in multi-GPU training is the communication speed between the GPUs. It is unlikely that you get optimal throughput with such a setup - see here for an overview of the problem.
The tests I did with 5-8 GPUs used K80 clusters or the (expensive) DGX-1, both of which are optimized for this.

Hi @jean.senellart,

So, for mere mortals who can only afford consumer products, it seems multi-GPU setups are currently a waste of money. It’s a pity SLI is only used for graphics…
Please let me know if my understanding is correct: hardware-wise, we could probably get some benefit if we used at most 2 GPUs in PCIe x16 slots, but in sync mode the benefit would be minimal, and in async mode one of the GPUs is used as master, so…
I guess the best option with consumer products is to use only 1 GPU, the fastest possible (1080 Ti).
Also, next year we should expect consumer motherboards and CPUs with PCIe 4.0 and the new NVIDIA Volta. Do you think these would allow considerable speed improvements in consumer setups with 3 or maybe 4 GPUs?

I would be very interested in understanding the difference in architecture or methodology compared with TensorFlow’s training procedure.

With T2T / TF, I get a 3.8x training speedup with 4 GTX 1080 Ti cards vs 1 GPU.

@panosk, yes - the 2 GPU set-up is affordable and effective, and it is what we are using generally.

@vince62s - I don’t believe it. For the same network, whatever the framework, the lower-level layers use CUDA/NCCL and we should not see a difference. Besides, when you look at the number of GPUs used for training GNMT and divide by what you would expect for 1 GPU, you can see the huge waste.

No, you’re right, I got 3.6 times :slight_smile:

I will privately share another user’s results, with the exact same numbers.

NB: not talking about GNMT, but T2T

@vince62s - Is that b/c of the way TF can split layers of a network across different cards? I imagine the 3.6 factor was using NVLink?

@panosk - Amazon just announced their EC2 p3 instances, which use V100 cards. The cheapest rates I’ve seen are a little over $3/hour, so it’s a reasonably affordable way to dip your toe in the water of industry-grade HW. And it’s quick & easy to get OpenNMT up & running on nvidia-docker using AWS.

@dbl no NVLink.
If you want to read more, read this first post:

And my numbers I posted today, here:

It’s plain synchronous training.