Can dispatching batches with different src_len degrade performance in synchronous training?

xiaoda99 · July 25, 2017, 11:37am

Since batches are permuted before being dispatched to worker threads, the threads get batches with different src_len. model:trainNetwork returns faster on a batch with a smaller src_len. In synchronous training mode, all other threads have to wait for the thread holding the batch with the largest src_len to finish, so parallelism is degraded.
I would like to propose a solution: say parallel.count is 4 and max_batch_size is 128. Instead of using a batch size of 128 when building the dataset and shuffling, we can use a batch size of 128 * 4 = 512. Only at dispatch time is each big batch of 512 split into 4 small batches of 128, one per thread. The threads then get batches of the same size, finish at the same time, and no waiting occurs.
Does my solution make sense, and if so, is it easy to implement?
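
To make the proposal concrete, here is a minimal sketch of the splitting step in plain Lua. The helper name splitBatch and the use of a flat array of sentence indices are assumptions for illustration only; they are not part of OpenNMT, whose batch objects carry more state than this.

```lua
-- Hypothetical helper (not an OpenNMT function): split one large batch into
-- `parts` contiguous sub-batches of near-equal size, preserving order.
local function splitBatch(items, parts)
  local chunks = {}
  local base = math.floor(#items / parts)
  local extra = #items % parts  -- the first `extra` chunks get one more item
  local pos = 1
  for p = 1, parts do
    local size = base + (p <= extra and 1 or 0)
    local chunk = {}
    for j = pos, pos + size - 1 do
      chunk[#chunk + 1] = items[j]
    end
    chunks[p] = chunk
    pos = pos + size
  end
  return chunks
end

-- Example: a big batch of 512 sentence indices split for 4 worker threads.
local big = {}
for i = 1, 512 do big[i] = i end
local subBatches = splitBatch(big, 4)
for p = 1, #subBatches do
  print(p, #subBatches[p])  -- each of the 4 sub-batches holds 128 items
end
```

Because the four sub-batches come from the same big batch, they would share whatever src_len range that batch covers, which is the property the proposal relies on.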

guillaumekln · July 25, 2017, 2:09pm

It makes sense and it is not too difficult to implement. The least pleasant part would be to implement the logic of splitting the batch.

A simpler approach in the current code would be to shuffle blocks of 4 consecutive batches, since the data is ordered by source length from the preprocessing.
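
A rough way to see why block shuffling helps is a toy cost model in plain Lua. Everything here is an assumption for illustration: the cost of a synchronous step is taken to be the largest src_len among the batches in that step, the lengths are made-up values sorted in increasing order, and none of the function names exist in OpenNMT.

```lua
-- Toy cost model (assumption): one synchronous step costs as much as its
-- slowest worker, approximated here by the largest src_len in the step.
local function syncCost(lengths, order, workers)
  local total = 0
  for first = 1, #order, workers do
    local slowest = 0
    for j = first, math.min(first + workers - 1, #order) do
      slowest = math.max(slowest, lengths[order[j]])
    end
    total = total + slowest
  end
  return total
end

-- Fisher-Yates shuffle of a Lua array, in place.
local function shuffle(t)
  for i = #t, 2, -1 do
    local j = math.random(i)
    t[i], t[j] = t[j], t[i]
  end
  return t
end

-- Shuffle whole blocks of `workers` consecutive batches, keeping the order
-- inside each block intact.
local function blockShuffledOrder(batchCount, workers)
  local blocks = {}
  for first = 1, batchCount, workers do blocks[#blocks + 1] = first end
  shuffle(blocks)
  local order = {}
  for _, first in ipairs(blocks) do
    for j = first, math.min(first + workers - 1, batchCount) do
      order[#order + 1] = j
    end
  end
  return order
end

-- Hypothetical data: 40 batches sorted by source length (batch i has src_len i).
local workers = 4
local lengths, plain = {}, {}
for i = 1, 40 do lengths[i] = i; plain[i] = i end
shuffle(plain)

local blockOrder = blockShuffledOrder(40, workers)
local plainCost = syncCost(lengths, plain, workers)
local blockCost = syncCost(lengths, blockOrder, workers)
print("plain shuffle cost:", plainCost)
print("block shuffle cost:", blockCost)
```

With sorted lengths, grouping consecutive batches gives the smallest possible sum of per-step maxima, so under this model the block-shuffled cost never exceeds the plain-shuffled one. The example uses a batch count that is a multiple of the worker count; a partial final block would need extra care to keep steps aligned.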

If you experiment with one of those approaches, we would like to hear about your results.

xiaoda99 · August 3, 2017, 3:07pm

Hi all, I would like to share some results from experimenting with this. Any feedback or advice is appreciated.
I chose the approach suggested by Guillaume: shuffling blocks of 4 consecutive batches. It’s not hard to implement.
Basically, I replaced torch.randperm with groupedRandPerm (code pasted below) to shuffle the batch order in trainEpoch. Using 3 GPUs to train a 6-layer residual network, I get a speedup of 2.33 over a single GPU, while the original OpenNMT code only reaches 1.91. I can’t test with more GPUs due to the memory limit of luajit (https://github.com/OpenNMT/OpenNMT/issues/350).
Somewhat surprisingly, I also found that using or not using nccl makes little difference.

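local function groupedRandPerm(batchCount, parallelCount)
  -- Shuffle groups of parallelCount consecutive batches; order within a group is kept.
  local groupCount = math.ceil(batchCount / parallelCount)
  local batchOrder = torch.DoubleTensor(groupCount, parallelCount)
  local batchGroupOrder = torch.randperm(groupCount)
  -- Column i holds the i-th batch index of every (shuffled) group.
  for i = 1, parallelCount do batchOrder[{{}, i}]:copy((batchGroupOrder - 1) * parallelCount + i) end
  batchOrder = batchOrder:view(batchOrder:nElement())
  -- The last group is partial when batchCount is not a multiple of parallelCount:
  -- drop the indices that would point past batchCount.
  return batchOrder:maskedSelect(batchOrder:le(batchCount))
end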
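
As a small worked example, the two speedups reported above can be turned into parallel efficiency (speedup divided by GPU count). This is only arithmetic on the numbers quoted in this post; the variable names are illustrative.

```lua
-- Speedups quoted in the post, both measured on 3 GPUs.
local gpus = 3
local speedupGrouped = 2.33   -- batch order from groupedRandPerm
local speedupOriginal = 1.91  -- batch order from torch.randperm

-- Parallel efficiency = speedup / number of GPUs (1.0 would be ideal scaling).
local effGrouped = speedupGrouped / gpus
local effOriginal = speedupOriginal / gpus
-- Relative gain of the grouped order over the original order.
local gain = speedupGrouped / speedupOriginal - 1

print(string.format("efficiency: %.1f%% vs %.1f%%, gain: %.1f%%",
  effGrouped * 100, effOriginal * 100, gain * 100))
-- efficiency: 77.7% vs 63.7%, gain: 22.0%
```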

Da Xiao

guillaumekln · August 4, 2017, 12:57am

That’s great, thanks!

Would you consider making a pull request? I think we can make this the default behavior for synchronous multi GPU.

jean.senellart · August 22, 2017, 10:40pm

@xiaoda99 - this is great - thanks for sharing. Let us know if you would prefer us to integrate the function ourselves (if you submit a PR, you would be identified as a contributor).

I am a bit surprised about your finding about nccl - what is your hardware?

Thanks
Jean