How to use multi-GPU parallel training with an old commit on Feb 23

(Xiaoda99) #1

I have developed my own system based on commit 33233f0 from Feb 23, 2017. Since I've made quite heavy changes to the original OpenNMT code, it is very difficult for me to merge my current dev version with the latest OpenNMT commit. Now I want to use multiple GPUs to speed up training, so I want to know whether the parallel training code of the Feb 23 commit works properly. In other words, have there been any major revisions or bug fixes to the parallel training code between Feb 23 and now?
I think the files related to parallel training are train.lua, onmt/train/Trainer.lua and onmt/utils/Parallel.lua. I compared my Trainer.lua to the latest version and found one major difference.
In the old code, the model parameters are initialized multiple times, once by each thread:

function Trainer:train(model, optim, trainData, validData, dataset, info)
local params, gradParams = {}, {}

-- Only logs information of the first thread.
local verbose = idx == 1

-- Initialize and get model parameters.
_G.params, _G.gradParams = _G.model:initParams(verbose)

In the latest code, the model parameters are initialized once and cloned:

_G.model = idx == 1 and model or onmt.utils.Tensor.deepClone(model)

if self.params[idx] then
  _G.params, _G.gradParams = self.params[idx], self.gradParams[idx]
else
  _G.params, _G.gradParams = _G.model:getParams(true)
end

I think the latter is the right way, because in the former, different threads may start with different initial parameter values. Am I right? Are there any other important differences between the old and the latest code that I missed?

  • Da Xiao

(Guillaume Klein) #2

The new approach to building the models is cleaner but actually equivalent, because each thread has the same random generator state. Thus, in the old code the parameters were still initialized with the same values in every thread.
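
The equivalence argument can be illustrated with a minimal plain-Lua sketch. It is not OpenNMT code: `initReplica` is a hypothetical stand-in for a per-thread parameter initializer, and Lua's global `math.random` generator plays the role of each thread's random generator. The sketch only shows the general principle that initializers starting from the same generator state produce the same values.

```lua
-- Hypothetical stand-in for a per-thread parameter initializer (not OpenNMT code).
-- It resets the global generator to a given seed, then draws n values in [-0.1, 0.1).
local function initReplica(seed, n)
  math.randomseed(seed)
  local params = {}
  for i = 1, n do
    params[i] = (math.random() - 0.5) * 0.2
  end
  return params
end

-- Element-wise comparison of two parameter tables of equal length.
local function same(x, y)
  for i = 1, #x do
    if x[i] ~= y[i] then return false end
  end
  return true
end

local a = initReplica(1234, 5)  -- stands for thread 1
local b = initReplica(1234, 5)  -- stands for thread 2, same generator state
local c = initReplica(4321, 5)  -- a replica that starts from a different state

-- Identical generator state gives identical initial values.
assert(same(a, b))
-- Replicas seeded differently would generally diverge; print the comparison.
print(same(a, c))
```

Under this assumption, initializing in every thread and cloning from the first thread end up with the same starting parameters. Cloning simply makes that guarantee explicit instead of relying on the generator state.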

Even though the code has changed a lot since that date, I can't think of any major change made to the multi-GPU feature.