I developed my own system based on commit 33233f0 from Feb 23, 2017. Since I have made quite heavy changes to the original OpenNMT code, it is very difficult for me to merge my current dev version with the latest OpenNMT commit. Now I want to use multiple GPUs to speed up training, so I would like to know whether the parallel training code in the Feb 23 commit works properly. In other words, have there been any major revisions or bug fixes to the parallel training code between Feb 23 and now?
I think the files related to parallel training are train.lua, onmt/train/Trainer.lua, and onmt/utils/Parallel.lua. I compared my Trainer.lua with the latest version and found a major difference.
In the old code, the model parameters are initialized separately by each thread:
function Trainer:train(model, optim, trainData, validData, dataset, info)
  local params, gradParams = {}, {}

  onmt.utils.Parallel.launch(function(idx)
    -- Only logs information of the first thread.
    local verbose = idx == 1

    -- Initialize and get model parameters.
    _G.params, _G.gradParams = _G.model:initParams(verbose)
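To illustrate my concern with a standalone snippet (plain torch/nn, not OpenNMT code): two independently initialized copies of the same module will almost surely start with different weights, because each initialization draws fresh random numbers.

require 'nn'

local a = nn.Linear(4, 4)
local b = nn.Linear(4, 4)

-- reset() re-draws the weights uniformly at random.
a:reset(0.1)
b:reset(0.1)

print(a.weight:equal(b.weight))  -- almost surely false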
In the latest code, the model parameters are initialized once and then cloned:
onmt.utils.Parallel.launch(function(idx)
  _G.model = idx == 1 and model or onmt.utils.Tensor.deepClone(model)

  if self.params[idx] then
    _G.params, _G.gradParams = self.params[idx], self.gradParams[idx]
  else
    _G.params, _G.gradParams = _G.model:getParams(true)
  end
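If I understand the new code correctly, a check along these lines should confirm that a clone starts from exactly the same weights as the master (just a sketch, assuming a model object exposing the getParams(true) interface shown above):

local replica = onmt.utils.Tensor.deepClone(model)
local masterParams = model:getParams(true)
local replicaParams = replica:getParams(true)

for i = 1, #masterParams do
  assert(masterParams[i]:equal(replicaParams[i]), 'clone differs in tensor ' .. i)
end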
I think the latter is the right way, because in the former, different threads may start with different initial parameter values. Am I right? Are there any other important differences between the old and latest code that I missed?
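In my own fork I am considering working around the old behavior by broadcasting the first replica's parameters right after initialization, along these lines (syncReplicaParams is my own hypothetical helper; params is assumed to be the per-thread table of flattened parameter tensors collected after the launch call returns):

-- Copy the first replica's parameter tensors into every other replica
-- so that all threads start from identical weights.
local function syncReplicaParams(params)
  for replica = 2, #params do
    for i = 1, #params[1] do
      -- copy() transfers across devices if tensors live on different GPUs.
      params[replica][i]:copy(params[1][i])
    end
  end
end

syncReplicaParams(params)

Would that be enough to make the old code path safe, or am I missing something?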
- Da Xiao