I have a model trained on a dataset for a number of epochs (let's call it model_1). After stopping training, I retrain on a subset of the data (let's call that one model_2): after preprocessing, I use -continue and provide -src_vocab, -tgt_vocab, and -features_vocabs_prefix from the model_1 preprocessing output. I assume that doing so retains the same architecture, vocabulary, and options. However, when I use the average_models.lua script to average model_1 and model_2, I get this error:
/home/centos/torch/install/bin/lua: tools/average_models.lua:103: bad argument #2 to 'add' (sizes do not match at /home/centos/torch/extra/cutorch/lib/THC/generated/…/generic/THCTensorMathPointwise.cu:216)
stack traceback:
[C]: in function 'add'
tools/average_models.lua:103: in function 'main'
tools/average_models.lua:116: in main chunk
[C]: in function 'dofile'
…ntos/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
I assumed that the model architecture and parameters were retained and could therefore be averaged easily, but that doesn't seem to be the case.
Any explanation or help would be greatly appreciated!
PREPROCESS_OPTS = -src_vocab_size 0 -tgt_vocab_size 0 -src_vocab {src vocab file from the training step} -tgt_vocab {tgt vocab file from the training step} -features_vocabs_prefix {features vocab file from the training step}
It seems I can’t reproduce the issue. Could you check that you did not make an error selecting or naming your models?
Also, it would be helpful to display the list of parameters as seen by the average_models.lua script:
diff --git a/tools/average_models.lua b/tools/average_models.lua
index de5a0c2..f288b26 100644
--- a/tools/average_models.lua
+++ b/tools/average_models.lua
@@ -98,6 +98,8 @@ local function main()
error('unable to load the model (' .. err .. ').')
end
local params = gatherParameters(checkpoint.models)
+ print(averageParams)
+ print(params)
for i = 1, #params do
-- Average in place.
averageParams[i]:mul(k-1):add(params[i]):div(k)
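For context, here is a minimal standalone sketch of what that loop does, assuming plain torch Tensors in place of an actual checkpoint (the tensor sizes are made up for illustration). The :add() call is the one at average_models.lua:103 that raises the error when the two checkpoints store parameters of different shapes:

require 'torch'

-- averageParams holds the running average over the first k-1 models,
-- params holds the parameters of the k-th model being folded in.
local averageParams = { torch.rand(10), torch.rand(5) }
local params = { torch.rand(10), torch.rand(5) }
local k = 2

for i = 1, #params do
  -- In-place running average: new_avg = ((k-1) * old_avg + new_params) / k.
  -- :add() throws 'sizes do not match' if params[i] has a different shape
  -- than averageParams[i].
  averageParams[i]:mul(k - 1):add(params[i]):div(k)
end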
Thanks a lot Guillaume, the parameters visualization really helped!
It was my mistake. Averaging worked fine across checkpoints from different epochs of the same run, but I got that error when I tried to average two different models (basically, the vocabs were different).
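That makes sense in hindsight: several parameter tensors are sized by the vocabulary, so checkpoints preprocessed with different vocabs cannot be added together. As a hypothetical illustration (the vocabulary and embedding sizes below are made up), the word embedding table alone already mismatches:

require 'nn'

-- Hypothetical vocabulary sizes for the two models; 500 is the embedding dimension.
local emb1 = nn.LookupTable(50000, 500)  -- model_1 vocabulary
local emb2 = nn.LookupTable(48000, 500)  -- model_2 vocabulary

-- The weight tensors are 50000x500 and 48000x500 respectively, so
-- emb1.weight:add(emb2.weight) fails with "sizes do not match".
print(emb1.weight:size())
print(emb2.weight:size())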
Thanks again for your time!