Average models after retraining error

talaat · August 16, 2017, 3:29pm

Hello all,

I have a model trained on a dataset for a number of epochs (let’s call it model_1). After stopping training, I retrained on a subset of the data (let’s call the result model_2): after preprocessing, I used -continue and provided -src_vocab, -tgt_vocab, and -features_vocabs_prefix from the model_1 preprocessing output. I assumed that doing so would retain the same architecture, vocabulary, and options. When I use the “average_models.lua” script to average model_1 and model_2, I get this error:

/home/centos/torch/install/bin/lua: tools/average_models.lua:103: bad argument #2 to 'add' (sizes do not match at /home/centos/torch/extra/cutorch/lib/THC/generated/…/generic/THCTensorMathPointwise.cu:216)
stack traceback:
[C]: in function 'add'
tools/average_models.lua:103: in function 'main'
tools/average_models.lua:116: in main chunk
[C]: in function 'dofile'
…ntos/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?

I assumed that the model architecture and parameters were retained, so the two models could be averaged easily, but that doesn’t seem to be the case.

Any explanation or help would be greatly appreciated!

Thanks,
Talaat

guillaumekln · August 16, 2017, 3:46pm

Hi,

Could you share the command lines, or at least the options, you used for the initial training and for the retraining?

Thanks.

talaat · August 17, 2017, 12:57pm

Hey Guillaume,
Sure. For the initial preprocessing and training:

PREPROCESS_OPTS= -src_vocab_size 0 -tgt_vocab_size 0

TRAIN_OPTS= -gpuid 1 2 -learning_rate_decay 0.7 -word_vec_size 1000 -rnn_size 1000 -layers 4 -start_decay_at 20 -save_every 3000 -brnn -residual -max_batch_size 120 -async_parallel -async_parallel_minbatch 6000

For Retraining:

PREPROCESS_OPTS= -src_vocab_size 0 -tgt_vocab_size 0 -src_vocab {src vocab file from the training step} -tgt_vocab {tgt vocab file from the training step} -features_vocabs_prefix {features vocab file from the training step}

RETRAIN_OPTS= -gpuid 1 2 -continue -save_every 3000 -max_batch_size 120 -async_parallel -async_parallel_minbatch 6000 -train_from {train_model.t7} -data {retrain preprocessed data}

Thanks a lot!

guillaumekln · August 17, 2017, 1:25pm

It seems I can’t reproduce the issue. Could you check that you did not make an error selecting or naming your models?

Also, it would be helpful to display the list of parameters as seen by the average_models.lua script:

diff --git a/tools/average_models.lua b/tools/average_models.lua
index de5a0c2..f288b26 100644
--- a/tools/average_models.lua
+++ b/tools/average_models.lua
@@ -98,6 +98,8 @@ local function main()
         error('unable to load the model (' .. err .. ').')
       end
       local params = gatherParameters(checkpoint.models)
+      print(averageParams)
+      print(params)
       for i = 1, #params do
         -- Average in place.
         averageParams[i]:mul(k-1):add(params[i]):div(k)

That way we will see where the size mismatch is.
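
For reference, the update shown in the diff (the mul(k-1):add(params[i]):div(k) chain) is a running mean, and it can only work when the i-th parameter has the same size in every checkpoint. The following is a minimal Python sketch of that idea, not the Lua script itself: parameters are modelled as flat lists of floats, and the function name running_average is hypothetical. It names the parameter that differs instead of failing inside the addition.

```python
def running_average(checkpoints):
    """Average checkpoints; each checkpoint is a list of flat parameter lists."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    # Start from a copy of the first checkpoint's parameters.
    average = [list(p) for p in checkpoints[0]]
    for k, params in enumerate(checkpoints[1:], start=2):
        if len(params) != len(average):
            raise ValueError(
                "checkpoint %d has %d parameters, expected %d"
                % (k, len(params), len(average)))
        for i, new in enumerate(params):
            if len(new) != len(average[i]):
                # This is the situation behind "sizes do not match".
                raise ValueError(
                    "parameter %d: size %d in checkpoint %d, expected %d"
                    % (i, len(new), k, len(average[i])))
            # Same arithmetic as mul(k-1):add(params[i]):div(k).
            average[i] = [(a * (k - 1) + b) / k
                          for a, b in zip(average[i], new)]
    return average


if __name__ == "__main__":
    # Two toy checkpoints with matching parameter sizes.
    same_shape = [[[1.0, 2.0], [10.0]], [[3.0, 4.0], [20.0]]]
    print(running_average(same_shape))
    # A second checkpoint whose first parameter has a different size.
    try:
        running_average([[[1.0, 2.0]], [[1.0, 2.0, 3.0]]])
    except ValueError as err:
        print(err)
```

Checking sizes before the arithmetic gives the index of the mismatching parameter directly, which is the same information the two print calls in the diff expose.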

talaat · August 17, 2017, 1:56pm

Thanks a lot, Guillaume. Printing the parameters really helped!
It was my mistake. Averaging worked fine across different epochs of the same model, but I got that error when I tried to average two different models (basically, their vocabularies were different).
Thanks again for your time
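
To make the cause concrete: assuming the word embedding matrix has one row per vocabulary entry, two models built from different vocabularies end up with embedding parameters of different sizes, so they cannot be added element-wise. The sketch below is illustrative Python, not part of OpenNMT. The function names and the vocabulary sizes (50000 and 42000) are hypothetical; 1000 is the -word_vec_size from the training options earlier in the thread.

```python
def embedding_shape(vocab_size, word_vec_size):
    """Shape of an embedding table, assuming one row per vocabulary entry."""
    return (vocab_size, word_vec_size)


def find_shape_mismatches(shapes_a, shapes_b):
    """Return (index, shape_a, shape_b) for every parameter whose shape differs."""
    if len(shapes_a) != len(shapes_b):
        raise ValueError("models have a different number of parameters")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(shapes_a, shapes_b))
            if a != b]


if __name__ == "__main__":
    word_vec_size = 1000  # from the training options in this thread
    # Hypothetical vocabulary sizes for two separately preprocessed datasets;
    # the second entry stands in for any parameter unaffected by the vocabulary.
    model_a = [embedding_shape(50000, word_vec_size), (4000, 1000)]
    model_b = [embedding_shape(42000, word_vec_size), (4000, 1000)]
    for index, a, b in find_shape_mismatches(model_a, model_b):
        print("parameter %d: %s vs %s" % (index, a, b))
```

Checkpoints saved at different epochs of the same training run share one vocabulary, so a check like this would report no mismatches for them.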