ValueError when retraining a Transformer

Hi everyone,

I trained a Transformer model with a 50,000-word vocabulary; let's call it Model 1. I trained it for 100,000 steps using the command below. Experiment2_Model2_2 is Model 1's directory.

onmt-main train_and_eval --model_type Transformer --auto_config --config ~/Experiment2_Model2_2/config.yml --num_gpus 3

What I want to do is take the weights from Model 1 and use them as the initialization for Model 2. So I created a directory for Model 2 and put the new training data, evaluation data, and a 50,000-word vocabulary in it. Then I trained Model 2 with the command below. Experiment2_ModelTransfer is Model 2's directory.

onmt-main train_and_eval --model_type Transformer --auto_config --config ~/Experiment2_ModelTransfer/config.yml --checkpoint_path ~/Experiment2_Model2_2/model.ckpt-102826 --num_gpus 3

But when I ran that command, I got this error:

INFO:tensorflow:Training on 298532 examples
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Number of trainable parameters: 120992081
Traceback (most recent call last):
  File "/home/fhadli/anaconda3/envs/fhadli/bin/onmt-main", line 10, in <module>
    sys.exit(main())
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/opennmt/runner.py", line 297, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 439, in train_and_evaluate
    executor.run()
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 518, in run
    self.run_local()
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 650, in run_local
    hooks=train_hooks)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 363, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 859, in _train_model_default
    saving_listeners)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1056, in _train_with_estimator_spec
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 405, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 532, in __init__
    h.begin()
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/opennmt/utils/hooks.py", line 295, in begin
    tf_vars.append(tf.get_variable(name, shape=value.shape, dtype=tf.as_dtype(value.dtype)))
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1317, in get_variable
    constraint=constraint)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1079, in get_variable
    constraint=constraint)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 425, in get_variable
    constraint=constraint)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 394, in _true_getter
    use_resource=use_resource, constraint=constraint)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 738, in _get_single_variable
    found_var.get_shape()))
ValueError: Trying to share variable transformer/decoder/dense/bias, but specified shape (48373,) and found shape (50001,).

These are the sizes of my w_emb files:

  1. Model 1
    transformer_decoder_w_embs.txt = 48393
    transformer_encoder_w_embs.txt = 50021
  2. Model 2
    transformer_decoder_w_embs.txt = 50002
    transformer_encoder_w_embs.txt = 50001

Does anyone know how to resolve this error? I could not find a similar issue, so I will provide more information if needed. Thank you!

Hi,

You should use the script onmt-update-vocab to change the vocabulary size of a checkpoint.
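
For example, something like this (the paths below are placeholders; point the old flags at Model 1's files and the new flags at Model 2's files):

onmt-update-vocab --model_dir <model1_dir> --output_dir <model2_dir> --src_vocab <model1_dir>/src-vocab.txt --tgt_vocab <model1_dir>/tgt-vocab.txt --new_src_vocab <model2_dir>/src-vocab.txt --new_tgt_vocab <model2_dir>/tgt-vocab.txt --mode replace

The --mode option accepts "merge" (the updated checkpoint uses the union of the old and new vocabularies) or "replace" (the updated checkpoint uses exactly the new vocabularies, reusing the rows of words that exist in both). The updated checkpoint is written to --output_dir.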

Hi, thank you for your fast response.

Yes, I just realized that I had to run onmt-update-vocab before training again. So I updated the checkpoint with this command:

onmt-update-vocab --model_dir=/home/fhadli/Experiment2_Model2_2/ --output_dir=/home/fhadli/Experiment2_ModelTransfer --src_vocab=/home/fhadli/Experiment2_Model2_2/src-vocab.txt --tgt_vocab=/home/fhadli/Experiment2_Model2_2/tgt-vocab.txt --new_src_vocab=/home/fhadli/Experiment2_ModelTransfer/src-vocab.txt --new_tgt_vocab=/home/fhadli/Experiment2_ModelTransfer/tgt-vocab.txt --mode replace

After that, I ran the training again using the command below. Experiment2_Model2_2 is my old directory and Experiment2_ModelTransfer is the new directory where I saved the updated checkpoint:

onmt-main train_and_eval --model_type Transformer --auto_config --config ~/Experiment2_Model2_2/config.yml --checkpoint_path ~/Experiment2_ModelTransfer/model.ckpt-102826 --num_gpus 3

In my config file, I set the model directory to Experiment2_ModelTransfer (the new directory), but I got this error:

INFO:tensorflow:Training on 23134059 examples
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Number of trainable parameters: 119323381
Traceback (most recent call last):
  File "/home/fhadli/anaconda3/envs/fhadli/bin/onmt-main", line 10, in <module>
    sys.exit(main())
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/opennmt/bin/main.py", line 172, in main
    runner.train_and_evaluate(checkpoint_path=args.checkpoint_path)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/opennmt/runner.py", line 297, in train_and_evaluate
    result = tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 439, in train_and_evaluate
    executor.run()
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 518, in run
    self.run_local()
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/training.py", line 650, in run_local
    hooks=train_hooks)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 363, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 843, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 859, in _train_model_default
    saving_listeners)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 1056, in _train_with_estimator_spec
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 405, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 816, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 532, in __init__
    h.begin()
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/opennmt/utils/hooks.py", line 295, in begin
    tf_vars.append(tf.get_variable(name, shape=value.shape, dtype=tf.as_dtype(value.dtype)))
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1317, in get_variable
    constraint=constraint)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 1079, in get_variable
    constraint=constraint)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 425, in get_variable
    constraint=constraint)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 394, in _true_getter
    use_resource=use_resource, constraint=constraint)
  File "/home/fhadli/anaconda3/envs/fhadli/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 738, in _get_single_variable
    found_var.get_shape()))
ValueError: Trying to share variable transformer/decoder/dense/bias, but specified shape (64288,) and found shape (48373,).

The thing that confuses me is that my old and new vocabularies have the same size (50,000), yet the shapes still differ. Do you have any suggestions on what I should do? Thank you!

Did you also set the new vocabulary in the configuration file?

Yes, my config file looks like this. The training data, eval data, and vocabularies are the new ones. The vocab size is 50,000 (the same as the old model), while the training and eval sets have a different (smaller) size.

model_dir: /home/fhadli/Experiment2_ModelTransfer/

data:
  train_features_file: /home/fhadli/Experiment2_ModelTransfer/train_idzh_zh.txt
  train_labels_file: /home/fhadli/Experiment2_ModelTransfer/train_idzh_id.txt

  eval_features_file: /home/fhadli/Experiment2_ModelTransfer/eval_idzh_zh.txt
  eval_labels_file: /home/fhadli/Experiment2_ModelTransfer/eval_idzh_id.txt

  source_words_vocabulary: /home/fhadli/Experiment2_ModelTransfer/src-vocab.txt
  target_words_vocabulary: /home/fhadli/Experiment2_ModelTransfer/tgt-vocab.txt

Another question: do you know what the files decoder-merge-tgt-vocab.txt and encoder-merge-src-vocab.txt are for? After I run this command, those files always appear.

onmt-main train_and_eval --model_type Transformer --auto_config --config ~/Experiment2_ModelTransfer/config.yml --checkpoint_path ~/Experiment2_ModelTransfer/model.ckpt-102826 --num_gpus 3

Thank you

Did you use the "merge" mode in onmt-update-vocab? If so, those merged files are the vocabularies you should set in your configuration file.
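
For example, if the merge produced encoder-merge-src-vocab.txt and decoder-merge-tgt-vocab.txt in your output directory, the data section of your config would point to them, along the lines of (a sketch, assuming the files are in Experiment2_ModelTransfer):

data:
  source_words_vocabulary: /home/fhadli/Experiment2_ModelTransfer/encoder-merge-src-vocab.txt
  target_words_vocabulary: /home/fhadli/Experiment2_ModelTransfer/decoder-merge-tgt-vocab.txt

A merged vocabulary is larger than either original, which would also match the (64288,) shape reported in your second error.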