I will retrain my model on a corpus that contains new tokens related to context information, like < doc1 >. The embeddings for these new tokens depend on the original vocab embeddings.
So I want to extract the vocab embeddings after the first training, compute the < doc > embeddings, and restart my training while specifying a file for all embeddings (new tokens + original ones). In fact I do not mind whether they are fixed or not, at least as a first step.
Is it possible? Otherwise I would be forced to fix the embeddings from the beginning, which significantly decreases performance (on my transformer_big model).
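To make it concrete, here is the kind of model definition I have in mind, where the source inputter would read its embeddings from a file referenced in the data configuration. This is only a sketch: the exact WordEmbedder arguments (embedding_file_key, trainable) and the data key names are assumptions on my side and may differ in your version.

```python
import opennmt as onmt

# Sketch of the intended setup: the embedding file (original + new < doc >
# tokens) would be referenced by a key in the data configuration.
# Argument names are assumed from the documentation and may differ.
model = onmt.models.Transformer(
    source_inputter=onmt.inputters.WordEmbedder(
        vocabulary_file_key="source_words_vocabulary",
        embedding_file_key="source_embedding",  # file with all embeddings
        trainable=True),                        # or False to fix them
    target_inputter=onmt.inputters.WordEmbedder(
        vocabulary_file_key="target_words_vocabulary",
        embedding_size=1024),
    num_layers=6,
    num_units=1024,
    num_heads=16,
    ffn_inner_dim=4096)
```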
Thanks, I managed to extract my embeddings from the tf.Variable transformer/encoder/w_embs.
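For reference, the extraction can be done with a checkpoint reader along these lines (the model directory path is mine, adjust it to yours):

```python
import tensorflow as tf

# Read the encoder embedding matrix directly from the latest checkpoint.
checkpoint_path = tf.train.latest_checkpoint("model_dir")
reader = tf.train.load_checkpoint(checkpoint_path)
embeddings = reader.get_tensor("transformer/encoder/w_embs")
print(embeddings.shape)  # (number of rows, embedding depth)
```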
It seems they are aligned with my vocab file, which is great, but there is a little thing that annoys me:
When I map the extracted embeddings to my vocab file, there is one extra embedding that corresponds to nothing in my vocab.
I don't know if it is an additional embedding or if there is a shift somewhere in my mapping.
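Here is the quick check I am running to tell a shift apart from a single extra row; my guess (to be confirmed) is that the extra row is the out-of-vocabulary entry appended after the vocabulary, in which case the mapping itself would not be shifted:

```python
import tensorflow as tf

reader = tf.train.load_checkpoint(tf.train.latest_checkpoint("model_dir"))
embeddings = reader.get_tensor("transformer/encoder/w_embs")

# Compare the vocabulary size with the number of embedding rows.
with open("src-vocab.txt") as vocab_file:  # my source vocabulary file
    vocab = [line.rstrip("\n") for line in vocab_file]

print("vocab size:     ", len(vocab))
print("embedding rows: ", embeddings.shape[0])
# If there is exactly one more row than vocabulary entries and the first
# rows line up with the vocabulary from index 0, the mapping is not
# shifted and the last row is the extra (likely out-of-vocabulary) entry.
```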
Regarding the original question: say I have trained my model for some steps, and I can now extract my embeddings and make some modifications to them (forget about updating the vocab, I do not need that anymore).
Now I want to restart training for a few steps with my modified embeddings, and I have the following question:
Do I need to load my modified embeddings in a similar way to how I extracted them? (I guess by modifying my checkpoint in a Python script)
I don’t think there is currently an easy way to change the embeddings mid-training. You can either rewrite the checkpoint or define and run a hook similar to this one to load your new embeddings.
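For the checkpoint option, a rough and untested sketch (assuming a TF1-based version, a source-side only change, and a placeholder .npy file holding your modified matrix) would be to rebuild every variable and save a new checkpoint:

```python
import numpy as np
import tensorflow as tf

checkpoint_path = tf.train.latest_checkpoint("model_dir")
reader = tf.train.load_checkpoint(checkpoint_path)
new_embeddings = np.load("modified_embeddings.npy")  # placeholder file name

# Recreate every variable from the checkpoint, swapping in the new matrix.
tf.reset_default_graph()
variables = []
for name in reader.get_variable_to_shape_map():
    value = reader.get_tensor(name)
    if name == "transformer/encoder/w_embs":
        value = new_embeddings
    variables.append(tf.Variable(value, name=name))

# Save under a new directory and point the next training run to it.
saver = tf.train.Saver(variables)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "model_dir_updated/model.ckpt")
```

Note that the modified matrix must keep the same shape as the original one, otherwise restoring the new checkpoint will fail.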
To fix (freeze) the embeddings, you could set the trainable argument in the model definition.
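For example (sketch, the exact argument placement may differ across versions):

```python
import opennmt as onmt

# Freeze the source embeddings so they are excluded from training updates.
source_inputter = onmt.inputters.WordEmbedder(
    vocabulary_file_key="source_words_vocabulary",
    embedding_size=1024,
    trainable=False)
```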