Retrain a model with fixed embeddings

Hi

I want to retrain a Transformer model and fine-tune it with new data.

My vocabulary will grow, so I will update it with onmt-update-vocab.

My model was set to train embeddings in the first phase, but I would like to fix them for the retraining. Is that possible?

I guess I will need to extract the embeddings from the model; is that implemented in the TensorFlow version?

Thanks,
Valentin

Hi,

If you are going to update the vocabulary, why would you fix the embeddings? Or do you only want to fix the embeddings of the words that were not updated?

Hi,

I will retrain my model on a corpus that contains new tokens carrying context information, like < doc1 >. The embeddings for these new tokens depend on the original vocabulary embeddings.

So I want to extract the vocabulary embeddings after the first training, compute the < doc > embeddings, and restart training with a file specifying every embedding (new tokens + original ones). In fact, I do not mind whether they are fixed or not, at least as a first step.

Is that possible? Otherwise I would be forced to fix the embeddings from the beginning, which significantly decreases performance (on my transformer_big model).

Thanks

If you want to extract embeddings, it’s not too complicated to inspect the checkpoint files.

See for example this function that maps variable names to their values:
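For reference, a minimal sketch of that kind of checkpoint inspection with the plain TensorFlow checkpoint reader (the checkpoint path is a placeholder, and the exact variable names depend on your model):

```python
import tensorflow as tf

# Placeholder path to an OpenNMT-tf training checkpoint.
checkpoint_path = "run/model.ckpt-50000"

# Build a {variable name: numpy array} mapping from the checkpoint.
reader = tf.train.load_checkpoint(checkpoint_path)
variables = {
    name: reader.get_tensor(name)
    for name, shape in tf.train.list_variables(checkpoint_path)
}

# List the variable names and shapes to find the embedding matrices.
for name, value in variables.items():
    print(name, value.shape)

# The source embeddings can then be accessed by name, e.g.:
# embeddings = variables["transformer/encoder/w_embs"]
```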


Thanks, I managed to extract my embeddings from the tf.Variable transformer/encoder/w_embs.

They seem to be aligned with my vocab file, which is great, but one small thing bothers me:
when I map the extracted embeddings to my vocab file, there is one extra embedding that corresponds to nothing in my vocab.

I don’t know whether it is an additional embedding or whether there is a shift somewhere in my mapping.

Any idea?

The last embedding vector is used for all tokens that are not found in the vocabulary.
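So the embedding matrix has one more row than the vocabulary file has lines, and that extra last row is shared by all out-of-vocabulary tokens. A quick sanity check, assuming the `variables` mapping from the earlier sketch and a one-token-per-line vocabulary file (the file name is a placeholder):

```python
embeddings = variables["transformer/encoder/w_embs"]  # extracted as above

with open("src-vocab.txt") as vocab_file:  # placeholder vocabulary file
    vocab = [line.rstrip("\n") for line in vocab_file]

# One extra row: the last vector is the shared out-of-vocabulary embedding.
assert embeddings.shape[0] == len(vocab) + 1

# Rows 0..len(vocab)-1 line up with the vocabulary file, in order.
token_to_embedding = dict(zip(vocab, embeddings[:-1]))
unk_embedding = embeddings[-1]
```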


Thanks a lot for your answers.

Regarding the original question: say I have trained my model for some steps, and I can now extract my embeddings and make some modifications to them (forget about updating the vocab, I don’t need it anymore).

Now I want to restart training for a few steps with my modified embeddings. I have the following two questions:

  • Do I need to load my modified embeddings in a way similar to how I extracted them? (I guess by modifying my checkpoint in a Python script.)
  • Can I fix the embeddings for this retraining?

I don’t think there is currently an easy way to change the embeddings mid-training. You can either rewrite the checkpoint or define and run a hook similar to this one to load your new embeddings.
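For the checkpoint rewriting route, here is a rough sketch using the TF1-style checkpoint API; the paths, step number, and .npy file are placeholders, and every other variable is copied unchanged so the rest of the model and optimizer state stays consistent:

```python
import os

import numpy as np
import tensorflow as tf

checkpoint_path = "run/model.ckpt-50000"        # placeholder input checkpoint
output_dir = "run_updated"                      # placeholder output directory
new_embeddings = np.load("new_embeddings.npy")  # placeholder: your modified matrix

reader = tf.train.load_checkpoint(checkpoint_path)
new_variables = []
for name, _ in tf.train.list_variables(checkpoint_path):
    value = reader.get_tensor(name)
    if name == "transformer/encoder/w_embs":
        # Swap in the modified embeddings; the shape must stay the same.
        assert value.shape == new_embeddings.shape
        value = new_embeddings.astype(value.dtype)
    new_variables.append(tf.Variable(value, name=name))

# Save all variables under their original names as a new checkpoint.
os.makedirs(output_dir, exist_ok=True)
saver = tf.train.Saver(new_variables)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, os.path.join(output_dir, "model.ckpt"),
               global_step=50000)  # placeholder: keep the original step number
```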

To fix embeddings, you could set the trainable argument in the model definition.

http://opennmt.net/OpenNMT-tf/package/opennmt.inputters.text_inputter.html#opennmt.inputters.text_inputter.WordEmbedder
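For example, in a custom model definition file (the kind passed to onmt-main with --model), something along these lines should freeze both embedding matrices. This is a sketch assuming OpenNMT-tf v1: the hyperparameters are just the usual transformer_big values, and the exact constructor arguments may differ slightly between versions:

```python
import opennmt as onmt

def model():
  return onmt.models.Transformer(
      source_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="source_words_vocabulary",
          embedding_size=1024,
          trainable=False),  # freeze source embeddings
      target_inputter=onmt.inputters.WordEmbedder(
          vocabulary_file_key="target_words_vocabulary",
          embedding_size=1024,
          trainable=False),  # freeze target embeddings
      num_layers=6,
      num_units=1024,
      num_heads=16,
      ffn_inner_dim=4096,
      dropout=0.3,
      attention_dropout=0.1,
      relu_dropout=0.1)
```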

I decided to rewrite the checkpoint and it works well, thanks
