I will retrain my model on a corpus that contains new tokens related to context information, like < doc1 >. The embeddings for these new tokens depend on the original vocab embeddings.
So I want to extract the vocab embeddings after the first training, compute the < doc > embeddings, and restart my training while specifying a file for all embeddings (new tokens + original ones). In fact I do not mind whether they are fixed or not, at least as a first step.
Is it possible? Otherwise I would be forced to fix the embeddings from the beginning, which significantly decreases performance (on my transformer_big model).
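To make it concrete, here is the kind of model definition I have in mind, where the source inputter would read its embeddings from a file referenced in the data configuration. This is only a sketch: the exact WordEmbedder arguments (embedding_file_key, trainable) and the data key names are assumptions on my side and may differ in your version.

```python
import opennmt as onmt

# Sketch of the intended setup: the embedding file (original + new < doc >
# tokens) would be referenced by a key in the data configuration.
# Argument names are assumed from the documentation and may differ.
model = onmt.models.Transformer(
    source_inputter=onmt.inputters.WordEmbedder(
        vocabulary_file_key="source_words_vocabulary",
        embedding_file_key="source_embedding",  # file with all embeddings
        trainable=True),                        # or False to fix them
    target_inputter=onmt.inputters.WordEmbedder(
        vocabulary_file_key="target_words_vocabulary",
        embedding_size=1024),
    num_layers=6,
    num_units=1024,
    num_heads=16,
    ffn_inner_dim=4096)
```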
Thanks, I managed to extract my embeddings from the tf.Variable transformer/encoder/w_embs.
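For reference, the extraction can be done with a checkpoint reader along these lines (the model directory path is mine, adjust it to yours):

```python
import tensorflow as tf

# Read the encoder embedding matrix directly from the latest checkpoint.
checkpoint_path = tf.train.latest_checkpoint("model_dir")
reader = tf.train.load_checkpoint(checkpoint_path)
embeddings = reader.get_tensor("transformer/encoder/w_embs")
print(embeddings.shape)  # (number of rows, embedding depth)
```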
It seems they are aligned with my vocab file, which is great, but there is a little thing that annoys me:
When I map the extracted embeddings to my vocab file, there is one extra embedding that corresponds to nothing in my vocab.
I don't know if it is an additional embedding or if there is a shift somewhere in my mapping.
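Here is the quick check I am running to tell a shift apart from a single extra row; my guess (to be confirmed) is that the extra row is the out-of-vocabulary entry appended after the vocabulary, in which case the mapping itself would not be shifted:

```python
import tensorflow as tf

reader = tf.train.load_checkpoint(tf.train.latest_checkpoint("model_dir"))
embeddings = reader.get_tensor("transformer/encoder/w_embs")

# Compare the vocabulary size with the number of embedding rows.
with open("src-vocab.txt") as vocab_file:  # my source vocabulary file
    vocab = [line.rstrip("\n") for line in vocab_file]

print("vocab size:     ", len(vocab))
print("embedding rows: ", embeddings.shape[0])
# If there is exactly one more row than vocabulary entries and the first
# rows line up with the vocabulary from index 0, the mapping is not
# shifted and the last row is the extra (likely out-of-vocabulary) entry.
```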
Regarding the original question: say I have trained my model for some steps, and I can now extract my embeddings and make some modifications to them (forget about updating the vocab, I do not need that anymore).
Now I want to restart training for a few steps with my modified embeddings, and I have the following question:
Do I need to load my modified embeddings in a similar way to how I extracted them? (I guess by modifying my checkpoint in a Python script)
I don’t think there is currently an easy way to change the embeddings mid-training. You can either rewrite the checkpoint or define and run a hook similar to this one to load your new embeddings.
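For the checkpoint option, a rough and untested sketch (assuming a TF1-based version, a source-side only change, and a placeholder .npy file holding your modified matrix) would be to rebuild every variable and save a new checkpoint:

```python
import numpy as np
import tensorflow as tf

checkpoint_path = tf.train.latest_checkpoint("model_dir")
reader = tf.train.load_checkpoint(checkpoint_path)
new_embeddings = np.load("modified_embeddings.npy")  # placeholder file name

# Recreate every variable from the checkpoint, swapping in the new matrix.
tf.reset_default_graph()
variables = []
for name in reader.get_variable_to_shape_map():
    value = reader.get_tensor(name)
    if name == "transformer/encoder/w_embs":
        value = new_embeddings
    variables.append(tf.Variable(value, name=name))

# Save under a new directory and point the next training run to it.
saver = tf.train.Saver(variables)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "model_dir_updated/model.ckpt")
```

Note that the modified matrix must keep the same shape as the original one, otherwise restoring the new checkpoint will fail.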
To fix (freeze) the embeddings, you could set the trainable argument in the model definition.
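For example (sketch, the exact argument placement may differ across versions):

```python
import opennmt as onmt

# Freeze the source embeddings so they are excluded from training updates.
source_inputter = onmt.inputters.WordEmbedder(
    vocabulary_file_key="source_words_vocabulary",
    embedding_size=1024,
    trainable=False)
```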