No, I meant this on the checkpoint level. Meaning we would convert a checkpoint trained on a set vocabulary to the same one but with a potentially different vocab.
This would require some manipulation of the model/optim parameters themselves.
Mmm, I see. So it is not required to update the src and tgt vocab files? Aren't they needed to set the known tokens? Won't all tokens outside those files be `<unk>`?
Can you give some pointers for the checkpoint manipulation?
Of course the src and tgt vocabs should be updated, but that's quite trivial: the whole idea here is to update your vocab, so you're supposed to know how you want to update it.
Here are a few pointers to what would be required:
- add the new tokens to the torchtext fields/vocab objects, retaining the original ids of existing tokens (see the sketch after this list);
- create a new model with the new vocab dimensions (`build_base_model`);
- update the new model (and generator) with the older parameters, potentially the optim as well. Inner layers such as encoder/decoder should be transferable 'as is', but the embeddings and generator layers would require some attention.
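For the first point, here is a minimal sketch of what "retaining the original ids" could look like, assuming the legacy torchtext `Vocab` with its `itos`/`stoi` attributes (`new_tokens` is just a hypothetical list of tokens to add):

```python
def extend_vocab(vocab, new_tokens):
    """Append unseen tokens so the ids of existing tokens stay unchanged."""
    for tok in new_tokens:
        if tok not in vocab.stoi:
            vocab.itos.append(tok)                 # new token gets the next free id
            vocab.stoi[tok] = len(vocab.itos) - 1
    return vocab
```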
I think I've managed to do the first point (extending existing vocabularies via torchtext's `vocab.extend()`). However, I cannot find where the vocab dimensions are in the checkpoint and where I should modify them. In `build_base_model`:
# Load the model states from checkpoint or initialize them.
if checkpoint is not None:
    # This preserves backward-compat for models using customed layernorm
    def fix_key(s):
        s = re.sub(r'(.*)\.layer_norm((_\d+)?)\.b_2',
                   r'\1.layer_norm\2.bias', s)
        s = re.sub(r'(.*)\.layer_norm((_\d+)?)\.a_2',
                   r'\1.layer_norm\2.weight', s)
        return s

    checkpoint['model'] = {fix_key(k): v
                           for k, v in checkpoint['model'].items()}
    # end of patch for backward compatibility
    model.load_state_dict(checkpoint['model'], strict=False)
    generator.load_state_dict(checkpoint['generator'], strict=False)
I guess this is where the checkpoint is loaded. Where are the vocabulary dimensions?
The vocab dimensions only come into play in the embeddings and the generator.
I think if you naively update the vocab and pass a checkpoint with the old vocab, you might encounter dimension mismatches when loading the state_dicts in the section you quoted.
What you probably want to do is call `model.load_state_dict` with only the `encoder` and `decoder` layers (pop the embeddings-related keys from your `checkpoint['model']`). And then partially replace the `embeddings` parameters of your model (where the vocab matches), as well as those of the `generator`.
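Roughly, and only as a sketch (the key names follow the OpenNMT-py checkpoint layout quoted later in this thread), that would amount to something like:

```python
# Drop the embedding-related entries so they are never loaded;
# strict=False then lets load_state_dict skip the missing keys.
for key in [k for k in checkpoint['model'] if 'embeddings' in k]:
    del checkpoint['model'][key]
model.load_state_dict(checkpoint['model'], strict=False)

# The embedding and generator weights are then copied over manually,
# row by row, for the ids that exist in both vocabs.
```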
This is what I’ve done so far:
Extend the model vocabulary with new words, appending them at the end:
Old vocab: word1 word2 word3
New vocab: word1 word3 word4
Extended vocab: word1 word2 word3 + word4
The new model's embeddings are initialized as usual, so update them with the embeddings learned in the checkpoint:
if model_opt.update_embeddings:
    # Sizes of the embedding lookup tables stored in the old checkpoint
    old_enc_emb_size = checkpoint["model"]["encoder.embeddings.make_embedding.emb_luts.0.weight"].shape[0]
    old_dec_emb_size = checkpoint["model"]["decoder.embeddings.make_embedding.emb_luts.0.weight"].shape[0]

    # Copy the learned embeddings into the first rows of the new, larger tables
    model.state_dict()["encoder.embeddings.make_embedding.emb_luts.0.weight"][:old_enc_emb_size] = \
        checkpoint["model"]["encoder.embeddings.make_embedding.emb_luts.0.weight"][:]
    model.state_dict()["decoder.embeddings.make_embedding.emb_luts.0.weight"][:old_dec_emb_size] = \
        checkpoint["model"]["decoder.embeddings.make_embedding.emb_luts.0.weight"][:]
    generator.state_dict()["0.weight"][:old_dec_emb_size] = checkpoint["generator"]["0.weight"][:]
    generator.state_dict()["0.bias"][:old_dec_emb_size] = checkpoint["generator"]["0.bias"][:]

    # Drop these keys from the checkpoint so the load_state_dict calls below
    # do not overwrite the rows we just copied
    del checkpoint["model"]["encoder.embeddings.make_embedding.emb_luts.0.weight"]
    del checkpoint["model"]["decoder.embeddings.make_embedding.emb_luts.0.weight"]
    del checkpoint["generator"]["0.weight"]
    del checkpoint["generator"]["0.bias"]

model.load_state_dict(checkpoint['model'], strict=False)
generator.load_state_dict(checkpoint['generator'], strict=False)
The code above assumes the embedding lookup tables keep the same order, since new words were appended to the vocabulary:
Old embeddings: word1 (0, learned), word2 (1, learned), word3 (2, learned)
New embeddings: word1 (0, learned), word2 (1, learned), word3 (2, learned), word4 (3, initialized as usual)
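If it helps, an assertion along these lines (with hypothetical `old_vocab`/`extended_vocab` names) would catch any accidental reordering:

```python
# Every token of the old vocab must keep its original id in the extended vocab,
# otherwise the embedding rows copied above are misaligned.
for tok, old_id in old_vocab.stoi.items():
    assert extended_vocab.stoi[tok] == old_id, f"id changed for token {tok!r}"
```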
Does this make sense?
Yes, this looks quite okay to me. The important point being what you wrote:
> The code above assumes the embedding lookup tables keep the same order
Also, you would need to make sure your `del` operations don't delete the parameters you just updated, though. Or is it running fine already?
The `del` operations only delete the embedding weights from the checkpoint. The updated embeddings are assigned to the new model being built, and then `load_state_dict` loads the rest of the learned parameters from the checkpoint into the new model. So the new embeddings are not deleted.
By the way, are you interested in a PR for this feature?
> The `del` operations only delete the embedding weights from the checkpoint. The updated embeddings are assigned to the new model being built, and then `load_state_dict` loads the rest of the learned parameters from the checkpoint into the new model. So the new embeddings are not deleted.
I thought there could be some issues since your new parameters would refer to the ones from the checkpoint objects, but if it works that’s all good.
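One aside on why the slice assignments above can work at all (this is an observation about PyTorch defaults, not about the OpenNMT-py code itself): the tensors returned by `state_dict()` share storage with the module's parameters, and slice assignment copies values rather than aliasing the checkpoint tensors. A tiny standalone check of that assumption:

```python
import torch

lin = torch.nn.Linear(4, 4, bias=False)
# Writing into a slice of a state_dict() tensor...
lin.state_dict()["weight"][:2] = torch.zeros(2, 4)
# ...is visible on the parameter itself, because the storage is shared.
assert torch.equal(lin.weight[:2], torch.zeros(2, 4))
```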
> By the way, are you interested in a PR for this feature?
Sure, that would be interesting. You would need to make this quite robust to prevent any issue with potential id/dimension mismatch, order change, etc.
It would also probably be a good idea to add a functional test as well as an entry in the docs FAQ.
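As a rough sketch of what such a functional test could assert (a toy embedding table stands in for the real model here; the actual test would of course go through the full vocab-update path):

```python
import torch

def test_vocab_extension_keeps_old_embeddings():
    # Toy setup: 5 known tokens with a 4-dim embedding table.
    old_emb = torch.nn.Embedding(5, 4)
    old_weights = old_emb.weight.detach().clone()

    # Extend to 7 tokens; the first 5 rows are copied from the "checkpoint".
    new_emb = torch.nn.Embedding(7, 4)
    new_emb.state_dict()["weight"][:5] = old_weights

    assert torch.equal(new_emb.weight[:5], old_weights)  # old rows preserved
    assert new_emb.weight.shape[0] == 7                  # room for the new tokens
```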
I guess order changes or id/dimension mismatches are not an issue.
This is what the new training procedure would be:
- Run `build_vocab` as usual. This generates the vocabulary files for the new corpora.
- Then, in the training procedure:
  - Load the checkpoint (fields are populated with the checkpoint's vocab).
  - Build the fields for the new vocabulary as usual, from the vocabulary files generated in step 1.
  - Extend the checkpoint fields' vocabulary with the new vocabulary fields (only appending new words at the end, as stated in the torchtext docs).
  - Assign the extended vocabulary to the new model's fields.
  - In `build_base_model`, after loading the checkpoint's state_dicts, replace the new model's embeddings from 0 to the checkpoint's vocab size with the checkpoint's embeddings. They should be in the same order, since in the extended vocabulary new words were appended at the end.
  - Remove the embedding parameters from the checkpoint, as the new embeddings have already been added to the model.
  - Continue as usual.
I guess you review the PR code before merging it, don't you? Would you mind checking whether something is missing from a whole-picture perspective?
Also, could you give me some guidelines on which functional tests would be necessary?
Your approach seems fine indeed. You can prepare a draft PR and I’ll have a look.
Tests: you can probably make an end-to-end test of the whole process with some toy vocab/model such as those defined here:
I have committed my changes for the draft. How can I do a PR?
I guess I need to commit my changes into a new branch and then open a new pull request, is that ok?
Hi all,
I'm also facing the problem of fine-tuning a pretrained model.
My purpose is to train a model to automatically repair some kinds of code bugs.
So first I trained a base model on a dataset of 1.7 million statements. Then I used `-train_from` for 20,000 steps with a new in-domain dataset of 10K statements. However, the performance is worse than if I train directly on the small dataset.
Some questions that confuse me:
1: Do I need to update the checkpoint vocab as you have discussed previously, or just train_from with the new 10K dataset, or should I mix the two datasets in a certain ratio?
2: Do BPE or other subword tokenizers have a significant impact on performance?
I used to tokenize only on spaces; I am now trying to use sentencepiece.
3: Do I need to use a small learning rate when fine-tuning?
I checked some discussions and they mentioned that the finetuning process should use a very small learning rate and a different warmup schedule.
Thanks
> 1: Do I need to update the checkpoint vocab as you have discussed previously, or just train_from with the new 10K dataset, or should I mix the two datasets in a certain ratio?
If you use the same tokenization and vocab you don’t need to update the checkpoint’s vocab.
Yes, especially if the domain is new, you should keep some of the original data when finetuning. You can use the weighting mechanism to subsample/oversample various datasets.
> 2: Do BPE or other subword tokenizers have a significant impact on performance?
> I used to tokenize only on spaces; I am now trying to use sentencepiece.
Then, back to question 1: it's not a good idea to finetune a model pretrained with space tokenization on sentencepiece-tokenized data.
> 3: Do I need to use a small learning rate when fine-tuning?
You'll probably need to experiment a bit there. If you're using `train_from` and retaining the states, you will already start from a later step, where the learning rate will be low. If you're resetting the states, you might want to do a bit of warmup, but your mileage may vary.
Thanks for your reply; the second point is what I did not explain clearly.
I mean I trained the base model and then finetuned it, and both steps used space tokenization. But the prediction performance is not satisfactory, so I am now trying sentencepiece.