Model compression via weight tying


I may have missed something, but I could not find any methods for model compression in the current codebase.

I’d like to implement the three-way weight tying method shown here, and I’d be glad if someone could point me to the right places in the codebase.
What’s needed is to replace the two embedding matrices (source and target vocabulary) and the softmax matrix with a single shared matrix.

(srush) #2

Yeah, you could implement this. Just set the final layer’s .weight to the word embedding’s .weight matrix.
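
To make the suggestion concrete, here is a minimal sketch in PyTorch. It is not taken from the codebase: the class name TiedHeads, the attribute names, and the sizes are all made up for illustration. It assumes the source and target sides share one joint vocabulary and that the embedding size equals the size of the decoder output fed to the final layer; otherwise the three matrices would not have the same shape.

```python
import torch
import torch.nn as nn


class TiedHeads(nn.Module):
    """Hypothetical container holding only the three matrices to be tied."""

    def __init__(self, vocab_size, dim):
        super().__init__()
        self.src_embed = nn.Embedding(vocab_size, dim)
        self.tgt_embed = nn.Embedding(vocab_size, dim)
        # Final (pre-softmax) layer; its weight has shape (vocab_size, dim),
        # the same shape as an embedding matrix, so the two can be shared.
        self.generator = nn.Linear(dim, vocab_size, bias=False)

        # Three-way tying: all three modules point at one Parameter object.
        self.tgt_embed.weight = self.src_embed.weight
        self.generator.weight = self.src_embed.weight


model = TiedHeads(vocab_size=1000, dim=64)

# parameters() yields each shared Parameter once, so this counts one matrix.
n_params = sum(p.numel() for p in model.parameters())

# Quick shape check: embed some target tokens and project back to the vocabulary.
tokens = torch.randint(0, 1000, (2, 5))
logits = model.generator(model.tgt_embed(tokens))
```

Because the three modules hold the same Parameter, gradients from all three uses accumulate into one matrix, and the optimizer updates it once per step.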

Here is also a write-up that we have been using: Sequence-Level Knowledge Distillation