Extract / Dissect / Freeze a Transformer's Encoder/Decoder Components

Hello Community,

I am currently working on reproducing a low-resource neural machine translation approach from Kim et al. (2019), where a pivot language is introduced as a bridge between source and target, and a form of transfer learning is used to improve translation results. For this, Kim et al. first train a src-pivot model on a shared src-pivot vocabulary, freeze/extract its encoder, and afterwards train their pivot-tgt model using that first encoder. Finally, the encoder from the first step and the decoder from the second step are assembled into a third, final model that is trained on the src-tgt corpus. To be honest, I do not really expect to improve on their results, but as a student I rather wish to learn about the whole process.
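As far as I can tell, the freezing step itself should be doable in plain PyTorch once the model object is loaded; here is a minimal sketch (the 'encoder' parameter-name prefix is my assumption about how OpenNMT-py names things):

# Disable gradient updates for all encoder parameters of a loaded
# model object, so only the decoder (and generator) keep training
for name, param in model.named_parameters():
    if name.startswith('encoder'):
        param.requires_grad = False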

Currently, I am trying to use the opennmt-py library to obtain these models and components. I have managed to get a BPE model+vocab with the help of SentencePiece, produced the first model.pt files from my data, and am now wondering whether I will be able to assemble said third, final model with the help of the torch library. Is something like the following pseudo-code too naive, or basically the right way as long as I correctly identify the corresponding components?

import torch

# Load both checkpoints (OpenNMT-py stores the weights as one flat
# state dict under 'model', with 'encoder.'/'decoder.' key prefixes)
src_pivot_checkpoint = torch.load(src_pivot_model_path, map_location='cpu')
pivot_tgt_checkpoint = torch.load(pivot_tgt_model_path, map_location='cpu')

# Create a new state dict combining the src-pivot encoder
# and the pivot-tgt decoder
final_model = {
    **{k: v for k, v in src_pivot_checkpoint['model'].items()
       if k.startswith('encoder')},
    **{k: v for k, v in pivot_tgt_checkpoint['model'].items()
       if k.startswith('decoder')},
}
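If that works, I suppose I would also have to carry over the remaining checkpoint entries and save the result so that onmt can pick it up again. A rough sketch; the key names are my assumption from inspecting the .pt files:

# Reuse the pivot-tgt generator (it projects decoder states onto the
# target vocabulary, so it has to match the decoder) plus vocab/options
final_checkpoint = {
    'model': final_model,
    'generator': pivot_tgt_checkpoint['generator'],
    'vocab': pivot_tgt_checkpoint['vocab'],
    'opt': pivot_tgt_checkpoint['opt'],
}
torch.save(final_checkpoint, 'final_model.pt')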

Has someone done something similar with openNMT and can share some directions or experiences? Is this similar to domain adaptation? I’d really like to learn more about the process, so any hints are welcome :slight_smile: ! Sadly, all forum posts I could find on the topic either had no replies or were not quite what I was looking for.

Thank you for your help and time, and best regards,
Jonny

I spoke with my professor, and it seems that it is ‘just that easy’: I can simply save and exchange the tensors as long as they have the right dimensions. I’m still curious whether someone can point me in the right direction regarding these operations and models from onmt. Has someone done something similar? Is this discussed on the forums?
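In case anyone finds this later: a quick way to check that the dimensions actually line up could be to compare the combined state dict against a plain src-tgt baseline checkpoint trained with the same architecture settings (the baseline path here is hypothetical):

# Report any tensors whose shapes differ between the combined model
# and the baseline checkpoint
baseline = torch.load('src_tgt_baseline.pt', map_location='cpu')['model']
for key, tensor in final_model.items():
    if key in baseline and baseline[key].shape != tensor.shape:
        print(f'shape mismatch at {key}: '
              f'{tuple(baseline[key].shape)} vs {tuple(tensor.shape)}')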

Cheers,
Jonny

Hi Jonny,

What would happen if you did this in one step, by training a multi-source model? This feature is supported by OpenNMT-tf.
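Roughly, a multi-source model definition looks like the sketch below, adapted from the multi-source Transformer example in the OpenNMT-tf documentation (exact arguments may differ between versions; the sizes are illustrative only):

import opennmt

# Two parallel source encoders feeding one decoder
class MultiSourceTransformer(opennmt.models.Transformer):
    def __init__(self):
        super().__init__(
            source_inputter=opennmt.inputters.ParallelInputter([
                opennmt.inputters.WordEmbedder(embedding_size=512),
                opennmt.inputters.WordEmbedder(embedding_size=512)]),
            target_inputter=opennmt.inputters.WordEmbedder(embedding_size=512),
            num_layers=6,
            num_units=512,
            num_heads=8,
            ffn_inner_dim=2048)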

Another approach is to use a pre-trained multilingual model that supports all of these languages (e.g. NLLB or MADLAD) and fine-tune it for your low-resource language.

All the best,
Yasmin

Hey @ymoslem,

Thanks again for your comment.

For reference, I am trying to do the step-wise pre-training described in Section 3.1 of Kim et al. (2019).

I was wondering whether I should use OpenNMT-py or OpenNMT-tf for my experiments and opted for py+torch, since that is the level of abstraction I am most likely able to comprehend and work with. Do you reckon I should invest in the tf version instead? From what I have read in the documentation so far, switching would not make a large difference, and I do not really plan on building custom procedures when it comes to model architecture or training; that is technically out of scope for me at this point. Right now, I have worked out the structure of my data and can successfully use the onmt_train command to train both the source-pivot and the pivot-tgt model on arbitrary configurations regarding the amount of data I specify. I believe that if I learn more about fine-tuning with onmt, the last step, i.e. combining enc+dec into a final model, is not out of reach.
I do not know if I would have the kind of control I am looking for if I trained a multi-source model: not because that control isn't available, but because I don't know if I can technically manage it. Currently, I treat openNMT like a tool that needs specific arguments to train a translation model, and I am just putting in the work to provide those arguments. My hope is that I can report with certainty in my thesis what exactly happened where and why, because I implemented it that way (while learning about the specific steps each process requires).

Nevertheless, it would also be really interesting to learn, or at least hear about, how to do it the proper way. I looked up multi-source models/training and read a few forum posts about it, but I can't say it rings a bell. Would I introduce some custom symbol for each of my languages (see the sketch below for what I mean) and build a shared vocab between all of them, while trying to control the sequence in which the model sees the training data? And is this what the opennmt-tf documentation refers to as parallel and nested inputs? I have the feeling that I'm on really thin ice when it comes to concepts beyond what's obvious/introductory, to be honest.
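For clarity, this is what I mean by a custom symbol, roughly in the style of multilingual NMT with language tags (Johnson et al., 2017); the tag format and file names are just placeholders:

# Prepend a language tag to every line of a corpus file
# (tag format '<2xx>' and the file names are hypothetical)
def tag_corpus(in_path, out_path, lang):
    with open(in_path, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            fout.write(f'<2{lang}> {line}')

tag_corpus('train.src', 'train.tagged.src', 'de')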

Cheers, and have a refreshing weekend :slight_smile:,
Jonny