Finetuning with the new version of OpenNMT-py (v2.0)

anderleich · February 15, 2021, 8:58am

Hi,

I’m interested in finetuning a general-domain model on a specific domain. I’ve previously finetuned some models using the previous version of OpenNMT-py. However, I can’t find any information for the newest version. Can anyone give me some pointers?

Thanks

francoishernandez · February 15, 2021, 11:28am

The underlying model and training structure did not change in 2.0, just the data loading process. You can check how the config files work in the updated docs, and just use train_from to start from your model, and src_vocab to keep the same vocab.
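
As a rough sketch of what such a finetuning config could contain, the snippet below renders one as YAML text from Python. train_from, src_vocab and tgt_vocab are the options named in this thread; the data/path_src/path_tgt/save_model layout follows my understanding of the 2.0 config format, and every path and corpus name is a hypothetical placeholder. A real config would also need the model and optimizer options used for the original training.

```python
# Minimal sketch: render a finetuning config for onmt_train as YAML text.
# All paths and the corpus name are hypothetical placeholders.

def render_finetune_config(checkpoint, src_vocab, tgt_vocab,
                           corpus_name, path_src, path_tgt, save_model):
    """Return YAML text that resumes from a checkpoint and reuses its vocab."""
    lines = [
        f"train_from: {checkpoint}",   # existing general-domain model
        f"src_vocab: {src_vocab}",     # same vocab files as the original training
        f"tgt_vocab: {tgt_vocab}",
        f"save_model: {save_model}",
        "data:",
        f"    {corpus_name}:",
        f"        path_src: {path_src}",
        f"        path_tgt: {path_tgt}",
    ]
    return "\n".join(lines) + "\n"


config_text = render_finetune_config(
    checkpoint="general_model.pt",
    src_vocab="general.vocab.src",
    tgt_vocab="general.vocab.tgt",
    corpus_name="in-domain",
    path_src="in_domain.src",
    path_tgt="in_domain.tgt",
    save_model="finetuned_model",
)
print(config_text)
```

The resulting text could be saved to a file and passed to onmt_train with -config, without running onmt_build_vocab first.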

If you run into specific issues or have further questions, please be more precise.

anderleich · February 15, 2021, 11:31am

Should I run onmt_build_vocab or just onmt_train with train_from, src_vocab and tgt_vocab options?

francoishernandez · February 15, 2021, 11:32am

If you want to finetune from an existing model, you already have a vocab, so you don’t want to run onmt_build_vocab.

anderleich · February 15, 2021, 11:34am

Understood! Thanks!

anderleich · February 15, 2021, 11:40am

When finetuning from an averaged model, training starts from step 0. Is that correct?

[2021-02-15 12:35:17,631 INFO] Step 50/700000; acc:  80.27; ppl:  2.30; xent: 0.84; lr: 0.00001; 7529/7340 tok/s;     87 sec
[2021-02-15 12:36:43,872 INFO] Step 100/700000; acc:  80.82; ppl:  2.25; xent: 0.81; lr: 0.00001; 7889/7668 tok/s;    173 sec
[2021-02-15 12:38:10,487 INFO] Step 150/700000; acc:  80.38; ppl:  2.30; xent: 0.83; lr: 0.00002; 7502/7334 tok/s;    260 sec
[2021-02-15 12:39:28,266 INFO] Step 200/700000; acc:  80.46; ppl:  2.28; xent: 0.83; lr: 0.00002; 8520/8321 tok/s;    337 sec

francoishernandez · February 15, 2021, 12:02pm

Yes, because when averaging all the optimizer components are removed from the checkpoint.
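
A toy sketch of why the step counter resets: the checkpoints below are plain dicts with a hypothetical "model"/"optim" layout (an assumption for illustration, not the actual checkpoint format), and the averaging keeps only the parameters.

```python
# Toy illustration: averaging keeps model parameters and drops optimizer state.
# The dict layout ("model", "optim", "step") is a simplifying assumption.

def average_checkpoints(checkpoints):
    """Average the 'model' parameters element-wise; discard optimizer state."""
    n = len(checkpoints)
    names = checkpoints[0]["model"].keys()
    averaged = {
        name: [sum(values) / n
               for values in zip(*(ckpt["model"][name] for ckpt in checkpoints))]
        for name in names
    }
    # The optimizer state (which holds the step counter) is not carried over.
    return {"model": averaged, "optim": None}


ckpt_a = {"model": {"w": [1.0, 2.0]}, "optim": {"step": 10000}}
ckpt_b = {"model": {"w": [3.0, 4.0]}, "optim": {"step": 20000}}

avg = average_checkpoints([ckpt_a, ckpt_b])

# With no optimizer state to restore, training resumes counting from step 0.
resume_step = avg["optim"]["step"] if avg["optim"] else 0
print(avg["model"]["w"], resume_step)
```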

anderleich · February 15, 2021, 12:03pm

So maybe it is not the best checkpoint to train from, since we lose the optimizer’s information?

francoishernandez · February 15, 2021, 12:05pm

It should get back on track fairly quickly.

anderleich · February 15, 2021, 12:08pm

Thanks! I really appreciate your help

anderleich · February 23, 2021, 2:35pm

Hi again,

Can we finetune with a different vocabulary? Does it randomly initialize the embeddings of the new words?

francoishernandez · February 23, 2021, 3:50pm

The case is not explicitly handled (yet) in OpenNMT-py.
Technically you could ‘update’ the vocabulary of an existing model and finetune from there. You would indeed need to initialize some parameters of the model for the new tokens.
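
To make the idea concrete, here is a minimal sketch in plain Python, not an existing OpenNMT-py feature. It assumes the embeddings are a list of rows indexed by vocab position; the function name and the uniform init range are hypothetical choices. Rows for known tokens are copied to their new positions, and rows for new tokens are randomly initialized.

```python
import random

# Hypothetical helper: remap an embedding table from an old vocab to a new one.
def extend_embeddings(old_vocab, old_embeddings, new_vocab, dim, seed=0):
    """Copy rows of known tokens; randomly initialize rows of new tokens."""
    rng = random.Random(seed)
    old_index = {tok: i for i, tok in enumerate(old_vocab)}
    new_embeddings = []
    for tok in new_vocab:
        if tok in old_index:
            # Known token: reuse the trained vector, whatever its new index is.
            new_embeddings.append(list(old_embeddings[old_index[tok]]))
        else:
            # New token: small random vector (the init range is an assumption).
            new_embeddings.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return new_embeddings


old_vocab = ["<unk>", "banana", "beach"]
old_emb = [[0.0, 0.0], [0.5, -0.5], [0.25, 0.75]]
new_vocab = ["<unk>", "beach", "banana", "kiwi"]

new_emb = extend_embeddings(old_vocab, old_emb, new_vocab, dim=2)
print(new_emb[1], new_emb[2])
```

In a real model the output layer, which is also indexed by the target vocab, would presumably need the same treatment.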

It would be nice to add this as a standalone script. Let us know if you would like to contribute and need pointers to get started.

anderleich · February 26, 2021, 9:31am

What if I rerun onmt_build_vocab and then use the train_from flag? Am I expected to get an error?

francoishernandez · February 26, 2021, 10:20am

To train a model, you need a vocab, because the parameters of the network are tied to specific input/output indices, each corresponding to a word/token. So, an existing model has a fixed vocab, in the sense that it expects a range of indices in input, and produces a range of indices in output.
You can technically pass a new vocab when using train_from. But your vocab will probably not be the same size, which will produce an error. And, if the vocab is the same size, the indices will probably not match, so the model won’t train properly. (E.g. the index for “banana” would now be the index for “beach”, and your model would have to learn everything again.)
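
A toy illustration of that mismatch, with made-up word lists (a vocab is treated here as just an ordered list of tokens):

```python
# Toy example: the same row index points to different words in two vocabs.

def build_index(tokens):
    """Map each token to its position in the vocab list."""
    return {tok: i for i, tok in enumerate(tokens)}


original_vocab = ["<unk>", "apple", "banana", "beach"]
rebuilt_vocab = ["<unk>", "apple", "beach", "banana", "boat"]

original_index = build_index(original_vocab)

# Row of the embedding matrix that the model learned for "banana".
banana_row = original_index["banana"]

# With the rebuilt vocab, that same row is looked up for a different word.
now_at_that_row = rebuilt_vocab[banana_row]

# A different vocab size also changes the expected parameter shapes.
size_mismatch = len(original_vocab) != len(rebuilt_vocab)

print(banana_row, now_at_that_row, size_mismatch)
```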

Note that build_vocab is merely a helper tool to prepare a vocab (basically a list of words/tokens), but the vocab passed to train could be built by any other tool as long as it’s in the proper format.

anderleich · February 26, 2021, 10:44am

Thanks. That really clarified things. I suspected as much, but I wanted to be sure.

When finetuning on a new domain, my approach is to use weighted corpora in training and a mix of in-domain and out-of-domain corpora for evaluation:

Training data: out-of-domain (60%) and in-domain (40%)
Evaluation data: out-of-domain (50%) and in-domain (50%)

During the training process, BLEU keeps improving on the evaluation set.

Just to see what’s going on with the test set, I periodically test the model on in-domain data. What I noticed is that the model seems to gain in-domain knowledge up to some point in the training, which improves on the baseline. However, from that point on, even though BLEU keeps improving on the development data, the model starts performing worse on the in-domain test. Any clues why I get this behaviour? Is it because I have out-of-domain data in the evaluation set? It happens with different percentages of weighted training data…

Thanks

francoishernandez · February 26, 2021, 11:05am

“use weighted corpora in training”

Do you use the weight attributes in the training configuration or do you build your dataset beforehand by aggregating in-domain/out-of-domain data?

anderleich · February 26, 2021, 11:26am

I use weights:

data:
    out-domain:
        weight: 6
    in-domain:
        weight: 4

francoishernandez · February 26, 2021, 11:33am

Ok so there is no shuffling issue.
I don’t really know what might be happening here. Maybe your in-domain test set is not representative of your in-domain train data (or the opposite). Did you try weighting your in-domain dataset even more (like out-domain 1 and in-domain 10)?
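
For intuition on what those weights mean in terms of data mix, here is a simplified sketch. It assumes weights are applied by taking that many examples from each corpus in turn, which is my reading of the weighting scheme and not the actual data loader; the corpus contents are placeholders.

```python
from itertools import cycle, islice

# Simplified model of weighted corpora: take `weight` examples from each
# corpus in turn, repeating forever. Small corpora are cycled.
def weighted_interleave(corpora, weights):
    """Yield (corpus_name, example) pairs according to integer weights."""
    iterators = {name: cycle(examples) for name, examples in corpora.items()}
    while True:
        for name, weight in weights.items():
            for _ in range(weight):
                yield name, next(iterators[name])


def in_domain_share(weights, in_domain="in-domain"):
    """Fraction of training examples drawn from the in-domain corpus."""
    return weights[in_domain] / sum(weights.values())


corpora = {"out-domain": ["o1", "o2", "o3"], "in-domain": ["i1", "i2"]}
weights = {"out-domain": 6, "in-domain": 4}

# One full round is sum(weights) = 10 examples.
first_round = [name for name, _ in islice(weighted_interleave(corpora, weights), 10)]

share_6_4 = in_domain_share({"out-domain": 6, "in-domain": 4})
share_1_10 = in_domain_share({"out-domain": 1, "in-domain": 10})
print(first_round.count("in-domain"), share_6_4, round(share_1_10, 3))
```

Under this assumption, weights of 1 and 10 would make roughly 91% of the training examples in-domain, compared with 40% for weights of 6 and 4.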

anderleich · February 26, 2021, 11:47am

My in-domain test is a subset of the in-domain training data. I should try 1-10 weights to see its behaviour.

Thanks

anderleich · March 24, 2021, 10:25am

Hi @francoishernandez,

You mentioned it would be nice to have a standalone script for adding new tokens to the vocabulary. Do you mean something similar to build_vocab.py? Apply transforms to the new corpus and update the src and tgt counters and the corresponding vocabulary files?

I’m interested in implementing this feature.

EDIT: I’ve made minimal changes to build_vocab.py so that it accepts a new argument, --update_vocab, to update existing vocabulary files with new corpora. I still need to edit the training scripts.
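
To illustrate the general idea (this is only a sketch, not the actual --update_vocab change), the snippet below appends unseen tokens to the end of an existing token list so that existing indices stay stable. The function name is hypothetical, and whitespace splitting stands in for the real transforms.

```python
from collections import Counter

# Hypothetical helper: extend a vocab with tokens from a new corpus while
# keeping every existing token at its original index.
def update_vocab(existing_tokens, new_corpus_lines):
    """Return the existing tokens followed by unseen tokens, most frequent first."""
    counts = Counter(tok for line in new_corpus_lines for tok in line.split())
    known = set(existing_tokens)
    new_tokens = [tok for tok, _ in counts.most_common() if tok not in known]
    return list(existing_tokens) + new_tokens


existing = ["the", "banana"]
new_corpus = ["the beach", "the boat beach"]

updated = update_vocab(existing, new_corpus)
print(updated)
```

Appending instead of re-sorting means the trained parameters for the old tokens would still line up with their indices, and only the new rows would need initializing.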

Thanks