Domain Adaptation with OpenNMT-tf

iriniz · November 27, 2020, 1:26pm

Hello,

I have a pre-trained model trained on a big corpus and I tried to continue the training with the new data from a smaller corpus in order to achieve domain adaptation.

I am a little confused about the steps I followed, and I would like to be sure that I haven’t made any mistakes.

First of all, I built the new vocab of the new smaller corpus with the following command:

CUDA_VISIBLE_DEVICES=0 onmt-build-vocab --tokenizer_config path/tok.yml --size 30000 --save_vocab path/src-sp-vocab path/src-train.txt
CUDA_VISIBLE_DEVICES=0 onmt-build-vocab --tokenizer_config path/tok.yml --size 30000 --save_vocab path/tgt-sp-vocab path/tgt-train.txt

Then, I used the following command in order to update the vocab:

CUDA_VISIBLE_DEVICES=0 onmt-main --model_type Transformer --config path/data.yml --auto_config update_vocab --output_dir path/output/ --src_vocab=path/output/src-sp-vocab --tgt_vocab=path/output/tgt-sp-vocab

The output of the command was the last checkpoint of the initial pre-trained model.

Then, I replaced the model_dir in the config file with the new output folder that contained the last checkpoint of the initial pre-trained model and I used the following command in order to start the training:

CUDA_VISIBLE_DEVICES=0 onmt-main --model_type Transformer --config path/data.yml --auto_config --checkpoint_path path/output/ckpt-20000.index --mixed_precision train --with_eval

As you may see, I also added the new checkpoint path (which is the same path used for the model_dir).

Finally, the training started and I got the following statement:

“You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.”

However, I am not sure whether I continued training the initial model or started a totally new training process. How can I be sure about it?

Thank you a lot in advance.

guillaumekln · November 27, 2020, 1:45pm

Hi,

The procedure looks mostly correct, but I would suggest that you read:

In this topic we propose an easier procedure that does not require updating the vocabularies:

1. Generate subword models and vocabularies from the full data
2. Train on generic data
3. Continue training on in-domain data

For 3., you might also want to use weighted datasets to continue training on a mix of generic and in-domain data to avoid forgetting how to translate generic sentences.
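
To make the idea of weighted datasets more concrete, here is a minimal Python sketch of how examples could be drawn from a generic and an in-domain corpus according to fixed weights. This is only an illustration of the sampling idea, not how OpenNMT-tf implements it: the function names (weighted_mix, cycle_forever), the 0.7/0.3 weights, and the toy corpora are all assumptions made for the example.

```python
import random
from itertools import islice


def cycle_forever(items):
    """Repeat a dataset indefinitely so a small corpus can be revisited."""
    while True:
        for item in items:
            yield item


def weighted_mix(datasets, weights, seed=0):
    """Yield examples forever, choosing the source dataset by weight.

    Hypothetical helper for illustration only. The weights are relative
    sampling probabilities, one per dataset.
    """
    if len(datasets) != len(weights):
        raise ValueError("need exactly one weight per dataset")
    if any(len(d) == 0 for d in datasets):
        raise ValueError("datasets must not be empty")
    rng = random.Random(seed)
    iterators = [cycle_forever(d) for d in datasets]
    indices = range(len(datasets))
    while True:
        # Pick which corpus the next example comes from.
        i = rng.choices(indices, weights=weights, k=1)[0]
        yield next(iterators[i])


# Toy corpora: a large generic one and a much smaller in-domain one.
generic = [("generic", i) for i in range(1000)]
in_domain = [("in_domain", i) for i in range(50)]

sample = list(islice(weighted_mix([generic, in_domain], [0.7, 0.3]), 10000))
share = sum(1 for name, _ in sample if name == "in_domain") / len(sample)
print(f"in-domain share of sampled examples: {share:.2f}")
```

With weights like these, the in-domain corpus is seen far more often than its size alone would allow, while generic sentences still make up most of what the model sees, which is what helps against forgetting. The right ratio depends on the data and is something to tune.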

iriniz · December 1, 2020, 1:30pm

Thank you a lot for your prompt reply! I’ll give it a try.