OpenNMT-tf procedure for fine-tuning/domain adaptation?


I have been trying to set up fine-tuning/domain adaptation for an OpenNMT-tf Transformer model. I have trained a decent general-domain base model using open source data and would now like to customise it to this IT-domain data:

98428 English IT-related sentences in indomain_train_en.txt
98428 French IT-related sentences in indomain_train_fr.txt

And eventually test the tuned model on:

2500 English IT-related sentences in indomain_test_en.txt
2500 French IT-related sentences in indomain_test_fr.txt

However, I’m having issues with updating the vocabulary and kicking off the tuning run. I have already trained a SentencePiece model on the original general-domain training data, and when I try to update the vocabulary, there doesn’t seem to be any output.

My attempt (after the general base model has converged) looks something like this:

Average checkpoints

onmt-average-checkpoints --model_dir=experiments/transformer/ --output_dir=experiments/transformer/avg --max_count=5

Apply trained SentencePiece model to in-domain training data

spm_encode --model=sp.model < indomain_train_en.txt > indomain_train_en.token

spm_encode --model=sp.model < indomain_train_fr.txt > indomain_train_fr.token

Update vocabulary

onmt-update-vocab --model_dir=experiments/transformer/avg/ --output_dir=experiments/transformer/avg/finetuned/ --src_vocab=data/general.vocab --tgt_vocab=finetuned.vocab

However, nothing shows up in experiments/transformer/avg/finetuned/, even though the docs describe it as “The output directory where the updated checkpoint will be saved.” No finetuned.vocab file is created either. I would have expected an updated checkpoint file and the new vocab file to appear there.

Perhaps I’m misunderstanding something… some questions:

  • How can I update the vocabulary successfully? What am I doing wrong here?
  • What exactly is happening internally when the vocabulary is updated?
  • Am I meant to train a new SentencePiece model for new data that I want to adapt the model to? I would have thought that this model should remain the same as the one trained from the general/base data.

And maybe most importantly:

  • How can I kick off the training once the vocabulary is updated? Is the correct command something like this:
onmt-main train_and_eval --model_type Transformer --auto_config --config config.yml --checkpoint_path model.ckpt-50000

where some lines in the config YAML file should be updated to contain the tuning/in-domain data and vocab files like so:

model_dir: experiments/transformer/

data:
  train_features_file: indomain_train_en.token
  train_labels_file: indomain_train_fr.token
  eval_features_file: indomain_test_en.token
  eval_labels_file: indomain_test_fr.token
  source_words_vocabulary: experiments/transformer/finetuned/finetuned.vocab
  target_words_vocabulary: experiments/transformer/finetuned/finetuned.vocab

Thanks a lot in advance, this has been puzzling me.


You are missing some options. You should provide the current and new vocabularies:

  --src_vocab SRC_VOCAB
                        Path to the current source vocabulary. (default: None)
  --tgt_vocab TGT_VOCAB
                        Path to the current target vocabulary. (default: None)
  --new_src_vocab NEW_SRC_VOCAB
                        Path to the new source vocabulary. (default: None)
  --new_tgt_vocab NEW_TGT_VOCAB
                        Path to the new target vocabulary. (default: None)

Some model weights depend on the vocabulary: the embeddings and the softmax weights. The script resizes these matrices to the new vocabulary size and copies the learned representations of words that are still present in the new vocabulary.
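To make the resizing step concrete, here is a minimal pure-Python sketch of the idea. It is not OpenNMT-tf's actual implementation: the function name remap_rows is hypothetical, the toy vocabularies and vectors are made up, and initialising the rows of new words to zeros is an assumption made purely for illustration (the real script may initialise them differently).

```python
def remap_rows(old_vocab, new_vocab, old_matrix, dim):
    """Build a weight matrix for new_vocab, reusing rows learned for old_vocab.

    old_matrix[i] is the learned vector (a list of floats) for old_vocab[i].
    """
    old_index = {word: i for i, word in enumerate(old_vocab)}
    new_matrix = []
    for word in new_vocab:
        if word in old_index:
            # Word is still present: copy its learned representation.
            new_matrix.append(list(old_matrix[old_index[word]]))
        else:
            # Word is new: zero initialisation is an assumption of this sketch.
            new_matrix.append([0.0] * dim)
    return new_matrix


# Toy data, invented for illustration only.
old_vocab = ["<blank>", "the", "cat"]
new_vocab = ["<blank>", "the", "server", "cat"]
old_matrix = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]

new_matrix = remap_rows(old_vocab, new_vocab, old_matrix, dim=2)
print(len(new_matrix))  # one row per word in the new vocabulary
print(new_matrix[3])    # the row for "cat", copied from its old position
```

The same remapping would apply to both the embedding matrix and the softmax weights, since each has one row (or column) per vocabulary entry.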

You should keep the same SentencePiece model.

There are two ways to start the training:

  • “simple” mode: just change model_dir to the new checkpoint directory and rerun the same command you used for the initial training
  • “expert” mode: set model_dir to a new directory and set --checkpoint_path to the finetuned checkpoint. This will start a new training (with new optimizer settings and schedules) but load model weights from the checkpoint. This should only be used if you know precisely what you want to do (e.g. change the learning rate, etc.)
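As a rough sketch of how the two modes differ on the command line, the snippet below builds both invocations from the flags already used in this thread. The helper train_command is hypothetical, and the checkpoint path is only a placeholder; substitute the checkpoint that onmt-update-vocab actually wrote.

```python
def train_command(config_path, checkpoint_path=None):
    """Return the onmt-main argument list for a training run."""
    args = [
        "onmt-main", "train_and_eval",
        "--model_type", "Transformer",
        "--auto_config",
        "--config", config_path,
    ]
    if checkpoint_path is not None:
        # "expert" mode: load weights from this checkpoint into a fresh run.
        args += ["--checkpoint_path", checkpoint_path]
    return args


# "simple" mode: model_dir in the config points at the updated checkpoint
# directory, and the command is otherwise unchanged.
simple = train_command("config.yml")

# "expert" mode: model_dir in the config points at a new, empty directory.
# The checkpoint path below is a placeholder, not a verified file name.
expert = train_command(
    "config.yml",
    "experiments/transformer/avg/finetuned/model.ckpt-50000",
)

print(" ".join(simple))
print(" ".join(expert))
```

Either list could be passed to subprocess.run(args, check=True) to launch the run from a script.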

Thanks for your response. In case others were confused - to generate the new source and target vocabularies, you have to run:

spm_encode --generate_vocabulary --model sp.model < indomain_train_en.txt > indomain_train_en.vocab

Do this for both the source and target text files. You can then run the onmt-update-vocab command as I was trying to above, passing the new files as the new source and target vocabularies.

I’ve run into another issue - I then tried to start the tuning run, by changing my config.yml file to include the updated model location and the new train/eval data:

model_dir: experiments/transformer/avg/updated

data:
  train_features_file: indomain_train_en.txt
  train_labels_file: indomain_train_fr.txt
  eval_features_file: indomain_eval_en.txt
  eval_labels_file: indomain_eval_fr.txt
  source_words_vocabulary: indomain_train_en.vocab
  target_words_vocabulary: indomain_train_fr.vocab

And then I rerun the same command I used before to launch training:

CUDA_VISIBLE_DEVICES=0,1,2 onmt-main train_and_eval --model_type Transformer --config config_tuning.yml --auto_config --num_gpus 3

It initially looks like this works, but I then get a long error message, starting with:

WARNING:tensorflow:You provided a model configuration but a checkpoint already exists. The model configuration must define the same model as the one used for the initial training. However, you can change non structural values like dropout.

And ending with:

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

tensor_name = optim/learning_rate; expected dtype float does not equal original dtype double
[[node save/RestoreV2 (defined at /usr/local/lib/python3.5/dist-packages/opennmt/ = RestoreV2[dtypes=[DT_INT64, DT_FLOAT, DT_INT64, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Any idea what might be happening here? The model type is the same (Transformer) and all I changed in the config was the updated model repo, the training/eval datasets, and the vocab files.

Thanks again,

What is your OpenNMT-tf version? I think this issue was fixed recently.

I am running OpenNMT-tf==1.15.0 and tensorflow-gpu==1.12.0, building from the Docker image tensorflow/tensorflow:latest-gpu-py3.

Potentially related: when making the new vocabularies, is it necessary to manually process the .vocab files that SentencePiece outputs? I see in one of your scripts that you processed some original (not tuning-specific) .vocab files like this:

# We keep the first field of the vocab file generated by SentencePiece and remove the first line <unk>
cut -f 1 wmt$sl$tl.vocab | tail -n +2 > data/wmt$sl$tl.vocab.tmp

# we add the <blank> word in first position, needed for OpenNMT-TF
sed -i '1i<blank>' data/wmt$sl$tl.vocab.tmp

# Last tweak we replace the empty line supposed to be the "tab" character (removed by the cut above)
perl -pe '$/=""; s/\n\n/\n\t\n/;' data/wmt$sl$tl.vocab.tmp > data/wmt$sl$tl.vocab
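For readers who prefer Python, the three shell steps above can be sketched as follows. This only mirrors what the quoted script does and makes no claim about whether the processing is required for tuning vocabularies. The function name convert_sentencepiece_vocab is hypothetical, and the sample lines (including the scores) are invented for illustration.

```python
def convert_sentencepiece_vocab(lines):
    """Mirror the shell steps above on an iterable of SentencePiece .vocab lines."""
    # Keep only the first tab-separated field of each line (like `cut -f 1`).
    tokens = [line.rstrip("\n").split("\t")[0] for line in lines]
    # Drop the first entry, <unk> (like `tail -n +2`).
    tokens = tokens[1:]
    # An empty token is what is left of the tab piece after the split; restore it.
    tokens = [token if token != "" else "\t" for token in tokens]
    # Put <blank> in first position.
    return ["<blank>"] + tokens


# Invented sample; \u2581 is SentencePiece's word-boundary marker.
sample = [
    "<unk>\t0\n",
    "<s>\t0\n",
    "</s>\t0\n",
    "\u2581the\t-3.2\n",
    "\t\t-9.1\n",
]

converted = convert_sentencepiece_vocab(sample)
for token in converted:
    print(repr(token))
```

Working on a list of lines rather than a file path keeps the sketch easy to check by hand; with a real file, the lines could come from open(path, encoding="utf-8").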

For this one, should I just put the old vocab in my config file or should it be the new vocab? Thanks

You should set the vocabulary that is compatible with your model. This option simply loads weights without converting or resizing anything.