Domain adaptation procedure (ONMT-tf 2.0)

Hi,

I am having trouble recreating domain adaptation with the new version of ONMT-tf (I am using OpenNMT-tf==2.8.0).

Specifically, I have found these steps confusing:

  • The vocabulary update step
  • Starting adaptation training with a merged vocab model

What I’ve done so far

Train a general domain model. I trained a good general-domain Es->Pt model until convergence and kept the averaged best checkpoint in model_dir.
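For reference, I produced the averaged checkpoint with the average_checkpoints run type, roughly like this (the output directory and checkpoint count are just what I happened to use):

# average the last training checkpoints into model_dir/avg
onmt-main --model_type Transformer \
--config config.yml --auto_config \
average_checkpoints --output_dir model_dir/avg --max_count 8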

Tokenise in-domain data. I tokenised the new in-domain source and target files using my previously trained tokeniser, and generated my new in-domain train, eval, and test files this way.
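Concretely, I ran the bundled tokenization script over each file, roughly like this (file names follow my own naming scheme; the script reads stdin and writes stdout):

# tokenise the in-domain Spanish source side with the existing tokeniser configuration
onmt-tokenize-text --tokenizer_config tok_config.es-pt.es.yml \
< in-domain.train.es-pt.es > in-domain.train.es-pt.es.token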

Generate in-domain vocab files. Next, I tried (is this correct?) to generate new vocabulary files for the in-domain data:

# build vocab for the in-domain Spanish file
onmt-build-vocab \
--tokenizer_config tok_config.es-pt.es.yml \
--save_vocab in-domain.train.es-pt.es.vocab \
in-domain.train.es-pt.es

# build vocab for the in-domain Portuguese file
onmt-build-vocab \
--tokenizer_config tok_config.es-pt.pt.yml \
--save_vocab in-domain.train.es-pt.pt.vocab \
in-domain.train.es-pt.pt

Update model vocab. I next tried to update the general-domain model’s vocabulary. I couldn’t actually find an example of how to do this, so I tried different versions of the command and got the following to run (I assumed the old source and target vocabs were inferred from the config file, and provided my new vocab files as --src_vocab and --tgt_vocab):

onmt-main --model_type Transformer \
--config '../config.yml' --auto_config \
update_vocab \
--src_vocab='in-domain.es-pt.es.vocab'  \
--tgt_vocab='in-domain.es-pt.pt.vocab'  \
--output_dir 'merged/'

This command did create the merged/ directory with a checkpoint (oddly numbered ckpt-1?), but it did not generate any merged vocab files, so I think I have made a mistake somewhere.

Start adaptation. I made a new config file for adaptation, specifying the in-domain vocab since I didn’t have any merged vocab files:

model_dir: merged/

data:
  train_features_file: in-domain.train.es-pt.es.token
  train_labels_file: in-domain.train.es-pt.pt.token
  eval_features_file: in-domain.dev.es-pt.es.token
  eval_labels_file: in-domain.dev.es-pt.pt.token
  source_vocabulary: in-domain.es-pt.es.vocab
  target_vocabulary: in-domain.es-pt.pt.vocab

train:
  save_checkpoints_steps: 1000
  keep_checkpoint_max: 50

eval:
  steps: 1000
  save_eval_predictions: True
  external_evaluators: bleu

infer:
  batch_size: 32

I then launched training as normal. Although this runs, I think it actually starts a training run from scratch, which misses the point of doing domain adaptation.
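For completeness, the launch command was essentially the following (the adaptation config file name is a placeholder for mine):

# launch adaptation training with periodic evaluation
onmt-main --model_type Transformer \
--config adapt_config.yml --auto_config \
train --with_eval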

Thanks, any advice you might have would be great.

Cheers,
Natasha

Hi Natasha,

Thanks for providing all the details.

First, please note that the new version no longer implements vocabulary merging; we found it to be confusing and error-prone. Instead, the vocabularies that you pass to update_vocab are the actual vocabularies that will be used by the updated checkpoint. So if you want to replicate the previous merging logic, you would need to merge the vocabularies with the tool and approach of your choice.
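For example, a minimal sketch of the old merging behaviour would be to keep the parent vocabulary order and append the new in-domain tokens. Here general.es-pt.es.vocab stands for your existing parent source vocabulary (adapt the names to your files):

# keep the parent vocabulary order and append in-domain tokens that are not already present
cat general.es-pt.es.vocab > merged.es-pt.es.vocab
grep -Fxv -f general.es-pt.es.vocab in-domain.train.es-pt.es.vocab >> merged.es-pt.es.vocab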

In general, it is easier to just train a subword model on generic data and reuse the same model for domain adaptation. This saves you from all this vocabulary manipulation.

It seems odd that the numbering starts again at 1. When running update_vocab, the current model directory contained the averaged checkpoint, right? Was this checkpoint averaged with OpenNMT-tf 1.x or 2.x?

What is the training loss reported at the start of the domain adaptation?

Hi Guillaume,

Thanks for your quick response 🙂

So the idea is that the subword tokens learned from the generic data should be enough to cover any vocabulary changes in the in-domain data? That sounds fine to me, and I am happy to be able to skip the vocab merging step.

Could you please confirm: does this mean that in ONMT-tf 2.0 I can skip the onmt-build-vocab step for the in-domain source/target data, skip the onmt-main ... update_vocab step, and go straight to starting adaptation on an averaged checkpoint with just a new config.yml that points to the new train and eval features/labels files?
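For concreteness, I am imagining an adaptation config along these lines, reusing the parent vocabularies (general.es-pt.*.vocab and adapt_model/ are placeholders for my actual file and directory names):

model_dir: adapt_model/   # contains a copy of the averaged parent checkpoint

data:
  train_features_file: in-domain.train.es-pt.es.token
  train_labels_file: in-domain.train.es-pt.pt.token
  eval_features_file: in-domain.dev.es-pt.es.token
  eval_labels_file: in-domain.dev.es-pt.pt.token
  source_vocabulary: general.es-pt.es.vocab
  target_vocabulary: general.es-pt.pt.vocab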

When running update_vocab, the current model directory contained the averaged checkpoint, right? Was this checkpoint averaged with OpenNMT-tf 1.x or 2.x?

Correct, the current model dir contained the averaged checkpoint. The checkpoint was averaged with OpenNMT-tf 2.x.

What is the training loss reported at the start of the domain adaptation?

The training loss was high (~9), similar to the values I saw when starting the general-domain model, which led me to believe the model was being trained from scratch rather than adapted.

Thanks again,
Natasha

Yes, in this case you can directly start training on new data.
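One way to make sure training resumes from the parent weights rather than from a fresh initialisation is to seed the adaptation model directory with the averaged checkpoint before launching training, for example (paths are illustrative):

# copy the averaged parent checkpoint into the adaptation model directory
mkdir adapt_model
cp model_dir/avg/checkpoint model_dir/avg/ckpt-* adapt_model/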

Can you generate the vocabulary from both the parent data and in-domain data before training the parent model?

Sure, you can do that.
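For example (file names are illustrative), concatenate the two corpora and build a single vocabulary from the result, then do the same for the target side:

# build one source vocabulary over both the generic and in-domain training data
cat generic.train.es-pt.es in-domain.train.es-pt.es > combined.train.es-pt.es
onmt-build-vocab \
--tokenizer_config tok_config.es-pt.es.yml \
--save_vocab combined.es-pt.es.vocab \
combined.train.es-pt.es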

I came back to this task after a while on other things. A very useful discussion, thanks 🙂