I am having trouble recreating domain adaptation with the new version of OpenNMT-tf (I am using OpenNMT-tf==2.8.0).
Specifically, I have found these steps confusing:
- The vocabulary update step
- Starting adaptation training with a merged vocab model
What I’ve done so far
Train a general-domain model. I trained a good general-domain Es->Pt model until convergence and kept the best (averaged) checkpoint.
Tokenise the in-domain data. I tokenised the new in-domain source and target files with my previously trained tokeniser, producing the new in-domain train, eval, and test files.
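Concretely, this step just maps each raw line to a line of space-separated tokens using the already-trained tokeniser. A minimal sketch of the file plumbing, with a stand-in `tokenize` (in reality this was my trained tokeniser, not a whitespace split):

```python
def tokenize(line):
    # Stand-in for the real trained tokeniser: lowercase + whitespace split.
    return line.lower().split()

def tokenize_file(lines):
    """Return the tokenised form of each raw line, tokens joined with spaces."""
    return [" ".join(tokenize(line)) for line in lines]

raw = ["El gato come pescado.", "El perro come carne."]
print(tokenize_file(raw))  # one tokenised line per input line
```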
Generate in-domain vocab files. Next, I tried (is this correct?) to generate new vocabulary files for the in-domain data:
```bash
# build vocab for in-domain Spanish file
onmt-build-vocab \
  --tokenizer_config tok_config.es-pt.es.yml \
  --save_vocab in-domain.train.es-pt.es.vocab \
  in-domain.train.es-pt.es

# build vocab for in-domain Portuguese file
onmt-build-vocab \
  --tokenizer_config tok_config.es-pt.pt.yml \
  --save_vocab in-domain.train.es-pt.pt.vocab \
  in-domain.train.es-pt.pt
```
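For context, my understanding is that this command essentially counts token frequencies over the tokenised file and writes one token per line, most frequent first. A plain-Python sketch of that idea (my own sketch, not the actual OpenNMT implementation, and ignoring special tokens):

```python
from collections import Counter

def build_vocab(tokenized_lines):
    """Count token frequencies; return tokens sorted by descending frequency."""
    counts = Counter()
    for line in tokenized_lines:
        counts.update(line.split())
    # One token per line, most frequent first (ties broken alphabetically).
    return [tok for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

corpus = [
    "el gato come pescado",
    "el perro come carne",
]
print(build_vocab(corpus))  # most frequent tokens first
```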
Update the model vocab. I then tried to update the general-domain model’s vocabulary. I couldn’t actually find an example of how to do this, so I tried different versions of the command and got this one to run (I assumed the old source and target vocabs were inferred from the config file, and provided my new vocab files as `--src_vocab` and `--tgt_vocab`):
```bash
onmt-main --model_type Transformer \
  --config '../config.yml' --auto_config \
  update_vocab \
  --src_vocab 'in-domain.es-pt.es.vocab' \
  --tgt_vocab 'in-domain.es-pt.pt.vocab' \
  --output_dir 'merged/'
```
This command did create the `merged/` directory with a checkpoint (numbered `ckpt-1`?), but it did not generate any merged vocab files. So, I think I have made a mistake somewhere.
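What I expected the vocab update to do, conceptually, is keep every old token at its existing index (so the corresponding embedding rows in the checkpoint can be reused) and append only the genuinely new in-domain tokens. A plain-Python sketch of that expectation (function and variable names are mine, not OpenNMT’s):

```python
def merge_vocab(old_vocab, new_vocab):
    """Keep old tokens at their existing indices; append unseen tokens at the end."""
    seen = set(old_vocab)
    merged = list(old_vocab)
    for tok in new_vocab:
        if tok not in seen:
            merged.append(tok)
            seen.add(tok)
    return merged

old = ["el", "gato", "come"]
new = ["el", "perro", "come", "carne"]
print(merge_vocab(old, new))  # old indices preserved, new tokens appended
```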
Start adaptation. I made a new config file for adaptation, specifying the in-domain vocabs since I didn’t have any merged vocab files:
```yaml
model_dir: merged/

data:
  train_features_file: in-domain.train.es-pt.es.token
  train_labels_file: in-domain.train.es-pt.pt.token
  eval_features_file: in-domain.dev.es-pt.es.token
  eval_labels_file: in-domain.dev.es-pt.pt.token
  source_vocabulary: in-domain.es-pt.es.vocab
  target_vocabulary: in-domain.es-pt.pt.vocab

train:
  save_checkpoints_steps: 1000
  keep_checkpoint_max: 50

eval:
  steps: 1000
  save_eval_predictions: True
  external_evaluators: bleu

infer:
  batch_size: 32
```
I then launched training as normal. Although this runs, I think it actually starts a training run from scratch, which misses the point of doing domain adaptation.
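As a sanity check on whatever vocab files I end up pointing the adaptation config at, I figured the old vocab should appear as an exact prefix of the merged one; otherwise the restored embedding rows would be misaligned and training would effectively start over. A small check I wrote, assuming plain one-token-per-line vocab files already loaded into lists:

```python
def old_vocab_is_prefix(old_vocab, merged_vocab):
    """True if every old token sits at the same index in the merged vocab."""
    return len(merged_vocab) >= len(old_vocab) and merged_vocab[:len(old_vocab)] == old_vocab

old = ["el", "gato", "come"]
merged = ["el", "gato", "come", "perro"]
in_domain_only = ["perro", "come", "el"]
print(old_vocab_is_prefix(old, merged))          # True: embeddings line up
print(old_vocab_is_prefix(old, in_domain_only))  # False: indices misaligned
```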
Thanks, any advice you might have would be great.