Hi,
I am having trouble reproducing the domain adaptation workflow with the new version of OpenNMT-tf (I am using `OpenNMT-tf==2.8.0`).
Specifically, I have found these steps confusing:
- The vocabulary update step
- Starting adaptation training with a merged vocab model
**What I've done so far**
**Train a general domain model.** I trained a general-domain Es->Pt model until convergence and kept the best (averaged) checkpoint in `model_dir`.
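For context, the training and averaging commands were along these lines (the paths here are placeholders, not my exact ones):

```bash
# train the general-domain model until convergence
onmt-main --model_type Transformer \
    --config config.yml --auto_config \
    train --with_eval

# average the last checkpoints into a single one
onmt-main --model_type Transformer \
    --config config.yml --auto_config \
    average_checkpoints --output_dir model_dir/avg --max_count 8
```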
**Tokenise in-domain data.** I tokenised the new in-domain source and target files with my previously trained tokeniser, and produced the in-domain train, eval, and test files this way.
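Concretely, I believe this is equivalent to running the `onmt-tokenize-text` helper bundled with OpenNMT-tf (assuming it reads stdin and writes stdout with the same tokenizer config):

```bash
# tokenise the in-domain Spanish side with the existing tokeniser
onmt-tokenize-text --tokenizer_config tok_config.es-pt.es.yml \
    < in-domain.train.es-pt.es > in-domain.train.es-pt.es.token

# same for the Portuguese side
onmt-tokenize-text --tokenizer_config tok_config.es-pt.pt.yml \
    < in-domain.train.es-pt.pt > in-domain.train.es-pt.pt.token
```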
**Generate in-domain vocab files.** Next, I tried (is this correct?) to generate new vocabulary files for the in-domain data:
```bash
# build vocab for the in-domain Spanish file
onmt-build-vocab \
    --tokenizer_config tok_config.es-pt.es.yml \
    --save_vocab in-domain.train.es-pt.es.vocab \
    in-domain.train.es-pt.es

# build vocab for the in-domain Portuguese file
onmt-build-vocab \
    --tokenizer_config tok_config.es-pt.pt.yml \
    --save_vocab in-domain.train.es-pt.pt.vocab \
    in-domain.train.es-pt.pt
```
**Update model vocab.** Next, I tried to update the general-domain model's vocabulary. I couldn't find an example of how to do this, so I tried different versions of the command until this one ran (I assumed the old source and target vocabs are inferred from the config file, and passed my new vocab files as `--src_vocab` and `--tgt_vocab`):
```bash
onmt-main --model_type Transformer \
    --config ../config.yml --auto_config \
    update_vocab \
    --src_vocab in-domain.train.es-pt.es.vocab \
    --tgt_vocab in-domain.train.es-pt.pt.vocab \
    --output_dir merged/
```
This command did create the `merged/` directory containing a checkpoint (numbered `ckpt-1`?), but it did not generate any merged vocab files, so I think I have made a mistake somewhere.
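My current guess is that `update_vocab` expects to be given an already merged vocabulary rather than producing one itself. If that is right, should I first build the merged vocabs myself before running it? For example something like this, where `general.es-pt.es.vocab` and `general.es-pt.pt.vocab` stand in for my original general-domain vocab files:

```bash
# append in-domain tokens to the general vocab, dropping duplicates
# while preserving the original token order (one token per line)
cat general.es-pt.es.vocab in-domain.train.es-pt.es.vocab \
    | awk '!seen[$0]++' > merged.es-pt.es.vocab
cat general.es-pt.pt.vocab in-domain.train.es-pt.pt.vocab \
    | awk '!seen[$0]++' > merged.es-pt.pt.vocab
```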
**Start adaptation.** I made a new config file for adaptation, pointing at the in-domain vocab files since I didn't have any merged ones:
```yaml
model_dir: merged/

data:
  train_features_file: in-domain.train.es-pt.es.token
  train_labels_file: in-domain.train.es-pt.pt.token
  eval_features_file: in-domain.dev.es-pt.es.token
  eval_labels_file: in-domain.dev.es-pt.pt.token
  source_vocabulary: in-domain.train.es-pt.es.vocab
  target_vocabulary: in-domain.train.es-pt.pt.vocab

train:
  save_checkpoints_steps: 1000
  keep_checkpoint_max: 50

eval:
  steps: 1000
  save_eval_predictions: True
  external_evaluators: bleu

infer:
  batch_size: 32
```
I then launched training as normal (exact command below). Although this runs, I think what it actually does is start a training run from scratch, which misses the point of doing domain adaptation.
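For completeness, this is how I launched the adaptation run, with the config above saved as `adapt_config.yml` (a placeholder name):

```bash
onmt-main --model_type Transformer \
    --config adapt_config.yml --auto_config \
    train --with_eval
```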
Thanks, any advice you might have would be great.
Cheers,
Natasha