Hi,
I have been trying to set up fine-tuning/domain adaptation for an OpenNMT-tf Transformer model. I have trained a decent general-domain base model using open source data and would now like to customise it to this IT-domain data:
98428 English IT-related sentences in indomain_train_en.txt
98428 French IT-related sentences in indomain_train_fr.txt
And eventually test the tuned model on:
2500 English IT-related sentences in indomain_test_en.txt
2500 French IT-related sentences in indomain_test_fr.txt
However, I’m having issues with updating the vocabulary and kicking off the tuning run. I have already trained a SentencePiece model on the original general-domain training data, and when I try to update the vocabulary, there doesn’t seem to be any output.
My attempt (after general base model has converged) looks something like this:
Average checkpoints
onmt-average-checkpoints --model_dir=experiments/transformer/ --output_dir=experiments/transformer/avg --max_count=5
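For context, my mental model of checkpoint averaging (just a sketch, not the actual OpenNMT-tf implementation) is an element-wise mean over each variable of the last few checkpoints:

```python
# Sketch of what I understand checkpoint averaging to do (not the real
# OpenNMT-tf code): each variable in the averaged checkpoint is the
# element-wise mean of that variable across the last N checkpoints.
import numpy as np

def average_checkpoints(checkpoints):
    """checkpoints: list of dicts mapping variable name -> np.ndarray."""
    names = checkpoints[0].keys()
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in names
    }

# Toy example: two "checkpoints" holding a single 2x2 weight matrix.
ckpt_a = {"w": np.array([[1.0, 2.0], [3.0, 4.0]])}
ckpt_b = {"w": np.array([[3.0, 4.0], [5.0, 6.0]])}
avg = average_checkpoints([ckpt_a, ckpt_b])
print(avg["w"])  # [[2. 3.] [4. 5.]]
```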
Apply trained SentencePiece model to in-domain training data
spm_encode --model=sp.model < indomain_train_en.txt > indomain_train_en.token
spm_encode --model=sp.model < indomain_train_fr.txt > indomain_train_fr.token
Update vocabulary
onmt-update-vocab --model_dir=experiments/transformer/avg/ --output_dir=experiments/transformer/avg/finetuned/ --src_vocab=data/general.vocab --tgt_vocab=finetuned.vocab
However, nothing shows up in experiments/transformer/avg/finetuned/, even though the docs describe --output_dir as “The output directory where the updated checkpoint will be saved.” There is also no finetuned.vocab file created. I would have expected an updated checkpoint file and the new vocab file to appear there.
Perhaps I’m misunderstanding something. A few questions:
- How can I update the vocabulary successfully? What am I doing wrong here?
- What exactly is happening internally when the vocabulary is updated?
- Am I meant to train a new SentencePiece model for new data that I want to adapt the model to? I would have thought that this model should remain the same as the one trained from the general/base data.
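On the second question, my rough mental model (a sketch with made-up helper names, not the real OpenNMT-tf code) is that updating the vocabulary remaps the checkpoint's embedding rows to the new token order: trained rows are kept for tokens shared between the two vocabularies, and rows for brand-new tokens are freshly initialized. Is that roughly right?

```python
# Sketch of what I imagine a vocabulary update does internally (my guess,
# not the actual OpenNMT-tf implementation): reuse trained embedding rows
# for tokens present in both vocabularies, randomly initialize the rest.
import numpy as np

def update_embedding(old_vocab, new_vocab, old_embedding, seed=0):
    """Remap an embedding matrix from old_vocab order to new_vocab order."""
    rng = np.random.default_rng(seed)
    dim = old_embedding.shape[1]
    old_index = {token: i for i, token in enumerate(old_vocab)}
    new_embedding = np.empty((len(new_vocab), dim))
    for i, token in enumerate(new_vocab):
        if token in old_index:
            # Shared token: carry over the trained row.
            new_embedding[i] = old_embedding[old_index[token]]
        else:
            # New token: no trained row exists, so initialize randomly.
            new_embedding[i] = rng.normal(scale=0.01, size=dim)
    return new_embedding

old_vocab = ["the", "cat", "sat"]
new_vocab = ["the", "sat", "firewall"]  # "cat" dropped, "firewall" added
old_emb = np.arange(6, dtype=float).reshape(3, 2)
new_emb = update_embedding(old_vocab, new_vocab, old_emb)
print(new_emb[0])  # row for "the" carried over: [0. 1.]
print(new_emb[1])  # row for "sat" moved to index 1: [4. 5.]
```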
And maybe most importantly:
- How can I kick off the training once the vocabulary is updated? Is the correct command something like this:
onmt-main train_and_eval --model_type Transformer --auto_config --config config.yml --checkpoint_path model.ckpt-50000
where some lines in the config YAML file should be updated to contain the tuning/in-domain data and vocab files like so:
model_dir: experiments/transformer/
data:
  train_features_file: indomain_train_en.token
  train_labels_file: indomain_train_fr.token
  eval_features_file: indomain_test_en.token
  eval_labels_file: indomain_test_fr.token
  source_words_vocabulary: experiments/transformer/avg/finetuned/finetuned.vocab
  target_words_vocabulary: experiments/transformer/avg/finetuned/finetuned.vocab
Thanks a lot in advance; this has been puzzling me.
Cheers,
Natasha