General Parameters for Fine-Tuning

After training a full Transformer model (about 6M lines) with OpenNMT-py, what settings would I need to change in order to fine-tune it on particularly small individual corpora ranging from 2,000 to 10,000 lines? I’ve already fine-tuned some models, but would like more advice on the parameters needed.

Currently, I’m using the command below. There seems to be a lot of catastrophic forgetting past ~1000-2000 steps (probably due to the small corpus), so I generally train for 2000 steps, saving a checkpoint every 250 steps. In the original Transformer model, warmup_steps is set to 8000; would there be a need to change it?
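(For reference, my understanding of the noam schedule is that the learning rate scales roughly as learning_rate * rnn_size^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), peaking at step = warmup_steps, so a smaller warmup_steps value means the learning rate peaks earlier and at a higher value.)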

Generally, in other ML models, fine-tuning is done for 1-2 epochs. With a batch size of 4096, is there some rule-of-thumb conversion for how many steps would be best on 1 GPU?

Is it batch_size * GPUs * accum_count = tokens per step? If so, is 1 step of (4096, 1, 2) on a 2000-line corpus (with an average of 50 tokens per line) actually ~1 epoch? (A rough calculation is sketched after the command below.)

python3 train.py -data data/FineTuned -save_model model/FineTuned \
    -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
    -encoder_type transformer -decoder_type transformer -position_encoding \
    -train_steps 1000 -max_generator_batches 2 -dropout 0.1 \
    -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
    -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 100 -learning_rate 2 \
    -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 \
    -valid_steps 250 -save_checkpoint_steps 10000 \
    -world_size 1 -gpu_ranks 0 \
    -tensorboard_log_dir logs/ -tensorboard \
    -train_from model/v2
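
To make the rule-of-thumb question above concrete, here is a rough back-of-the-envelope sketch (assuming batch_size is counted in tokens, as -batch_type tokens implies, and using the corpus figures mentioned earlier):

# Rough steps-per-epoch estimate for token batching with gradient accumulation:
# one optimizer step processes about batch_size * world_size * accum_count tokens.
batch_size, world_size, accum_count = 4096, 1, 2
tokens_per_step = batch_size * world_size * accum_count   # ~8192 tokens per step

corpus_lines, avg_tokens_per_line = 2000, 50
corpus_tokens = corpus_lines * avg_tokens_per_line         # ~100,000 tokens

steps_per_epoch = corpus_tokens / tokens_per_step
print(f"~{steps_per_epoch:.0f} steps per epoch")           # ~12 steps, not 1

Under these assumptions, one pass over the 2000-line corpus is closer to a dozen steps than to a single step, so 2000 training steps would amount to well over a hundred epochs on the in-domain data.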

The issue is probably more that you’re trying to fine-tune on a very small in-domain dataset than the specific parameters. You probably want to try some “mixed” fine-tuning (i.e. keeping your original dataset and adding your in-domain one with oversampling).
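
As a rough illustration of what that mixing could look like (the file names, sample size, and 10x oversampling factor below are placeholders, not recommendations), the mixed corpus could be prepared before running the usual preprocess.py/train.py steps:

import random

def read_pairs(src_path, tgt_path):
    # Load a parallel corpus as (source, target) line pairs so they stay aligned.
    with open(src_path, encoding="utf-8") as s, open(tgt_path, encoding="utf-8") as t:
        return list(zip(s.readlines(), t.readlines()))

def write_pairs(pairs, src_path, tgt_path):
    with open(src_path, "w", encoding="utf-8") as s, open(tgt_path, "w", encoding="utf-8") as t:
        for src, tgt in pairs:
            s.write(src)
            t.write(tgt)

random.seed(0)
general = read_pairs("general.src", "general.tgt")     # original general-domain training data
indomain = read_pairs("indomain.src", "indomain.tgt")  # small in-domain corpus

# Keep a slice of the general data and repeat the in-domain data so it is not drowned out.
mixed = random.sample(general, min(200_000, len(general))) + indomain * 10
random.shuffle(mixed)
write_pairs(mixed, "mixed.src", "mixed.tgt")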

Thank you for the reply. What if the full model already had some of the in-domain data included, but because certain terms are translated in different ways, it often gives mixed weight to those terms? Basically, I’m trying to get certain names and terms right for the in-domain data while leaving the sentence structure, etc., to the full model. Also, as more and more manual in-domain translations are done, I would like to keep fine-tuning on the new data without retraining the full model. Any suggestions?

I don’t think there is a magic solution that will accommodate your need. You don’t need to retrain the full model; you only need to retain some of its original data in the fine-tuning mix to prevent catastrophic forgetting.
See Successful Domain Adaptation with OpenNMT-py or Catastrophic Forgetting after Domain Adaption - Ideas to solve, for instance.