After training a full Transformer model (on about 6M lines) with OpenNMT-py, I would like to finetune it on small individual corpora of 2,000 to 10,000 lines each. Which settings do I need to change? I’ve already finetuned some models, but would like more advice on the parameters.
Currently, I’m using the train.py command shown at the end of this post. There seems to be a lot of catastrophic forgetting past ~1000-2000 steps (probably due to the small corpus), so I generally train for 2000 steps, saving every 250 steps. In the original Transformer model, the warmup is set to 8000 steps; is there a need to change it for finetuning?
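On the warmup question, it may help to look at the shape of the schedule. The sketch below uses the noam formula from the original Transformer paper, lr = learning_rate * d_model^-0.5 * min(step^-0.5, step * warmup^-1.5), which I am assuming is what `-decay_method noam` computes. `noam_lr` is a hypothetical helper written for illustration, not an OpenNMT-py function. Whether the step counter restarts at 0 or continues from the checkpoint when using `-train_from` is also an assumption to verify, since it decides which part of the curve applies.

```python
def noam_lr(step, warmup_steps, model_dim=512, base_lr=2.0):
    """Noam schedule as given in the Transformer paper (assumed to match
    -decay_method noam). Hypothetical helper, not part of OpenNMT-py."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return base_lr * model_dim ** -0.5 * min(step ** -0.5,
                                             step * warmup_steps ** -1.5)


if __name__ == "__main__":
    # Compare the two warmup values mentioned in this post.
    for warmup in (8000, 100):
        peak = noam_lr(warmup, warmup)        # the rate peaks at step == warmup
        at_1000 = noam_lr(1000, warmup)       # rate after 1000 finetuning steps
        print(f"warmup={warmup}: peak={peak:.6f}, lr at step 1000={at_1000:.6f}")
```

If the counter does start from 0, a warmup of 100 peaks at step 100 at a rate about nine times higher than the peak reached with a warmup of 8000, which may matter for forgetting on a small corpus.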
In other ML settings, finetuning is generally done for 1-2 epochs. With a batch size of 4096, is there a rule of thumb for converting epochs into the number of training steps on 1 GPU?
Is 1 step = batch_size * GPUs * accum_count? If so, is 1 step with (4096, 1, 2) on a corpus of 2000 lines/examples (averaging 50 tokens each) actually ~1 epoch?
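A quick way to check this arithmetic is the small Python sketch below. It assumes that with `-batch_type tokens` the batch size counts tokens on one side of each sentence pair, that every batch is completely full (no padding waste), and that one optimizer step consumes batch_size * GPUs * accum_count tokens. The function names are made up for illustration and are not part of OpenNMT-py.

```python
import math


def tokens_per_step(batch_size, num_gpus, accum_count):
    """Tokens consumed by one optimizer step (assumes full token batches)."""
    return batch_size * num_gpus * accum_count


def steps_per_epoch(num_lines, avg_tokens_per_line,
                    batch_size, num_gpus, accum_count):
    """Approximate number of steps needed to see the whole corpus once."""
    corpus_tokens = num_lines * avg_tokens_per_line
    per_step = tokens_per_step(batch_size, num_gpus, accum_count)
    return math.ceil(corpus_tokens / per_step)


if __name__ == "__main__":
    print(tokens_per_step(4096, 1, 2))             # 8192
    print(steps_per_epoch(2000, 50, 4096, 1, 2))   # 13
    print(steps_per_epoch(10000, 50, 4096, 1, 2))  # 62
```

Under these assumptions one step covers about 8,192 tokens, so a 2000-line corpus of roughly 100,000 tokens would take about 12-13 steps per epoch rather than one, and 1000 steps would be on the order of 80 epochs.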
python3 train.py -data data/FineTuned -save_model model/FineTuned -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 1000 -max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 100 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 250 -save_checkpoint_steps 10000 -world_size 1 -gpu_ranks 0 -tensorboard_log_dir logs/ -tensorboard -train_from model/v2