Based on the paper “Regularization techniques for fine-tuning in neural machine translation”: the idea seems sensible and empirically effective at preventing overfitting during fine-tuning on small parallel in-domain data. Is it possible/easy to test here? One caveat is that the implementation probably raises the memory requirement somewhat, e.g. the additional ‘out-of-domain’ weights now have to be kept around during fine-tuning.
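A minimal sketch of what I have in mind, assuming the regularizer is an L2 penalty pulling the fine-tuned weights back toward the out-of-domain checkpoint (function names, the `strength` value, and the PyTorch framing are mine, not from the paper). The `snapshot_params` copy is exactly the extra memory cost mentioned above: one frozen duplicate of every trainable parameter.

```python
import torch
import torch.nn as nn


def snapshot_params(model: nn.Module) -> dict:
    # Copy the out-of-domain weights once, before fine-tuning starts.
    # This frozen duplicate is the additional memory overhead.
    return {name: p.detach().clone() for name, p in model.named_parameters()}


def l2_to_out_of_domain(model: nn.Module, ood_params: dict, strength: float = 1e-3):
    # L2 penalty on the distance from the current weights to the
    # out-of-domain snapshot; 'strength' is an arbitrary placeholder value.
    penalty = sum(
        (p - ood_params[name]).pow(2).sum()
        for name, p in model.named_parameters()
        if p.requires_grad
    )
    return strength * penalty


# Hypothetical usage inside a fine-tuning loop:
#   ood_params = snapshot_params(model)            # once, before fine-tuning
#   loss = task_loss + l2_to_out_of_domain(model, ood_params)
#   loss.backward(); optimizer.step()
```

If memory is a concern, the snapshot could be kept on CPU (or in fp16) and moved per-parameter inside the penalty, trading a bit of speed for GPU memory.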