Best Practices for Fine-Tuning OpenNMT Models with Limited Data

Hi everyone,

I’ve been experimenting with OpenNMT for a translation project, and I’m facing some challenges with fine-tuning a pre-trained model using a relatively small dataset. My dataset consists of around 10,000 sentence pairs in a low-resource language, and I want to make sure I’m taking the right steps to achieve good performance without overfitting.

Here are a few specific questions I’m hoping the community can help with:

  1. Batch Size and Learning Rate: What batch size and learning rate would you recommend when fine-tuning on such a small dataset? Should I start with the default values or adjust them based on dataset size?
  2. Regularization Techniques: Are there specific regularization techniques (e.g., dropout) that work particularly well in scenarios with limited data?
  3. Preprocessing: My data is already tokenized and cleaned, but would applying additional techniques like subword tokenization (e.g., BPE) offer noticeable improvements?
  4. Evaluation Metrics: What’s the best way to monitor progress during fine-tuning? BLEU score? Loss? Or is there something else I should focus on?

I checked this thread: https://forum.opennmt.net/t/fine-tune-opennmt-model-on-domaiDevOpstraining but I haven't found a solution there. Could anyone guide me on this? If anyone has successfully fine-tuned models under similar conditions, I'd love to hear about your experiences and any tips you can share.

Thanks in advance for your help!

Here are the answers:

  1. For a small dataset, use small batches (16–32), but experiment with smaller sizes (e.g., 8 or 4) if you hit memory limits or if your model overfits. A learning rate between 0.00001 and 0.001 is a good starting point; I would test a few values (e.g., 0.00001, 0.00005, and 0.0001) and use a learning rate scheduler to adjust it dynamically during training. See the config sketch after this list.
  2. A dropout rate of around 0.1–0.3 should work well. Also try weight decay (L2 regularization); typical values are 0.001 or 0.01. The dropout setting is shown in the config sketch below.
  3. Use BPE; it helps handle out-of-vocabulary words, which matters especially for low-resource languages (see the SentencePiece sketch below).
  4. For evaluation metrics we use COMET only; it tracks human judgments of translation quality more closely than BLEU (a minimal scoring sketch follows below).
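
To make items 1 and 2 concrete, here is a minimal fine-tuning sketch in OpenNMT-py's YAML config format. All paths, the checkpoint name, and the exact values are placeholders to adapt, and you should check the option names against your OpenNMT-py version:

```yaml
# Minimal OpenNMT-py fine-tuning config (finetune.yaml).
# Paths, checkpoint name, and values are placeholders.
src_vocab: data/vocab.src
tgt_vocab: data/vocab.tgt

data:
    corpus_1:
        path_src: data/train.src
        path_tgt: data/train.tgt
    valid:
        path_src: data/valid.src
        path_tgt: data/valid.tgt

# Continue training from the pre-trained checkpoint.
train_from: models/pretrained_step_100000.pt
save_model: models/finetuned

# Item 1: small batches and a conservative fixed learning rate.
batch_size: 16
batch_type: sents
optim: adam
learning_rate: 0.0001
# Optionally enable a scheduler, e.g. decay_method: noam with warmup_steps.

# Item 2: stronger dropout against overfitting.
dropout: [0.3]
attention_dropout: [0.3]

# Validate and checkpoint often so you can stop early.
train_steps: 20000
valid_steps: 500
save_checkpoint_steps: 500
```

Launch it with `onmt_train -config finetune.yaml` and watch the validation scores to decide when to stop.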
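
For item 3, one common route is to train a BPE model with SentencePiece and apply it to the data before building vocabularies. The file paths and vocab size below are assumptions for illustration:

```python
import sentencepiece as spm

# Train a joint BPE model on source + target training text
# (paths are hypothetical; ~8k merges suit a ~10k-pair corpus).
spm.SentencePieceTrainer.train(
    input="data/train.src,data/train.tgt",
    model_prefix="bpe",
    model_type="bpe",
    vocab_size=8000,
    character_coverage=1.0,
)

# Apply the model: encode raw text into subword pieces.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("this is a test sentence", out_type=str))
```

Encode both sides of the parallel data with this model before training, and turn the system output back into plain text with `sp.decode(...)`.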
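
For item 4, the Unbabel `comet` package (`pip install unbabel-comet`) scores source / hypothesis / reference triples. The checkpoint name below is one published COMET model, and the data is a toy example:

```python
from comet import download_model, load_from_checkpoint

# Fetch and load a reference-based COMET checkpoint.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# Each item needs the source, the system translation, and a reference.
data = [
    {
        "src": "Ceci est un test.",
        "mt": "This is a test.",
        "ref": "This is a test.",
    },
]

# gpus=0 runs on CPU; the result holds per-segment and corpus-level scores.
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment
print(output.system_score)  # corpus-level
```

Score a held-out validation set at each saved checkpoint and stop fine-tuning once the system score plateaus.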