Best Practices for Fine-Tuning OpenNMT Models with Limited Data

Hi everyone,

I’ve been experimenting with OpenNMT for a translation project, and I’m facing some challenges with fine-tuning a pre-trained model using a relatively small dataset. My dataset consists of around 10,000 sentence pairs in a low-resource language, and I want to make sure I’m taking the right steps to achieve good performance without overfitting.

Here are a few specific questions I’m hoping the community can help with:

  1. Batch Size and Learning Rate: What batch size and learning rate would you recommend when fine-tuning on such a small dataset? Should I start with the default values or adjust them based on dataset size?
  2. Regularization Techniques: Are there specific regularization techniques (e.g., dropout) that work particularly well in scenarios with limited data?
  3. Preprocessing: My data is already tokenized and cleaned, but would applying additional techniques like subword tokenization (e.g., BPE) offer noticeable improvements?
  4. Evaluation Metrics: What’s the best way to monitor progress during fine-tuning? BLEU score? Loss? Or is there something else I should focus on?
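
To make question 4 more concrete, here is a minimal sketch of the kind of check I have in mind: stop fine-tuning once validation loss has not improved for a few evaluations in a row. This is plain Python, not OpenNMT's API; the function name `should_stop` and the `patience` and `min_delta` values are hypothetical and only for illustration.

```python
def should_stop(val_losses, patience=3, min_delta=0.0):
    """Return True if the last `patience` validation losses did not
    improve on the best earlier loss by more than `min_delta`.

    Hypothetical helper for illustration; not part of OpenNMT.
    """
    # Not enough evaluations yet to judge a trend.
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])   # best loss seen earlier
    recent_best = min(val_losses[-patience:])   # best loss in the recent window
    # Stop when the recent window failed to beat the earlier best.
    return recent_best >= best_before - min_delta


# Example: validation loss recorded after each evaluation step.
history = [2.0, 1.5, 1.2, 1.3, 1.4, 1.35]
stop_now = should_stop(history, patience=3)
```

The same check could be run on validation BLEU by negating the scores, since higher BLEU is better while lower loss is better.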

I checked this thread: https://forum.opennmt.net/t/fine-tune-opennmt-model-on-domaiDevOpstraining but did not find a solution there. Could anyone offer some guidance? If you have successfully fine-tuned models under similar conditions, I’d love to hear about your experience and any tips you can share.

Thanks in advance for your help!