NLLB fine-tuning for a very low-resource dialect

Hello, I hope you are all well!

I am new here. I am working on fine-tuning the NLLB-200 model for a very low-resource dialect ‘A’ of a larger dialect ‘B’. Both A and B are close to a language C, but still quite different from it. I managed to build a corpus of translated content A<>C of 7k+ sentences and 40k+ words. Note that B and C are already supported by the pre-trained NLLB-200 model, while A is mostly non-existent on the internet.
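
In case it helps frame the question: since A has no NLLB-200 language code, I first register a new tag for it. A minimal sketch of that step, assuming a recent transformers version; the `dlA_Latn` tag is a placeholder I invented, not a real NLLB code:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# "dlA_Latn" is a made-up tag for dialect A; NLLB uses tags like "fra_Latn".
new_lang = "dlA_Latn"

# Keep the existing language codes and append the new one as a special token.
tokenizer.add_special_tokens(
    {"additional_special_tokens": tokenizer.additional_special_tokens + [new_lang]}
)
# Grow the embedding matrix so the new token gets a (randomly initialized) row.
model.resize_token_embeddings(len(tokenizer))
```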

I have an acceptable working version obtained by fine-tuning NLLB-200-distilled-1.3B on the A<>C corpus I already have. I am now wondering whether I should effectively duplicate this corpus by translating the C side into B, and add the resulting A<>B corpus to the fine-tuning data.
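
For reference, my current setup looks roughly like this, continuing from the snippet above (so `tokenizer`, `model`, and `new_lang` are already defined). The `cat_Latn` tag standing in for C, the toy sentences, and the hyperparameters are all placeholders, not recommendations:

```python
from datasets import Dataset
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

# Toy rows standing in for my 7k+ A<>C pairs (A -> C direction shown;
# the mirrored C -> A rows are prepared the same way).
pairs = Dataset.from_dict({
    "src": ["a sentence in dialect A ..."],
    "tgt": ["its translation in language C ..."],
})

def preprocess(batch):
    tokenizer.src_lang = new_lang    # dialect A as the source
    tokenizer.tgt_lang = "cat_Latn"  # placeholder tag for language C
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=128)

tokenized_ac = pairs.map(preprocess, batched=True,
                         remove_columns=["src", "tgt"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="nllb-finetuned-a-c",
        per_device_train_batch_size=8,
        learning_rate=1e-4,
        num_train_epochs=3,
    ),
    train_dataset=tokenized_ac,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```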

Would it help my fine-tuned model's performance, given that A is a dialect of B and the two are close in grammar and spirit? Would I be better off training on both A<>C & A<>B, on A<>C only, or on A<>B only?
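
If I do build the A<>B set, I was thinking of weighting the two corpora with `datasets.interleave_datasets` rather than just concatenating them. A self-contained sketch with toy data; the 0.7/0.3 split is just a knob I would experiment with:

```python
from datasets import Dataset, interleave_datasets

# Toy stand-ins for the tokenized A<>C and A<>B corpora (same columns in both).
ac = Dataset.from_dict({"input_ids": [[1, 2]], "labels": [[3, 4]]})
ab = Dataset.from_dict({"input_ids": [[5, 6]], "labels": [[7, 8]]})

# Sample ~70% A<>C / ~30% A<>B; cycle until both corpora are exhausted.
mixed = interleave_datasets([ac, ab], probabilities=[0.7, 0.3], seed=42,
                            stopping_strategy="all_exhausted")
```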

I'm open to any feedback or suggestions on how to set up the fine-tuning! For info, the dialects are from the upper Mediterranean region. Thanks a lot in advance :slight_smile: