Arabic-English Translation with Limited Training Data


I currently have access to around 35k English translations of classical Arabic texts (roughly 3,000,000 tokens in each language). All of these texts belong to a specific literary genre known as hadith, which is the focus of my translation project.

Unfortunately, there are several issues with my training data. The translations that I have are not sentence-aligned and are not entirely ‘parallel’. Some translations simply summarise the narrative of a given hadith. Others translate exact phrases in the hadith but skip its first part (known as the isnad, the chain of narrators through which the hadith was orally transmitted). Writing a sentence-alignment algorithm from scratch would be quite difficult because classical Arabic has limited punctuation.
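One rough way to spot the summary-style translations is to compare token counts on each side of a pair. The sketch below is only a heuristic, and the names `length_ratio` and `flag_suspect_pairs` as well as the `low` and `high` thresholds are illustrative assumptions rather than tuned values. It will not detect a missing isnad on its own, but it may help separate obviously non-parallel pairs for manual review.

```python
def length_ratio(arabic_text, english_text):
    """Return the English/Arabic whitespace-token ratio (0.0 if Arabic is empty)."""
    ar_len = len(arabic_text.split())
    en_len = len(english_text.split())
    return en_len / ar_len if ar_len else 0.0


def flag_suspect_pairs(pairs, low=0.5, high=3.0):
    """Split (arabic, english) pairs into likely-parallel and suspect lists.

    `low` and `high` are assumed placeholder thresholds; inspect the ratio
    distribution of your own corpus before picking real values.
    """
    kept, suspect = [], []
    for arabic_text, english_text in pairs:
        ratio = length_ratio(arabic_text, english_text)
        if low <= ratio <= high:
            kept.append((arabic_text, english_text))
        else:
            suspect.append((arabic_text, english_text))
    return kept, suspect
```

Pairs that land in `suspect` are not necessarily bad; English translations of Arabic tend to be longer than the source, so the thresholds should be checked against a sample that is known to be parallel.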

I have tried training an OpenNMT model with each entire translation stored on a single line, but that did not seem to work (possibly because I forgot to change the parameters for sentence length and truncation). I could try changing those parameters for onmt training, but I think I may need to align my training dataset first if I want better translations.
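Before changing any training parameters, it may help to measure how many lines would exceed a length limit in the first place. This is a minimal sketch that counts whitespace tokens only; the function name `length_report` and the default `max_tokens=200` are assumptions for illustration, so substitute whatever maximum sequence length your own training configuration applies.

```python
def length_report(lines, max_tokens=200):
    """Summarise whitespace-token lengths for an iterable of text lines.

    `max_tokens` is an assumed placeholder for the length limit used in
    training; lines longer than it are counted as `overlong`.
    """
    lengths = sorted(len(line.split()) for line in lines)
    if not lengths:
        return {"lines": 0, "max": 0, "median": 0, "overlong": 0}
    return {
        "lines": len(lengths),
        "max": lengths[-1],
        "median": lengths[len(lengths) // 2],  # upper median for even counts
        "overlong": sum(1 for n in lengths if n > max_tokens),
    }
```

If most lines count as overlong, a whole-document-per-line corpus is probably being filtered out or truncated heavily, which would fit the poor results described above. Note that subword tokenisation produces more tokens than whitespace splitting, so this estimate is a lower bound.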

Is anyone aware of possible solutions to problems like this? Any help is appreciated.


P.S. I apologise if this post does not accurately represent the problem I am facing; I am still learning NMT terminology.

You need more data. Google “arabic english nmt” and you’ll find it.

I’m considering using Stanza to split my texts into sentences, Hunalign (alongside an Arabic-English dictionary) to sentence-align them, and then training an onmt model on that data. Would it be worth doing this and then also collecting parallel corpora of classical Arabic texts to train my model, or would it be better to rely solely on the parallel corpora that I find?
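The sentence-splitting step could look roughly like the sketch below. The helper names (`build_pipeline`, `split_sentences`, `write_one_per_line`) are hypothetical; only `stanza.download`, `stanza.Pipeline`, and the `sentences` / `text` attributes come from Stanza itself. Two assumptions to flag: Stanza's Arabic models are, as far as I know, trained mostly on modern text, so segmentation quality on sparsely punctuated classical Arabic is uncertain, and the one-sentence-per-line output is what I understand Hunalign to expect as input.

```python
def build_pipeline(lang):
    """Create a tokenize-only Stanza pipeline for `lang` (e.g. "ar" or "en")."""
    import stanza  # imported lazily so the helpers below work without Stanza

    stanza.download(lang, processors="tokenize")  # fetches models on first use
    return stanza.Pipeline(lang=lang, processors="tokenize")


def split_sentences(text, nlp):
    """Return the sentence strings that the pipeline `nlp` finds in `text`."""
    return [sentence.text for sentence in nlp(text).sentences]


def write_one_per_line(sentences, path):
    """Write sentences to `path`, one per line, collapsing internal whitespace."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(" ".join(sentence.split()) + "\n")
```

A typical call would build one pipeline per language, run `split_sentences` on each hadith and its translation, and write the two sides to separate files for the aligner.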

Dear Umar,

You can find a lot of datasets at OPUS.

I have also aligned Quran verses to their English explanation by Yusuf Ali. (download link)

Yes, it is worth trying to align the data you have (and if you do, please publish it so that others can benefit from it).

My suggestion would be to train a model on some OPUS corpora and then fine-tune it with Quran and Hadith datasets. You can find more details about Domain Adaptation if you search this forum or at:

All the best,