I currently have access to around 35k English translations of classical Arabic texts (roughly 3,000,000 tokens in each language). All of these texts belong to a specific literary genre known as hadith, which is the focus of my translation project.
Unfortunately, there are several issues with my training data. The translations are not sentence-aligned and are not entirely ‘parallel’. Some simply summarise the narrative of a given hadith. Others translate the exact phrases of the hadith but skip its first part (the isnad, i.e. the chain of narrators through which the hadith was orally transmitted). Writing a sentence-alignment algorithm from scratch would be quite difficult because classical Arabic has limited punctuation.
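To make the ‘not entirely parallel’ problem concrete, here is a minimal sketch of how the summarised or isnad-less translations could be flagged by comparing token counts. It assumes naive whitespace tokenisation, and the function names and the `low`/`high` thresholds are hypothetical placeholders that I have not tuned on my data.

```python
def length_ratio(arabic: str, english: str) -> float:
    """English-to-Arabic token ratio, using naive whitespace tokenisation."""
    src_len = len(arabic.split())
    tgt_len = len(english.split())
    return tgt_len / src_len if src_len else 0.0


def flag_non_parallel(pairs, low=0.5, high=3.0):
    """Split (arabic, english) pairs into (kept, flagged) by length ratio.

    low and high are hypothetical thresholds, not tuned values.
    """
    kept, flagged = [], []
    for arabic, english in pairs:
        ratio = length_ratio(arabic, english)
        # A very low ratio suggests a summary or a translation missing the isnad.
        if low <= ratio <= high:
            kept.append((arabic, english))
        else:
            flagged.append((arabic, english))
    return kept, flagged
```

A length ratio only catches gross mismatches; it says nothing about whether the remaining pairs are aligned sentence by sentence.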
I have tried training an OpenNMT model with each entire translation stored on a single line, but that did not seem to work (possibly because I forgot to change the parameters for sentence length and truncation). I could try changing those OpenNMT training parameters, but I think I may need to align my training dataset first if I want better translations.
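As a rough check on whether the length settings are the culprit, something like the following could count how many of my one-line-per-text examples exceed a given token limit. This is only a sketch: it assumes whitespace tokenisation, and `max_tokens=50` is a placeholder value rather than a setting I have confirmed in my own configuration.

```python
def over_limit_stats(texts, max_tokens=50):
    """Return (count over limit, fraction over limit, longest length in tokens)."""
    lengths = [len(t.split()) for t in texts]
    if not lengths:
        return 0, 0.0, 0
    over = sum(1 for n in lengths if n > max_tokens)
    return over, over / len(lengths), max(lengths)
```

If most lines turn out to be over the limit, then much of the training data was probably being dropped or truncated, which would explain the poor results independently of the alignment problem.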
Is anyone aware of possible solutions to problems like this? Any help is appreciated.
P.S. I apologise if this post does not accurately represent the problem I am facing; I am still learning NMT terminology.