I am using a transformer model with PyTorch to translate Hindi sentences to English. I trained my model on a parallel corpus of roughly 1 million sentences.
I am facing an issue that I am not able to understand. When I translate a sentence of, say, 35-40 words, it is not translated correctly in the first place. That's okay. But when I cut the sentence in half using the full stop, to about 20 words, and then translate it, I get a completely different English word for the same Hindi word. This word is wrong, and it is also different from the one I got when I translated the full-length sentence.
Example:
original Hindi sentence : अविश्वास के प्रस्ताव में उन कारणों का उल्लेख नहीं होता जिन पर वह आधारित हो परंतु निंदा प्रस्ताव में ऐसे कारणों या आरोपों का उल्लेख करना आवश्यक होता है और यह प्रस्ताव कतिपय नीतियों और कार्यों के लिए सरकार की निंदा करने के विशिष्ट प्रयोजन से पेश किया जाता है .
full translation : The motion of no - confidence does not mention the reasons on which it is based but the censure motion calls for such reasons or allegations and the motion is moved for the particular purpose of specialties the government for certain policies and actions .
google translation: There is no mention of the reasons for which they are based, but it is necessary to mention such reasons or accusations in the motion of condemnation and this proposal is presented for specific purposes of condemning the government for certain policies and actions.
half sentence: यह प्रस्ताव कतिपय नीतियों और कार्यों के लिए सरकार की निंदा करने के विशिष्ट प्रयोजन से पेश किया जाता है .
translation : It is offered for the specific purpose of कतिपय the government for certain policies and actions .
कतिपय: means certain (some)
निंदा: condemn
My doubt is: if my model was trained on data containing these two words, why is it not able to translate them consistently? For the full sentence, it gives "specialties" in place of "condemning" but translates कतिपय correctly as "certain". For the short one, it again translates कतिपय as "certain", which is correct, but it also puts the Hindi word कतिपय itself into the output, and it is still not able to translate निंदा.
I feel this is rather weird behaviour. Any suggestions or help in this regard would be appreciated. I have also built a website demonstrating this; I can share it if anyone is willing to help understand the issue.
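To narrow this down, a small diagnostic along the following lines might help. It is only a sketch under stated assumptions: tokens are split on whitespace (the real model may use subword units), and `toy_vocab` is a hypothetical stand-in for the actual source vocabulary, which is not shown in this thread. It flags output tokens that are still in Devanagari script and source tokens missing from the vocabulary.

```python
import re

# Devanagari Unicode block (U+0900 to U+097F), which covers Hindi script.
DEVANAGARI = re.compile(r"[\u0900-\u097F]")


def untranslated_tokens(output_sentence):
    """Return whitespace-separated output tokens that still contain Devanagari."""
    return [tok for tok in output_sentence.split() if DEVANAGARI.search(tok)]


def oov_tokens(source_sentence, source_vocab):
    """Return source tokens absent from the given vocabulary (assumed word-level)."""
    return [tok for tok in source_sentence.split() if tok not in source_vocab]


# Model output for the half sentence, copied from the example above.
half_output = (
    "It is offered for the specific purpose of कतिपय the government "
    "for certain policies and actions ."
)
leaked = untranslated_tokens(half_output)
print(leaked)  # ['कतिपय']

# Hypothetical toy vocabulary, for illustration only; the real one comes
# from whatever tokenizer or vocabulary file the model was trained with.
toy_vocab = {"यह", "प्रस्ताव", "सरकार", "की"}
missing = oov_tokens("यह प्रस्ताव सरकार की निंदा", toy_vocab)
print(missing)  # ['निंदा']
```

If a word such as निंदा turns out to be missing or very rare in the real vocabulary, that would be one possible explanation for the unstable output, though it would not rule out other causes such as context length.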
Hi Ajitesh,
I am using the same transformer for Hindi-to-English translation.
I have some doubts about tokenization and embeddings. Can we communicate through email?
Can you please tell me where you found the dataset (1M parallel corpus) for English to Hindi? We are working on the same thing and need more data. Hope you respond!
We thought of using that corpus, but the data is not in general language; it is full of technical terms. We thought it would not work for general translation. Please share your thoughts on this.
It depends. For simple, easy sentences it is good, but as the length increases it gets worse. More importantly, it is unable to give a good translation for part of a sentence rather than a full sentence; in such cases