Transformer for NMT (Hindi-English)

I am using a Transformer model with PyTorch to translate Hindi sentences to English. I trained my model on a parallel corpus of about 1 million sentence pairs.
I am facing one issue which I am not able to understand. When I translate a sentence of, say, 35-40 words, it is not translated correctly in the first place. That's okay. But when I cut the sentence in half at the full stop, to say 20 words, and translate that, I get a completely different word for the corresponding Hindi word. This word is wrong, and it is also different from the one I got when I translated the full-length sentence.
Example:
original Hindi sentence : अविश्वास के प्रस्ताव में उन कारणों का उल्लेख नहीं होता जिन पर वह आधारित हो परंतु निंदा प्रस्ताव में ऐसे कारणों या आरोपों का उल्लेख करना आवश्यक होता है और यह प्रस्ताव कतिपय नीतियों और कार्यों के लिए सरकार की निंदा करने के विशिष्ट प्रयोजन से पेश किया जाता है .
full translation: The motion of no - confidence does not mention the reasons on which it is based but the censure motion calls for such reasons or allegations and the motion is moved for the particular purpose of specialties the government for certain policies and actions .

google translation: There is no mention of the reasons for which they are based, but it is necessary to mention such reasons or accusations in the motion of condemnation and this proposal is presented for specific purposes of condemning the government for certain policies and actions.

half sentence: यह प्रस्ताव कतिपय नीतियों और कार्यों के लिए सरकार की निंदा करने के विशिष्ट प्रयोजन से पेश किया जाता है .
translation: It is offered for the specific purpose of कतिपय the government for certain policies and actions .

कतिपय: specific
निंदा: condemn

My doubt is: if my model has been trained on these two words, why is it not always able to translate them? For the full sentence, it gives "specialties" in place of "condemn" and translates कतिपय correctly. For the short one, it translates कतिपय to "specific", which is correct, but it also copies the Hindi word into the output and is not able to translate निंदा.
I feel this is slightly weird behaviour. Any suggestions or help in this regard would be appreciated. I have also built a website demonstrating this; I can share it if anyone is willing to help and understand the issue.

It usually comes down to the training data.

What tokenization are you applying?

The Indic NLP tokenizer. I first tokenize using it, then preprocess and train (Transformer) using the OpenNMT-py scripts.
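Roughly, the Hindi-side tokenization step looks like the sketch below. This is a minimal sketch assuming the Indic NLP Library's trivial_tokenize; file names such as train.hi are placeholders, not my actual paths.

from indicnlp.tokenize import indic_tokenize

# Tokenize the Hindi side line by line and write space-joined tokens,
# which is the plain-text format expected before OpenNMT-py preprocessing.
with open("train.hi", encoding="utf-8") as fin, \
        open("train.tok.hi", "w", encoding="utf-8") as fout:
    for line in fin:
        tokens = indic_tokenize.trivial_tokenize(line.strip(), lang="hi")
        fout.write(" ".join(tokens) + "\n")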

Are you 100% sure your dataset is really clean?
Also, what model did you train and for how long? (Post your command line.)

Yes, clean in the sense that I have done the standard basic preprocessing. Below are my training commands:

python train.py -data data/proData_230104 -save_model model/model_230104-model -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 100000 -max_generator_batches 2 -dropout 0.1 -batch_size 6000 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 1 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -world_size 1 -gpu_ranks 0

python train.py -data data/procData_010105 -save_model model/model_010105-model -layers 6 -rnn_size 512 -word_vec_size 512 -pre_word_vecs_enc "data/embeddings_010105.enc.pt" -pre_word_vecs_dec "data/embeddings_010105.dec.pt" -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 100000 -max_generator_batches 2 -dropout 0.1 -batch_size 6000 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 0.25 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 10000 -save_checkpoint_steps 10000 -world_size 1 -gpu_ranks 0
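For translating, I then run the saved checkpoint through translate.py in the usual OpenNMT-py way; the checkpoint and file names below are placeholders rather than my exact paths:

python translate.py -model model/model_230104-model_step_100000.pt -src data/test.tok.hi -output data/pred.en -gpu 0 -verbose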

Hi Ajitesh,
I am using the same Transformer setup for Hindi to English translation.
I have some doubts about tokenization and embeddings. Can we communicate through email?

Hi Santosh, you can reach me at ajitesh1993@gmail.com.

Can you please tell us where you found the dataset (the 1M parallel corpus) for English to Hindi? We are working on the same and need more data. Hope you respond!

It's from IIT Bombay. Google it and you will find it.

We thought of using that corpus, but the data is not in general language; it has a lot of technical terms. We thought it would not work for general translation. Please share your thoughts on this.

Yes, the initial part of the corpus is junky; after about 1.5 lakh (150,000) sentence pairs I think it is usable. You can clean it.
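As a rough sketch of the kind of cleaning I mean (file names and thresholds here are only illustrative, not exactly what I used): drop empty lines, exact duplicates, and pairs with extreme length ratios.

# Filter a parallel corpus: skip empty lines, duplicate pairs, very long
# sentences, and pairs whose Hindi/English lengths differ wildly.
seen = set()
with open("IITB.en-hi.hi", encoding="utf-8") as fhi, \
        open("IITB.en-hi.en", encoding="utf-8") as fen, \
        open("clean.hi", "w", encoding="utf-8") as ohi, \
        open("clean.en", "w", encoding="utf-8") as oen:
    for hi, en in zip(fhi, fen):
        hi, en = hi.strip(), en.strip()
        if not hi or not en or (hi, en) in seen:
            continue
        seen.add((hi, en))
        n_hi, n_en = len(hi.split()), len(en.split())
        if max(n_hi, n_en) > 80 or n_hi / n_en > 3 or n_en / n_hi > 3:
            continue
        ohi.write(hi + "\n")
        oen.write(en + "\n")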

So did your model work fine for translating general human conversation?

It depends. For some simple, easy sentences it is good, but as the length increases it worsens. More importantly, in such cases it is unable to give a good translation for a part of a sentence, as opposed to the full sentence.
