Random behaviour in translation wrt punctuation and proper nouns

Hi, I am trying to translate from Hindi to English. There are some things I am unable to understand. My translations change absurdly depending on punctuation and proper nouns. Below are my examples. Any insight into why this is happening and how to improve it?

Original Hindi Sentence : उपहारस्वरूप शिव खेड़ा की प्रचलित पुस्तक ’जीत आपकी‘ भेज रहा हूँ।
This means: As a gift, I am sending you “Jeet Apki”, the popular book of Shiv Kheda.
Here, उपहारस्वरूप = as a gift
शिव खेड़ा = name of the writer (Shiv Kheda)
प्रचलित = popular
पुस्तक = book
’जीत आपकी‘ = name of the book (Jeet Apki)
भेज रहा हूँ = sending

Experiment 1:
उपहारस्वरूप शिव खेड़ा की प्रचलित पुस्तक ’जीत आपकी‘ भेज रहा हूँ।
Google translate : I am sending the popular book “Vijay Your” of Shiv Khehra.
My model: "I am sending you a gifted book of Shiva Kheda.

Experiment 2: removed the purna viram (the Hindi full stop, "।")
उपहारस्वरूप शिव खेड़ा की प्रचलित पुस्तक ’जीत आपकी‘ भेज रहा हूँ
Google translate: Send ‘Vijay Your’ book of the gifted Shiv Kheda
My model : As a gift, the popular book of Shiv Khedra is sent to you.

Experiment 3: used double quotes for the name of the book
उपहारस्वरूप शिव खेड़ा की प्रचलित पुस्तक “जीत आपकी” भेज रहा हूँ।
Google : I am sending the gift book “Vijay Your” of Shiv Khehra.
My model: I am sending you a “victory”, a renowned book of Shiva Kheda.

Experiment 4: used single quotes for the name of the book
उपहारस्वरूप शिव खेड़ा की प्रचलित पुस्तक ‘जीत आपकी’ भेज रहा हूँ।
Google : I am sending the popular book “Vijay Your” of Shiv Khehra.
My model : I am sending you a ‘victory’ in the present book of the gifted Shiv Khera.

Experiment 5: changed the name of the book to my name, in single quotes
उपहारस्वरूप शिव खेड़ा की प्रचलित पुस्तक ‘अजितेश शर्मा’ भेज रहा हूँ।
Google: I am sending the famous book ‘Ajitesh Sharma’ by Shiv Khehra.
My model : I am sending gifted gifts to Ajitesh Sharma.

Experiment 6: changed the name of the book to my name, in double quotes
उपहारस्वरूप शिव खेड़ा की प्रचलित पुस्तक “अजितेश शर्मा” भेज रहा हूँ।
Google: I am sending the famous book “Ajitesh Sharma” by Shiv Khehra.
My model: I am sending a gifted book called Ajitesh Sharma.

a) What is the reason for this random behaviour with respect to punctuation and proper nouns?
b) When the name of the book is my name, “Ajitesh Sharma”, it is transliterated properly, as a proper noun should be. But when the name is “Jeet Apki”, the model tries to translate the proper noun by its meaning, and does so inconsistently, with no discernible pattern.
I have tokenized my Hindi input sentences using Indic NLP and the English ones using the Moses tokenizer, for both training and test. After that, I also apply SentencePiece to both.

Any suggestions on what I am missing or doing wrong?

Dear Ajitesh,

1- Do you use the Transformer model? Paraphrasing is a known behaviour of the Transformer model, especially when it cannot find the translation.
2- Do you have (several instances of) these proper nouns in your training data? If not, this is the main reason.
3- Google (and most MT professionals) apply pre-processing and post-processing steps that handle such issues. NMT models will not transliterate a word just because it is a proper noun (unless they learned this from the training dataset). Actually, YOU have to handle such linguistic issues.

As I said, the best way is to train the model on these words. However, there are several posts on the forum about methods for handling terminology at translation time (on the fly). One is to train the model on a placeholder (any unused character); then, at translation time, you replace a specific word/phrase in the source with this placeholder, and finally replace the placeholder in the translation with the correct target word/phrase. To select the word/phrase to replace in the source, you need either a list or, in the case of proper nouns, some named-entity recognition method.
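A minimal sketch of the list-based variant of this idea, in Python. The marker format `<ph0>`, the function names and the glossary entry are assumptions made for illustration only; they are not part of OpenNMT. The marker also has to survive your tokenization and subword steps unchanged, which is not shown here.

```python
def mask_terms(source, glossary):
    """Replace glossary source phrases with numbered placeholders.

    glossary maps a source phrase to the target phrase you want to see
    in the final translation (hypothetical data, supplied by you).
    """
    mapping = {}
    masked = source
    # Longest phrases first, so a short term does not break a longer one.
    for src_term in sorted(glossary, key=len, reverse=True):
        if src_term in masked:
            token = "<ph{}>".format(len(mapping))  # assumed marker format
            masked = masked.replace(src_term, token)
            mapping[token] = glossary[src_term]
    return masked, mapping


def unmask_terms(translation, mapping):
    """Put the desired target phrases back in place of the placeholders."""
    for token, target_term in mapping.items():
        translation = translation.replace(token, target_term)
    return translation


glossary = {"जीत आपकी": "Jeet Apki"}
source = "उपहारस्वरूप शिव खेड़ा की प्रचलित पुस्तक ’जीत आपकी‘ भेज रहा हूँ।"
masked, mapping = mask_terms(source, glossary)
# `masked` is what you would send to the model; the model output is
# then passed through unmask_terms(model_output, mapping).
```

This only works if the model has learned to copy the marker from source to target, which is why the marker must also appear in the training data.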

Kind regards,
Yasmin

Hi Yasmin,

1. Yes, I am using the Transformer model. Paraphrasing is not an issue for me; in fact, it only helps.
2. Let me take one example: the famous book “To Kill A Mockingbird”. In my dataset (source Hindi, target English) I have all the words “to”, “kill”, “a” and “mockingbird” on both the source and the target side, but not the book name “To Kill A Mockingbird”. When I translate Hindi sentences in which these words are used separately (i.e. not as the book title), my model translates well. But when I use the whole title in my Hindi sentence, "एक मॉकिंगबर्ड को मारने के लिए", and translate it, I get a translation and not a transliteration. I hope I have made my point.
3. Yes, I am pre-processing as well as post-processing my data (tokenizing, using subwords/SentencePiece), but I am not sure how to handle this problem.

Could you please elaborate on your last paragraph about handling terminology? How can I achieve that?

I have two other questions:
1. Is it necessary to remove punctuation from the dataset before training, and also from the test data? My translations give weird results when I play with punctuation, such as adding or removing full stops, inverted commas, etc. You can see the examples in my original post. What is the reason for such behaviour?
2. What does OpenNMT-py use inside: word embeddings or sentence embeddings? I encode my sentences using SentencePiece and get something like this: [‘▁a’, ‘▁black’, ‘▁box’, ‘▁in’, ‘▁your’, ‘▁car’, ‘▁?’]. I preprocess this using the preprocess.py script and feed it to training. On translation I get output in a similar format, which I decode to make it readable. So at what stage does the embedding happen? Is it the first layer of the encoder-decoder Transformer? And what algorithm does it use for the embeddings: word2vec, GloVe, fastText?

Thanks in advance

Dear Ajitesh,

Regarding the placeholder approach I was talking about, you can check this post for example, especially the answer by Jean:
http://forum.opennmt.net/t/if-there-is-any-way-to-keep-placeholders-as-same-as-source-when-call-nmt/

In your training data, you must have this placeholder translated to the same placeholder in several sentences, so that when you use it during translation, the model will give it back to you. This way, you can replace this placeholder with whatever you want. Is it clear now?
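One possible way to produce such training sentences is sketched below in Python, assuming you already have a list of aligned term pairs. The function name, the `<ph0>` marker and the term list are hypothetical. The sentence pair is the example from the first post in this thread.

```python
def make_placeholder_pairs(pairs, term_pairs, placeholder="<ph0>"):
    """Build extra training pairs where a known term is replaced by the
    same placeholder on both the source and the target side."""
    augmented = []
    for source, target in pairs:
        for src_term, tgt_term in term_pairs:
            # Only use the pair if the term appears on both sides;
            # otherwise source and target would no longer match.
            if src_term in source and tgt_term in target:
                augmented.append((source.replace(src_term, placeholder),
                                  target.replace(tgt_term, placeholder)))
                break  # one placeholder per sentence in this sketch
    return augmented


pairs = [
    ("उपहारस्वरूप शिव खेड़ा की प्रचलित पुस्तक ’जीत आपकी‘ भेज रहा हूँ।",
     "As a gift, I am sending you “Jeet Apki”, the popular book of Shiv Kheda."),
]
term_pairs = [("जीत आपकी", "Jeet Apki")]
extra = make_placeholder_pairs(pairs, term_pairs)
```

The pairs in `extra` would be added to the original training data rather than replacing it, so that the model still sees ordinary sentences as well.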

As for punctuation marks, if you delete them from the target during training, you might end up without them in the translation. I do not know Hindi, but I do not see Latin punctuation marks in your example source sentences. If you mean a specific punctuation mark such as quotes, then maybe you could remove it if it is causing you problems, but again, be careful.
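A gentler alternative to deleting quotes is to map the different quote styles to one form, and to apply the same mapping to the training, validation and test data. The Python sketch below assumes that only the four curly quote characters from the experiments above need handling. Whether this actually stabilises the output has not been tested in this thread.

```python
# Map the curly quote characters from the examples to plain ASCII quotes.
QUOTE_MAP = str.maketrans({
    "‘": "'",
    "’": "'",
    "“": '"',
    "”": '"',
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(QUOTE_MAP)


print(normalize_quotes("पुस्तक ’जीत आपकी‘ भेज रहा हूँ।"))
print(normalize_quotes("पुस्तक “जीत आपकी” भेज रहा हूँ।"))
```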

By the way, what is the size of your training dataset?

Kind regards,
Yasmin

Hi Yasmin,
Pardon me for the delayed response. I checked the placeholder approach above. It did work for named entities that are proper nouns, like names. But my problem is with names made up of ordinary words, e.g. the book title “Competition Success Review”. Since in my case I am using it as a noun (a book name), I want it to be transliterated, not translated, in the target language. Any idea how I can achieve this?
My data size is around 1M.

Dear Ajitesh,

Could you please elaborate on what exactly you did?

Also, one question: is this book title so popular that you are ready to add it, among others, to a terminology list to use during translation pre-processing/post-processing, or to your training data in the first place? Or do you mean that you want your model to do this on its own for unseen book names? If the latter, based on which criteria?

Kind regards,
Yasmin

Hi Yasmin,
It's not just a book title or one or two nouns. Let me elaborate. “United States of America” is the name of a country and hence a proper noun. In my training set I have, let's say, some 50 examples that translate it to its Hindi version, संयुक्त राज्य अमरीका (Sanyukt Raajy Amareeka). This is a translation, and it is perfectly fine. But in some places I want it to be transliterated to यूनाइटेड स्टेट्स ऑफ अमेरिका, which is just the transliteration (based on phonetics). Any idea how I could achieve this? One solution, for sure, is to train my model like that. But I can't do that, because my model is trained for translation and not transliteration. Can I achieve this using some kind of placeholder approach? I hope I have been able to explain my problem now.

Dear Ajitesh,

What was the placeholder you used? How did you use it during pre-processing of the text to be translated and during post-processing of the translation?

Again, don't you agree that, for book titles that have a meaning but may still be transliterated in other cases, the reader would expect such behaviour? I mean, Machine Translation cannot be expected to be 100% perfect. Plus, maybe the reader wants to have both in the final version (the meaning of the book title and its transliteration). The issue here is that Hindi does not have capital letters (I searched Google) and those titles are not in your training data. So unless you can find a linguistic feature that distinguishes these titles, the only option I can think of (other than adding them to your training data) is to replace them manually using the placeholder method.
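In the examples from this thread, the titles happen to be wrapped in quotes, so quotes could serve as one such distinguishing feature. The Python sketch below masks every quoted span with a placeholder and keeps the original text, so that it can be transliterated separately and put back after translation. The function names and the `<ph0>` marker format are assumptions for illustration. Where the transliterated text comes from (a lookup list or a transliteration tool) is left open.

```python
import re

# Quote characters that appear in the examples of this thread.
QUOTED = re.compile('[‘’“”"]([^‘’“”"]+)[‘’“”"]')


def mask_quoted(source):
    """Replace the text inside each pair of quotes with a placeholder.

    Returns the masked sentence and the list of original quoted spans.
    """
    spans = []

    def repl(match):
        spans.append(match.group(1))
        whole = match.group(0)
        # Keep the original opening and closing quote characters.
        return whole[0] + "<ph{}>".format(len(spans) - 1) + whole[-1]

    return QUOTED.sub(repl, source), spans


def restore_quoted(translation, replacements):
    """Insert the chosen target-side text for each placeholder."""
    for i, text in enumerate(replacements):
        translation = translation.replace("<ph{}>".format(i), text)
    return translation


masked, spans = mask_quoted(
    "उपहारस्वरूप शिव खेड़ा की प्रचलित पुस्तक “जीत आपकी” भेज रहा हूँ।")
# `spans` now holds the title to transliterate by other means.
```

This treats every quoted span as a title, which will be wrong for ordinary quoted speech, so it only makes sense where quotes are used mainly for titles.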

If transliteration is really important for your domain, it should be in your training data. For example, the same word might occur both as a person's name (transliterated) and as an ordinary word (translated). In this case, you will find both in the training dataset, and it is easy for the model to tell the difference by itself from the context, like when you say “I met xxx” as a proper name vs. “it is very xxx” as an adjective, for example.

Kind regards,
Yasmin