I trained my model on a specific parallel corpus of about 200,000 (2 lakh) sentence pairs. The corpus consists of good, long sentences. When I test on sentences related to my training data's domain, I get good results even for longer sentences (probably because my training corpus is also composed of long sentences). But one thing I don't understand is why my model is not able to translate some very basic sentences like "Where are you going?" or "What is my name?". Surely this vocabulary is present in the training data, even if the whole sentences are not. Is there any particular reason for such behaviour?
I recommend using a translation memory.
There’s nothing like a TM when you process short sentences.
Could you please elaborate on that?
Translation Memories (TMs) are databases of paired sentences that have been pre-translated. During the translation process, TM entries are matched with sentences in the source text.
It’s a database of entries, such as short dialogue sentences, that have been translated in advance.
What Park suggests is that if you are developing a complete system that offers translation, you can first check a Translation Memory (basically your original dataset stored as source and target segments) to see whether the sentence to be translated (short, long, or whatever) is available as is. In this case, you might also need a similarity/matching algorithm like Edit Distance to catch fuzzy matches that differ slightly from the stored source. This is a brilliant solution, and many MT professionals use similar approaches, but maybe it is not your question.
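To make the TM idea concrete, here is a minimal sketch of an exact-plus-fuzzy TM lookup. The dictionary-based TM, the Levenshtein implementation, and the `max_ratio` threshold are all illustrative assumptions, not a production design (a real system would index the TM rather than scan it linearly).

```python
# Minimal sketch of a Translation Memory lookup with fuzzy matching.
# The TM is just a dict of source -> target segments; the 0.3
# threshold and plain Levenshtein distance are illustrative choices.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def tm_lookup(sentence, tm, max_ratio=0.3):
    """Return an exact TM hit, else the closest fuzzy match whose
    normalized edit distance is below max_ratio, else None."""
    if sentence in tm:
        return tm[sentence]
    best_src, best_dist = None, float("inf")
    for src in tm:
        d = edit_distance(sentence, src)
        if d < best_dist:
            best_src, best_dist = src, d
    if best_src is not None and best_dist / max(len(sentence), 1) <= max_ratio:
        return tm[best_src]
    return None  # no match: fall back to the NMT model
```

On a miss, the caller would simply pass the sentence on to the trained model, so the TM acts as a cheap, exact front end for sentences already seen in the training data.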
Your question is how to solve this in OpenNMT. As you said, since your model is trained on long sentences, it gives good translations for long sentences. So the simple way to get good translations for short sentences is to add shorter sentences (perhaps split from longer ones) to your training dataset. When I did this, I got better translations for shorter sentences; but when I went too far and added too many short phrases (I extracted several possibilities), some translations included unnecessary words and the overall BLEU score dropped. To conclude: you do need short sentences/phrases in your training dataset, but do not add so many that they greatly outnumber the original long sentences, especially if your domain usually contains more long sentences than short ones. It is a tradeoff, so you have to identify your priorities.
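One hedged way to derive those shorter pairs is to split long parallel sentences on sentence-final punctuation and keep the split only when both sides produce the same number of segments, assuming a 1:1 segment correspondence. That assumption often fails across languages, so treat this purely as a sketch; real clause alignment needs a proper aligner.

```python
import re

# Split on whitespace that follows sentence-final punctuation.
SPLIT_RE = re.compile(r'(?<=[.!?])\s+')

def split_pair(src: str, tgt: str):
    """Split a long parallel pair into shorter aligned pairs,
    but only when source and target yield equal segment counts
    (an assumption, not a guarantee of correct alignment)."""
    src_parts = SPLIT_RE.split(src.strip())
    tgt_parts = SPLIT_RE.split(tgt.strip())
    if len(src_parts) == len(tgt_parts) and len(src_parts) > 1:
        return list(zip(src_parts, tgt_parts))
    return [(src, tgt)]  # counts differ: keep the original pair
```

Running this over the corpus and capping how many extra short pairs you keep is one way to manage the tradeoff described above.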
If you want to add more short sentence pairs from another corpus, then you could try treating it as multi-domain adaptation. First try adding labeled short sentences to your existing model. Through parameter sharing, this may even improve the translation quality of the unlabeled longer sentences.
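The labeling step above is usually done by prepending a domain token to the source side before training, in the style of domain-control approaches. A minimal sketch, where the token name `<short>` is purely illustrative:

```python
# Hedged sketch of domain labeling for multi-domain adaptation:
# prepend a domain token to each source sentence so the model can
# condition on the domain. Token names here are assumptions.

def label_corpus(pairs, domain_token):
    """Prepend a domain token to the source side of each pair."""
    return [(f"{domain_token} {src}", tgt) for src, tgt in pairs]
```

At inference time you would prepend the same token to sentences you want translated in that domain's style, and leave your original in-domain data unlabeled (or give it its own token).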