If I translate the following sentence in DeepL from French to English…
La crise liée à la COVID-19 a creusé les inégalités préexistantes.
… I get the following translation:
The VIDOC-19 crisis has deepened pre-existing inequalities.
1- If I click the proposed word “VIDOC-19”, I can get other suggestions like “COVID-19”.
2- If I change the word “pre-existing” to say “already”, it changes the next part of the translation accordingly.
I understand that “1” can be done by word alignment and “2” can be done by something like lexical constraints. My question: is it possible to apply 1 and 2 with OpenNMT (either py or tf) without changing the code?
For autocompletion, the target prefix is fed into the decoder in teacher forcing mode, and then we simply continue decoding from there. It’s the same idea as giving a GPT-style model a prefix and letting it generate the rest.
For the alternatives, we also feed the prefix in teacher forcing mode and expand the next N most likely tokens. Then we continue the decoding for these N hypotheses independently. This approach could be improved to give more/better alternatives.
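To make the two behaviours above concrete, here is a minimal toy sketch in plain Python. It is not CTranslate2's implementation: `toy_step`, `TOY_TABLE`, `greedy_continue` and `alternatives` are hypothetical names, and the "model" is just a lookup table keyed on the last token. In CTranslate2 itself, the corresponding options should be `target_prefix` and `return_alternatives` on `translate_batch`.

```python
from typing import Callable, Dict, List

EOS = "</s>"

# Hypothetical stand-in for one decoder step: given the target tokens so far,
# return a score for each candidate next token.
StepFn = Callable[[List[str]], Dict[str, float]]

# Toy "model": the next-token scores depend only on the last token.
TOY_TABLE: Dict[str, Dict[str, float]] = {
    "▁crisis": {"▁has": 0.6, "▁inc": 0.3, "▁wid": 0.1},
    "▁has": {"▁inc": 0.7, "▁deep": 0.3},
    "▁inc": {"reased": 0.9, EOS: 0.1},
    "▁wid": {"ened": 1.0},
    "▁deep": {"ened": 1.0},
    "reased": {".": 1.0},
    "ened": {".": 1.0},
    ".": {EOS: 1.0},
}


def toy_step(tokens: List[str]) -> Dict[str, float]:
    # The toy table needs a non-empty prefix to look up the last token.
    return TOY_TABLE[tokens[-1]]


def greedy_continue(step: StepFn, prefix: List[str], max_len: int = 20) -> List[str]:
    """Autocompletion: keep the prefix as given, then decode greedily until EOS."""
    tokens = list(prefix)
    while len(tokens) < max_len:
        scores = step(tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens


def alternatives(step: StepFn, prefix: List[str], n: int, max_len: int = 20) -> List[List[str]]:
    """Expand the n most likely next tokens after the prefix, then finish
    each of the resulting hypotheses independently."""
    scores = step(list(prefix))
    starts = sorted(scores, key=scores.get, reverse=True)[:n]
    return [
        greedy_continue(step, list(prefix) + [tok], max_len)
        for tok in starts
        if tok != EOS
    ]


if __name__ == "__main__":
    prefix = ["▁The", "▁crisis"]
    print(greedy_continue(toy_step, prefix))
    for hyp in alternatives(toy_step, prefix, n=2):
        print(hyp)
```

Each alternative starts from a different next token, so the starting points are unique, while the rest of each hypothesis is decoded as usual.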
Many thanks again for your great help! This was really useful.
I noticed that DeepL offers word-level suggestions that can be two or three words long instead of one, e.g. “has deepened”, “deepened”, “has increased”, “increased”, etc. To me, this does not look like simply slicing the sentence-level translation alternatives; I suspect they might be using word alignment. Is there any logic that can be applied directly with OpenNMT/CTranslate2 at decoding time to get this?
Actually the alternatives in CTranslate2 are not restricted to a single word. The current method just finishes the translation using different subtokens as starting points.
For example, let’s say that at a specific position the 2 most likely candidates are:
▁has
▁inc
then decoding could complete the translation like this:
▁has ▁inc reased .
▁inc reased .
So the first completion is effectively a 2-word alternative, but of course CTranslate2 does not know what a word is. You should add this logic on top of the library and delimit where the alternative expressions end.
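One possible way to add that logic on top, sketched below under stated assumptions: subtokens use the SentencePiece-style `▁` word marker as in the example, and an alternative expression is taken to end where the completion rejoins the original translation (longest common suffix of words). Both helper names, `to_words` and `alternative_span`, are hypothetical, and the suffix heuristic is only one of several ways to delimit the expression.

```python
from typing import List

WORD_START = "▁"  # SentencePiece-style marker, as in the example above


def to_words(subtokens: List[str]) -> List[str]:
    """Merge subtokens into words: a token starting with the marker opens a
    new word, any other token is glued to the previous word."""
    words: List[str] = []
    for tok in subtokens:
        if tok.startswith(WORD_START):
            words.append(tok[len(WORD_START):])
        elif words:
            words[-1] += tok
        else:
            words.append(tok)
    return words


def alternative_span(original: List[str], alternative: List[str]) -> List[str]:
    """Return the leading words of `alternative` that differ from `original`,
    i.e. everything before their longest common suffix."""
    common = 0
    while (
        common < len(original)
        and common < len(alternative)
        and original[-1 - common] == alternative[-1 - common]
    ):
        common += 1
    return alternative[: len(alternative) - common]


if __name__ == "__main__":
    tail = ["▁pre", "-", "existing", "▁inequalities", "."]
    original = to_words(["▁has", "▁deep", "ened"] + tail)
    for completion in (["▁has", "▁inc", "reased"] + tail, ["▁inc", "reased"] + tail):
        print(" ".join(alternative_span(original, to_words(completion))))
```

Both inputs here are the parts of the translations that follow the shared target prefix. If an alternative never rejoins the original wording, the whole completion is returned, so a cap on the number of words may be needed in practice.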
However, your example shows something we do not currently do, which is to output multiple alternatives starting with the same word (e.g. “has”). The initial starting points are unique. But you could detect that words like “has” can lead to other alternatives, and then include “has” in the target prefix to get them.
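The workaround just described could look roughly like the sketch below. `get_alternatives` is a hypothetical callable standing in for whatever returns alternative completions for a given target prefix (for instance a wrapper around a CTranslate2 call); the fake backend exists only to make the example runnable. For simplicity it moves a single subtoken into the prefix, whereas a real version would move a whole word.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: target prefix -> list of completions (tokens after the prefix).
GetAlternatives = Callable[[List[str]], List[List[str]]]


def alternatives_sharing_first_token(
    get_alternatives: GetAlternatives, prefix: List[str]
) -> Dict[str, List[List[str]]]:
    """For each top-level alternative, move its first token into the prefix and
    query again, collecting several alternatives that start with that token."""
    grouped: Dict[str, List[List[str]]] = {}
    for completion in get_alternatives(prefix):
        if not completion:
            continue
        first = completion[0]
        nested = get_alternatives(prefix + [first])
        grouped[first] = [[first] + rest for rest in nested]
    return grouped


# Fake backend, for illustration only.
FAKE: Dict[Tuple[str, ...], List[List[str]]] = {
    ("▁crisis",): [["▁has", "▁deep", "ened"], ["▁inc", "reased"]],
    ("▁crisis", "▁has"): [["▁deep", "ened"], ["▁inc", "reased"]],
    ("▁crisis", "▁inc"): [["reased"]],
}


def fake_get_alternatives(prefix: List[str]) -> List[List[str]]:
    return FAKE.get(tuple(prefix), [])


if __name__ == "__main__":
    result = alternatives_sharing_first_token(fake_get_alternatives, ["▁crisis"])
    for first, completions in result.items():
        print(first, completions)
```

This costs one extra decoding call per starting point, so it may be worth running it only for the starting point the user actually clicks on.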