Modifying the Decoder

I would like to try modifying the decoder to accept partially translated inputs and let it ‘complete’ the translation using what’s provided. This should also be helpful for human-in-the-loop machine translation.

For example, for FR-EN: if I have “Bonjour, je suis James.”, a FR-EN model might translate it as “Hello, I’m James.”

However, I might want a more casual translation and have it be “Hi, I’m James.” (Ignore the translation accuracy; this is just an example for illustration.)

For Transformers, phrase tables don’t seem to work well: even when trained with alignments, attention tends to span multiple subwords or even whole words, so selecting the token with the maximum attention often does a poor job. You can’t simply find the source terms from a phrase table and replace the aligned target terms with the defined values.

I’m wondering if I could include the phrase-table entry (Bonjour|||Hi) in the decoding process.
So the model would see, as src,
Bonjour. Bonjour, je suis James.
and in the decoding process, we can insert Hi. right after the <bos> token and let the decoder carry on decoding. Assuming it sees Hi, the attention on Hi might be ‘enough’ to nudge the translation of Bonjour to Hi instead of Hello. Of course, this is just a simple example.
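
In other words, the decoding loop would be seeded with the forced target tokens before normal decoding takes over. A conceptual sketch of the idea (encode and decode_step here are hypothetical placeholders, not a real library API):

```python
# Conceptual sketch: greedy decoding seeded with a forced target prefix.
# `model.encode` and `model.decode_step` are hypothetical placeholders.

def translate_with_prefix(model, src_tokens, forced_prefix,
                          bos="<bos>", eos="<eos>", max_len=128):
    memory = model.encode(src_tokens)   # encoder states for the source
    target = [bos] + forced_prefix      # e.g. ["<bos>", "Hi", "."]
    while len(target) < max_len:
        # Pick the most likely next token given the source and the
        # target tokens generated (or forced) so far.
        next_token = model.decode_step(memory, target)
        if next_token == eos:
            break
        target.append(next_token)
    return target[1:]                   # drop <bos>
```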

It’s an attempt to make translations more consistent in document-level translation. Instead of a particular term being translated to various synonyms, or a name being rendered with variations (Madalene, Madeline, etc.) because of a lack of fine-tuning (a phrase table might only contain a few lines), this should help boost consistency.

Any ideas on how I could go about doing this?

Hi James!

If I understood your question correctly, prefix-constrained decoding is already implemented in CTranslate2, as explained here.

You can find an example here:

I have applied it here to my French-to-English model. If you click a word, you get suggestions; if you select a suggestion, the translation will complete accordingly.
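
As a rough idea, here is a minimal sketch of passing a target prefix to CTranslate2’s translate_batch. The model and SentencePiece file paths are placeholders; adapt them to your setup:

```python
import ctranslate2
import sentencepiece as spm

# Placeholder paths: a converted CTranslate2 model plus the source- and
# target-side SentencePiece models used to train it.
translator = ctranslate2.Translator("fren_ctranslate2/", device="cpu")
sp_source = spm.SentencePieceProcessor()
sp_source.load("source.model")
sp_target = spm.SentencePieceProcessor()
sp_target.load("target.model")

source = sp_source.encode("Bonjour, je suis James.", out_type=str)
# The prefix is decoder input, so it must be tokenized with the
# target-side model.
prefix = sp_target.encode("Hi,", out_type=str)

results = translator.translate_batch([source], target_prefix=[prefix])
print(sp_target.decode(results[0].hypotheses[0]))
```

The decoder is forced to start with the prefix tokens and completes the rest of the translation on its own.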

If you rather mean forcing a word in the middle of the output, you can find this feature in FairSeq and Sockeye, but in my experience it does not work as well as prefix-constrained decoding.

All the best,
Yasmin

Yes, this seems to be what I want.

```python
prefix_phrases_tok = tokenize(prefix_phrases, sp_source_model)
```

Is there a problem with this line? Should it be tokenizing using sp_target_model instead?

EDIT: I’ve gotten it to work. I’m seeing some promising results. Thanks!

Yes, you are right. I corrected it. Thanks!
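
For reference, the corrected line then tokenizes the prefix phrases with the target-side model, since they are decoder input:

```python
prefix_phrases_tok = tokenize(prefix_phrases, sp_target_model)
```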