Modifying the Decoder

JOHW85 · February 22, 2022, 3:57pm

I would like to try modifying the decoder by accepting partial translated inputs, and let the decoder ‘complete’ the translation using what’s provided. This should be helpful in Human-in-the-loop machine translations as well.

For example, for FR-EN,
If I have “Bonjour, je suis James.”, a FR-EN model might translate it as “Hello, I’m James.”

However, I might want a more casual translation, and have it “Hi, I’m James.” (Ignore the translation accuracy. Just an example for illustration)

For Transformers, phrase tables don’t seem to work well. Even when the model is trained with alignment, attention tends to span multiple subwords or even words, so selecting the token with the maximum attention is often unreliable. You can’t simply find the source terms listed in a phrase table and replace the aligned target terms with the defined values.

I’m wondering if I could include the phrase table (Bonjour|||Hi) into the decoding process.
So the model will see as src,
Bonjour. Bonjour, je suis James
and in the decoding process, we can insert “Hi.” right after the <bos> token and let the decoder carry out the rest of the decoding. Once the decoder has seen “Hi”, its attention on that token might be ‘enough’ to nudge the translation of “Bonjour” to “Hi” instead of “Hello”. Of course, this is just a simple example.
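The idea above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names `parse_phrase_table` and `build_constrained_input` are hypothetical, matching is a naive case-insensitive substring check, and the trailing period after each term simply mirrors the “Bonjour. Bonjour, je suis James” example.

```python
from typing import Dict, Iterable, Tuple


def parse_phrase_table(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'source|||target' lines into a dict (hypothetical helper)."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or "|||" not in line:
            continue  # skip blank or malformed lines
        src, tgt = line.split("|||", 1)
        table[src.strip()] = tgt.strip()
    return table


def build_constrained_input(source: str, table: Dict[str, str]) -> Tuple[str, str]:
    """Return (augmented_source, target_prefix) for one sentence.

    Every phrase-table entry found in the source is prepended to the source,
    and its target side is collected into the prefix that would be forced
    on the decoder right after <bos>.
    """
    src_terms, tgt_terms = [], []
    for src_term, tgt_term in table.items():
        # Naive matching: case-insensitive substring, no tokenization.
        if src_term.lower() in source.lower():
            src_terms.append(src_term)
            tgt_terms.append(tgt_term)
    if not src_terms:
        return source, ""  # nothing to constrain
    augmented = " ".join(f"{t}." for t in src_terms) + " " + source
    prefix = " ".join(f"{t}." for t in tgt_terms)
    return augmented, prefix


table = parse_phrase_table(["Bonjour|||Hi"])
src, prefix = build_constrained_input("Bonjour, je suis James", table)
print(src)     # Bonjour. Bonjour, je suis James
print(prefix)  # Hi.
```

Whether a model actually benefits from the prepended terms is an open question here; the sketch only prepares the inputs.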

It’s an attempt to make translations more consistent in document-level translation. Without fine-tuning (a phrase table might only contain a few lines), a model may translate a particular term into various synonyms, or a name into several variants (Madalene, Madeline, etc.). This approach should help boost consistency.

Any ideas on how I could go about doing this?

ymoslem · February 23, 2022, 1:53am

Hi James!

If I understood your question correctly, prefix-constrained decoding is already implemented in CTranslate2 as explained here.

You can find an example here:

https://gist.github.com/ymoslem/9784d1c2d2b67320b007838a6c643554

CTranslate2-example-adv.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import sentencepiece as spm
import ctranslate2


def tokenize(text, sp_source_model):
    sp = spm.SentencePieceProcessor(sp_source_model)
    tokens = sp.encode(text, out_type=str)

This file has been truncated.

I have applied it here to my French-to-English model. If you click a word, you get suggestions. If you select a suggestion, the rest of the translation is completed accordingly.
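To make the mechanism concrete, here is a toy sketch of prefix-constrained greedy decoding. It is not CTranslate2’s implementation: the “model” is a hypothetical bigram table (`TOY_BIGRAMS`) that ignores the source sentence entirely, and the function name `greedy_decode_with_prefix` is made up for illustration. The point is only that the prefix tokens are forced first, and the decoder then continues from them as if it had generated them itself.

```python
from typing import Callable, Dict, List

BOS, EOS = "<s>", "</s>"

# Hypothetical toy "model": next-token scores depend only on the last token.
TOY_BIGRAMS: Dict[str, Dict[str, float]] = {
    "<s>": {"Hello": 0.7, "Hi": 0.3},
    "Hello": {",": 1.0},
    "Hi": {",": 1.0},
    ",": {"I'm": 1.0},
    "I'm": {"James": 1.0},
    "James": {".": 1.0},
    ".": {"</s>": 1.0},
}


def toy_scores(history: List[str]) -> Dict[str, float]:
    """Return next-token scores given the tokens decoded so far."""
    return TOY_BIGRAMS[history[-1]]


def greedy_decode_with_prefix(
    next_token_scores: Callable[[List[str]], Dict[str, float]],
    prefix: List[str],
    max_len: int = 20,
) -> List[str]:
    """Force `prefix` after <s>, then continue greedily until </s> or max_len."""
    output = [BOS] + list(prefix)  # step 1: the prefix is not scored, just forced
    while len(output) - 1 < max_len:
        scores = next_token_scores(output)  # step 2: condition on everything so far
        token = max(scores, key=scores.get)
        if token == EOS:
            break
        output.append(token)
    return output[1:]  # drop <s>


print(greedy_decode_with_prefix(toy_scores, []))      # ['Hello', ',', "I'm", 'James', '.']
print(greedy_decode_with_prefix(toy_scores, ["Hi"]))  # ['Hi', ',', "I'm", 'James', '.']
```

A real decoder would also attend to the encoded source at every step and would typically use beam search rather than greedy selection; the forcing step is the same idea.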

If you instead mean forcing a word in the middle of the translation, you can find this feature in FairSeq and Sockeye, but in my experience it does not work as well as prefix-constrained decoding.

All the best,
Yasmin

JOHW85 · February 23, 2022, 2:12am

Yes, this seems to be what I want.

prefix_phrases_tok = tokenize(prefix_phrases, sp_source_model)

Is there a problem with this line? Should it be tokenized using sp_target_model instead?

EDIT: I’ve gotten it to work. I’m seeing some promising results. Thanks!

ymoslem · February 23, 2022, 8:35am

Yes, you are right. I corrected it. Thanks!
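For readers who land here later, a minimal sketch of the corrected flow follows. It is not the gist itself: the function names `load_models` and `translate_with_prefix` and all paths are hypothetical, and it assumes a converted CTranslate2 model plus separate SentencePiece models for source and target. The key point from the exchange above is that the prefix is target-language text, so it is tokenized with the target model.

```python
def load_models(ct2_model_dir, sp_source_path, sp_target_path):
    """Load the translator and both tokenizers (paths are placeholders)."""
    # Imported lazily so the rest of the sketch can be read and tested
    # without the libraries installed.
    import ctranslate2
    import sentencepiece as spm

    translator = ctranslate2.Translator(ct2_model_dir, device="cpu")
    sp_source = spm.SentencePieceProcessor(model_file=sp_source_path)
    sp_target = spm.SentencePieceProcessor(model_file=sp_target_path)
    return translator, sp_source, sp_target


def translate_with_prefix(translator, sp_source, sp_target,
                          source_text, prefix_text, beam_size=4):
    """Translate `source_text`, forcing the output to start with `prefix_text`."""
    # Source text uses the source-side SentencePiece model.
    source_tokens = sp_source.encode(source_text, out_type=str)
    # The prefix is target-language text, so it must use the target-side model.
    prefix_tokens = sp_target.encode(prefix_text, out_type=str)

    results = translator.translate_batch(
        [source_tokens],
        target_prefix=[prefix_tokens],
        beam_size=beam_size,
    )
    # Take the best hypothesis; to my understanding it includes the forced prefix.
    return sp_target.decode(results[0].hypotheses[0])
```

Usage would look something like `translator, sp_src, sp_tgt = load_models("model_dir/", "source.model", "target.model")` followed by `translate_with_prefix(translator, sp_src, sp_tgt, "Bonjour, je suis James.", "Hi")`, with the paths replaced by your own.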