Using features for domain/client/subject adaptation


I was wondering if it would be possible to make the model choose the correct translation based on features.

For example, lets say I have 5 different texts for training an MT system, where 2 of them are domain-specific, so some special words/terms have different translations in these two texts. If I tag all the words in the 3 generic texts with a feature |GENERIC and all the words in the 2 domain-specific texts with a feature |SPECIFIC, will the resulting system be able to use the correct translation depending on the feature provided in the source sentence? Also, would it be necessary to tag all the words or a simple sentence marker could also work?

In that case, I suppose the features must be added in the source side for training.


Here is a relevant paper:

The paper shows a consistent gain when tagging all words over using a single marker.

1 Like

Hi @guillaumekln,

Great, thank you!

One question before I start a new feature request topic :slight_smile: :

Currently, the rest_translation_server only understands the -case_feature option and calls the case.lua script, right? I mean, the only feature it can add automatically to the source text is the case.

That is correct.

Thanks @guillaumekln.

I read the paper, very interesting and promising so I’ll plan to train a model like this. Few more questions about it, hopefully @jean.senellart can also jump in the discussion to clarify, although my questions could be considered naive as they emerge mainly from a user’s perspective :slight_smile:
At Section 4.2, it is mentioned

It takes about 10 days
to train models on the complete training data set
(4, 3M sentence pairs)

I guess it took this long to train all the systems mentioned in the paper, right?

Also, what about using multiple features to fine-tune the domain adaptation? I mean building an hierachical dependency of domains/subject, something like this:


  • Programming_Languages
    and then embedding these features with the correct hierarchy, e.g

Open|Informatics|Operating_Systems|Linux a|Informatics|Operating_Systems|Linux terminal|Informatics|Operating_Systems|Linux

Would the system be able to learn that deep or would something like this introduce too much noise?

Maybe what I’m proposing is a simplistic implementation of this idea mentioned in the “Conclusions” section of the paper:

We plan to further improve the feature technique
detailed in this work. Rather than providing
the network with a hard decision about domain,
we want to introduce a vector of distance values
of the given source sentence to each domain, thus
allowing to smooth the proximity of each sentence
to each domain

It simply means that one training takes 10 days.

You can put whatever you want as a word feature as long as every words have the same number of them and in the same order. But it’s difficult to know whether the system will make good use of these additional information. It requires experimentation.