Using custom tags for domain adaptation

Hi everyone,

I would like to ask you about a specific use case of domain adaptation. Our standard approach is to finetune generic models, which works quite well. However, we would like our engines to further differentiate between two specific subdomains. In this case, it would not be very efficient to create additional finetuned models for those subdomains, so we are considering other approaches, such as using tags in the data to make the model aware of the subdomain when there is one. In that case, I assume we would eventually need to inject those custom tags at inference time, when the subdomain is known.
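
To make the idea concrete, this is roughly the kind of preprocessing I have in mind; the tag strings and subdomain names below are just placeholders, and the exact format is still an open question:

```python
from typing import Optional

# Hypothetical sketch of the data preparation we are considering.
# The tag format ("<subdomainA>") is a placeholder, not a settled choice.
def tag_source(line: str, subdomain: Optional[str]) -> str:
    if subdomain is None:
        return line  # sentences with no known subdomain stay untagged
    return f"<{subdomain}> {line}"

print(tag_source("The patient was discharged.", "subdomainA"))
# -> <subdomainA> The patient was discharged.
```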

Has anyone tried this approach with OpenNMT-tf? Would adding a custom tag to the start of the examples work, or should we consider something more aggressive? And, in cases where no tags are added at inference time, would quality be impacted, given that the model has been partially trained with tags?

At the moment, we are successfully using a TransformerBig model with aggressive pretokenisation and BPE. In-domain volumes vary from 400k to 800k sentence pairs. For decoding, we use CTranslate2 engines.
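
For reference, our setup corresponds roughly to the following (the BPE model path is a placeholder):

```python
import opennmt
import pyonmttok

# Rough sketch of our current setup: aggressive pretokenisation + BPE,
# with a TransformerBig from the OpenNMT-tf model catalog.
tokenizer = pyonmttok.Tokenizer(
    "aggressive",
    bpe_model_path="bpe.model",  # placeholder path
    joiner_annotate=True,
)
model = opennmt.models.TransformerBig()
```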

I guess this requires experimentation above all, but any hint on how to start or where to look will be greatly appreciated!


Hello,

Putting a tag at either the start or the end of each sentence works just fine. There are various techniques with tokens, but this seems the most hassle-free and effective. Although you have probably read some related papers, I was inspired by Domain Control for Neural Machine Translation. A more recent one is Multi-Domain Adaptation in Neural Machine Translation Through Multidimensional Tagging (arXiv:2102.10160).
I’ve been using this technique with amazing results, as it offers a very high degree of terminology and style control. The downside is that the model loses some of its capability to generalize freely, and that every sentence needs to carry the token, both during training and during inference. It’s been a while since I experimented with a partially annotated corpus, so I’ll take this opportunity to give it another try.
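
For inference, I simply inject the domain token into the tokenized source before translating. With CTranslate2, it looks roughly like this (the paths, tokenizer options, and the ｟legal｠ tag are placeholders for whatever your setup uses):

```python
import ctranslate2
import pyonmttok

# Placeholder paths and tag name; the important part is injecting the
# domain token at the same position it had in the training data.
tokenizer = pyonmttok.Tokenizer("aggressive", bpe_model_path="bpe.model", joiner_annotate=True)
translator = ctranslate2.Translator("ct2_model/")

tokens, _ = tokenizer.tokenize("The parties agree to the following terms.")
tokens = ["｟legal｠"] + tokens  # prepend the domain token

result = translator.translate_batch([tokens])
print(tokenizer.detokenize(result[0].hypotheses[0]))
```
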
If possible, please report back your results :).


Thanks a lot for the hints, @panosk.

It’s good to know that you managed to apply this technique successfully. Both papers are insightful. For anyone interested in the topic, I also found helpful tips in this paper: Multi-Domain Neural Machine Translation.

Just one last question. It might be a silly one, but I wonder what the format of this tag should be. I suppose I can use the tokenizer’s placeholder markers, such as ｟subdomain｠, so that the tag is not segmented by the OpenNMT tokenizer. Is this the correct approach, or is there a specific symbol for tagging whole sequences?

Yes, exactly. Especially if you use OpenNMTTokenizer, this is the suggested way, as it also ensures consistency with the other special symbols (placeholders) used by the tokenizer.
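
As a quick sanity check, you can verify with pyonmttok that the tag comes out as a single token:

```python
import pyonmttok

# The ｟...｠ markers delimit protected sequences, so the tag should
# survive tokenization as one token (BPE omitted for brevity).
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)
tokens, _ = tokenizer.tokenize("｟subdomain｠ Hello world!")
print(tokens)  # the first token should be the intact ｟subdomain｠
```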

Hi all,

It took a bit of time, but I was eventually able to experiment with this technique properly. With everything else being exactly the same (parameters, data, steps, etc.), models with tags score on average around 1 BLEU point higher than models without (scores compared via bootstrap resampling). Also, as one could expect, shorter sentences benefit much more from the extra context; longer sentences show similar scores.
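
For anyone who wants to reproduce the comparison: newer versions of sacreBLEU also ship a paired bootstrap test, but a manual resampling loop along these lines does the job (file names are placeholders; one sentence per line, same order in all files):

```python
import random
import sacrebleu

# Placeholder file names for the reference and the two systems' outputs.
refs = [l.strip() for l in open("ref.txt", encoding="utf-8")]
hyp_tagged = [l.strip() for l in open("hyp_tagged.txt", encoding="utf-8")]
hyp_plain = [l.strip() for l in open("hyp_plain.txt", encoding="utf-8")]

n_samples, wins = 1000, 0
for _ in range(n_samples):
    idx = random.choices(range(len(refs)), k=len(refs))  # resample with replacement
    sample_refs = [[refs[i] for i in idx]]
    tagged = sacrebleu.corpus_bleu([hyp_tagged[i] for i in idx], sample_refs).score
    plain = sacrebleu.corpus_bleu([hyp_plain[i] for i in idx], sample_refs).score
    if tagged > plain:
        wins += 1

print(f"tagged model wins in {wins / n_samples:.1%} of resamples")
```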

So, except for the fact that this technique is a bit cumbersome to put into practice, I think it is worth trying, especially in multi-domain scenarios where terminology is important.

By the way, for those who want to try: I found that prepending the domain token to the source sentences, with no space between the token and the sentence, produces slightly better results for some reason.
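
To be explicit, by “no space” I mean:

```python
source = "The patient was discharged."

with_space = "｟subdomain｠ " + source  # what I tried first
no_space = "｟subdomain｠" + source     # slightly better in my tests
```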

Many thanks, @panosk, for the guidance!
