Today there is this great feature which makes it possible to COPY an unknown word, or to look it up in a phrase table.
It would be even better to have a similar mechanism for a DO_NOT_TRANSLATE list.
Even though it is not necessarily an issue in a standard workflow, with BPE it happens very often that some words get BPE-tokenized at inference, then translated in a funny way and reassembled in an even funnier way.
Especially for abbreviations, or for some specific terminology, it would be great to protect some words (from a list or phrase table) so that they are copied verbatim into the target.
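This kind of protection is usually done outside the model, in pre/post-processing. A minimal sketch of the idea (the DNT list, placeholder format, and function names here are my own assumptions, not an OpenNMT feature):

```python
# Hypothetical do-not-translate list; in practice loaded from a file.
# The placeholders must survive tokenization/BPE unsplit (e.g. be added
# to the vocabulary or otherwise protected) for this to work end to end.
DNT = ["GDPR", "OpenNMT"]

def protect(sentence, dnt=DNT):
    """Replace DNT terms with numbered placeholders before tokenization/BPE."""
    mapping = {}
    for i, term in enumerate(dnt):
        placeholder = f"DNT{i}"
        if term in sentence:
            mapping[placeholder] = term
            sentence = sentence.replace(term, placeholder)
    return sentence, mapping

def restore(translation, mapping):
    """Put the original terms back after translation."""
    for placeholder, term in mapping.items():
        translation = translation.replace(placeholder, term)
    return translation

masked, mapping = protect("OpenNMT supports GDPR texts")
# masked == "DNT1 supports DNT0 texts"
restored = restore(masked, mapping)
# restored == "OpenNMT supports GDPR texts"
```

Since the model only ever sees the placeholder token, the protected term can never be BPE-split or "translated in a funny way".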
It would also be great to have the -phrase_table feature work with multiword phrases on both the source and target sides, so we can fix the translation of multiword terminology entries.
I guess that the phrase table with multi-word entries exists now, right? It seems that I do get some replacements of more than one word now. However, this has not yet been documented in the Quick Start, right? http://opennmt.net/OpenNMT/translation/unknowns/
Replacing one word with multiple words appears to work in the current implementation when displaying the translation, because the space character is the token separator.
However, this generates an invalid state where the translator returns a token that actually contains multiple tokens.
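To make that behaviour concrete, here is a rough sketch of how -replace_unk with -phrase_table behaves (simplified and illustrative; the names and structure are mine, not the actual translate.lua code):

```python
# One-to-many target entry: the value contains spaces.
phrase_table = {"directieraad": "board of management"}

def replace_unk(target_tokens, src_tokens, attention):
    """For each <unk> target token, copy or look up the most-attended source token."""
    out = []
    for i, tok in enumerate(target_tokens):
        if tok == "<unk>":
            # pick the source position with the highest attention weight
            src_pos = max(range(len(src_tokens)), key=lambda j: attention[i][j])
            src_tok = src_tokens[src_pos]
            # fall back to copying the source token if it is not in the table
            out.append(phrase_table.get(src_tok, src_tok))
        else:
            out.append(tok)
    # joining with spaces is why a multi-word replacement "works" at display
    # time, even though one list element now holds several tokens
    return " ".join(out)

src = ["de", "directieraad", "besluit"]
tgt = ["the", "<unk>", "decides"]
attn = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
print(replace_unk(tgt, src, attn))  # the board of management decides
```

The invalid state mentioned above is visible here: before joining, the list element "board of management" is a single "token" that actually contains three.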
From the first part of your answer I see why one-to-many phrase-table replacement does work when translating (th translate.lua -replace_unk true -model -phrase_table phrase-table.txt …).
I also plan to test multiple to one/multiple to multiple words.
Btw, I noticed that some entries from the phrase table are selected but others are not (in which case some words remain as unknowns even though they do exist in the training data and/or in the phrase table). Does this have to do with the weights calculated during training?
Yes, perhaps I did not explain it well. What I meant was that I do replace the unknowns with the source word, but I still know that they are unknowns (as the source word appears instead).
OK, I understand. So, only 1-1 words are supported in the phrase table, as documented. But again, in our tests we added unknown words to the phrase table (1-1), and some were picked up in the translation while others were not, which seems inconsistent. As an example, we added two entries to the phrase table that do not exist in the training data: one was used and the other was not.
Does the use of phrase-table entries by the system depend on what was learnt during training, on the weights of translations, on the frequency of the words in the training corpus, etc.?
Perhaps it would be more efficient to translate the source training corpus, add all unknown words to the training data, and then retrain the engine from scratch and/or retrain the pre-trained model.
Hi, for what it's worth, I have been able to go from one to many by underscoring the individual tokens on the target side and removing the underscores in post-processing, e.g. directieraad|||board_of_management. In v0.7 I am able to "force" the translation of multi-word expressions in a pre-processing run, e.g. Directie Toezicht Energie|||Office_for_Energy_Supervision, and get that target translation in the OpenNMT output, but this doesn't work in v0.9 and I haven't had time to investigate why.
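The underscore trick described above can be sketched as a pair of pre/post-processing steps (a sketch under my own assumptions about the entry format, not the actual scripts used):

```python
# Multiword targets are stored as a single underscored token, so the model
# (or a forced pre-processing replacement) only ever handles one token; the
# underscores are removed after translation. Entries are illustrative.
terms = {"Directie Toezicht Energie": "Office_for_Energy_Supervision"}

def preprocess(sentence):
    """Force multiword source terms to their underscored single-token form."""
    for src_phrase, tgt_token in terms.items():
        sentence = sentence.replace(src_phrase, tgt_token)
    return sentence

def postprocess(translation):
    """Turn underscored tokens back into normal multiword phrases."""
    return " ".join(tok.replace("_", " ") for tok in translation.split())

pre = preprocess("De Directie Toezicht Energie publiceert")
# "De Office_for_Energy_Supervision publiceert"
post = postprocess("The Office_for_Energy_Supervision publishes")
# "The Office for Energy Supervision publishes"
```

Note that this naive postprocess strips every underscore, so it assumes underscores never appear legitimately in the target text.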
Many thanks Terence.
I had actually seen your previous comments in the forum for NL-EN. Very interesting.
From my side, I have managed to get some one-to-many substitutions with the phrase table, but as this option is biased by the training, I'd rather have it forced, either in a pre-processing phase (as you mentioned) or in a post-processing phase (after getting the raw NMT output, that is).
However, that sounds as if the model should already have learnt the constraints, and the constraint will only give a higher score to a hypothesis containing the constrained term. If none of the proposed translations (for a given n_best number of hypotheses) contains the constraint, then it won't work (and "DoNotTranslate" will be translated as, say, "NePasTraduire"). Right?
So, what is the current state of affairs to handle DNT/terminology with BPE-based ONMT (Lua) models?
I'm kind of aware, but the documentation isn't very clear. I've used my own implementation to pretty good success prior to this PR, but I'm indeed looking for something better now.
With regards to the documentation, for example: if I have 3 data corpora, and each corpus has a separate terminology list (lists that might have the same source term with different target translations), what do I specify in transforms:? Based on the FAQ ("The following options can be added to the main configuration (valid for all datasets using this transform)"), it doesn't seem possible. If I have a thousand data corpora with a thousand individual terminology lists, am I to create a thousand "custom" transforms?
At training time, we don't really care; we just want the model to learn that when it finds a source term it needs to replace it with the corresponding one in the target. You need sufficient data and examples so that the model learns this mechanic.
At inference, you don't have a choice: it is a one-to-one matching, so if you need to replace srcterm with tgtterm then you need to provide this combination given the context of what you are translating.
It is not related to the datasets but to what you are currently translating.
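To make the one-to-one matching at inference concrete, a minimal sketch of selecting the pairs relevant to the current input (the glossary contents and function name are illustrative, not OpenNMT's API):

```python
# All known terminology pairs, regardless of which training dataset or
# terminology list each one originally came from.
glossary = {
    "srcterm": "tgtterm",
    "directieraad": "board of management",
}

def relevant_terms(sentence, glossary=glossary):
    """One-to-one matching: keep only the entries whose source side occurs
    in the sentence being translated right now. Substring matching is used
    for brevity; a real implementation would match on token boundaries."""
    return {s: t for s, t in glossary.items() if s in sentence}

print(relevant_terms("de directieraad besluit"))
# {'directieraad': 'board of management'}
```

The point of the post above is visible here: the selection depends only on the sentence currently being translated, not on how the glossary is partitioned across datasets.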