Terminology handling

Today there is this great feature which makes it possible to COPY an unknown word, or to look it up in a phrase table.

It would be even better to have a similar mechanism for a DO_NOT_TRANSLATE list.

Even though it is not necessarily an issue in a standard workflow, with BPE it happens very often that some words get BPE-tokenized at inference, then translated in a funny way and reassembled in an even funnier way.

Especially for abbreviations, or some specific terminology, it would be great to protect some words (from a list or phrase table) so that they are copied to the target.
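In the meantime, a crude workaround along these lines could be to mask the protected terms with placeholder tokens before BPE/translation and restore them afterwards, so they can never be split or translated. Below is a minimal sketch of my own (not an OpenNMT feature; the placeholder format, the DNT list and the example sentence are made up, and the placeholder itself has to be protected from BPE, e.g. by adding it to the vocabulary):

    DNT_TERMS = ["OpenNMT", "BPE", "GmbH"]  # hypothetical DO_NOT_TRANSLATE list

    def mask_dnt(sentence, dnt_terms):
        """Replace each DNT term with a placeholder so BPE can never split it."""
        mapping = {}
        for i, term in enumerate(dnt_terms):
            if term in sentence:
                placeholder = f"@DNT{i}@"
                sentence = sentence.replace(term, placeholder)
                mapping[placeholder] = term
        return sentence, mapping

    def unmask_dnt(translation, mapping):
        """Put the original terms back into the model output, untouched."""
        for placeholder, term in mapping.items():
            translation = translation.replace(placeholder, term)
        return translation

    masked, mapping = mask_dnt("Die GmbH nutzt OpenNMT .", DNT_TERMS)
    print(masked)                       # Die @DNT2@ nutzt @DNT0@ .
    print(unmask_dnt(masked, mapping))  # Die GmbH nutzt OpenNMT .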

It would also be great to have the -phrase_table feature work with multiword phrases on both source and target, so we can fix the translation of multiword terminology entries.


The feature is coming through the grid beam search implementation that will be part of 0.9 - stay tuned.


I guess that the phrase table with multi-words exists now, right? It seems that I do get some replacements of more than one word now. However, this has not yet been documented in the Quick start, right? http://opennmt.net/OpenNMT/translation/unknowns/

Replacing one word with multiple words appears to work in the current implementation when you are displaying the translation. This is because the space character is the token separator.

However, this generates an invalid state where the translator returns a token that actually contains multiple tokens.
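To make the "invalid state" concrete, here is a tiny illustration (my own sketch; the token lists and the term pair are made up for the example): the multi-word replacement is pasted in as a single token, so the printed translation looks right, but re-splitting the output on spaces no longer gives the token sequence the translator actually produced.

    # Toy decoder output: one <unk> that -replace_unk will substitute.
    output_tokens = ["the", "<unk>", "meets", "today", "."]
    # Multi-word target taken from the phrase table for the aligned source word:
    replacement = "board of management"

    replaced = [replacement if t == "<unk>" else t for t in output_tokens]
    print(" ".join(replaced))               # the board of management meets today .
    print(len(replaced))                    # 5: still one "token" per decoder position
    print(len(" ".join(replaced).split()))  # 7: re-splitting on spaces gives a different count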

Thank you for the answer.

From the 1st part of your answer I see why using the phrase table to go from one word to multiple words does work when translating (th translate.lua -replace_unk true -model -phrase_table phrase-table.txt …).
I also plan to test multiple to one/multiple to multiple words.

Btw I noticed that some entries from the phrase_table are selected but others are not (in which case some words remain as unknowns even if they do exist in the train data and/or in the phrase_table). Does this have to do with the weights calculated in the training?

Regarding the 2nd part, do you refer to post-editing?

Thanks

As you are using -replace_unk, there should not be any <unk> token in the output.

No, I just meant that using the phrase table to inject multiple target tokens is unsupported and not meant to work.

  1. Yes, perhaps I did not explain it well. What I meant was that I do replace the <unk> with the source, but I still know that they are unknowns (as the source appears instead).

  2. OK, I understand. So, only 1-1 words are supported in the phrase table, as documented. But again, in our tests, we added unknown words to the phrase table (1-1) and some are picked up in the translation while others are not, which seems inconsistent. As an example, we added two entries to the phrase table that do not exist in the train data: one was used and the other was not.
    Does the use of phrase table entries by the system depend on the learning during training, on the weights of translations, on the frequency of the words in the train corpus, etc?
    Perhaps it would be more efficient to translate the source train corpus and add all unknown words in the train data and then retrain the engine from scratch and/or retrain the pre-trained model.

It depends on the training for these aspects (see the sketch below):

  • the model learned to produce an <unk> for this source token (which thus becomes a candidate for replacement by the phrase table)
  • the model learned to "align" the <unk> with this source token
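To make the mechanism concrete, here is a minimal sketch of how I understand the -replace_unk / -phrase_table logic (my own simplification, not the actual translate.lua code; the example sentence, attention matrix and term pair are made up): for every <unk> in the output, the source token with the highest attention weight is taken, and it is either looked up in the phrase table or copied as-is. If the model never emits an <unk> at that position, or attends to the wrong source token, the phrase table entry is simply never used, which would explain why some entries are picked and others are not.

    def replace_unk(output_tokens, source_tokens, attention, phrase_table):
        """For each <unk> in the output, pick the source token with the highest
        attention weight; replace the <unk> with its phrase-table translation if
        there is one, otherwise copy the source token.

        attention[i][j] = weight of source token j when producing output token i
        phrase_table    = dict built from a "src|||tgt" file
        """
        result = []
        for i, tok in enumerate(output_tokens):
            if tok == "<unk>":
                j = max(range(len(source_tokens)), key=lambda k: attention[i][k])
                src = source_tokens[j]
                result.append(phrase_table.get(src, src))  # lookup, else copy
            else:
                result.append(tok)
        return result

    # Toy example: the model produced an <unk> aligned to "directieraad".
    phrase_table = {"directieraad": "board of management"}
    src = ["de", "directieraad", "vergadert", "."]
    out = ["the", "<unk>", "meets", "."]
    att = [[0.90, 0.05, 0.03, 0.02],
           [0.10, 0.80, 0.05, 0.05],
           [0.05, 0.05, 0.85, 0.05],
           [0.02, 0.03, 0.05, 0.90]]
    print(replace_unk(out, src, att, phrase_table))
    # ['the', 'board of management', 'meets', '.']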

Hi, For what it's worth I have been able to go from one to many by underscoring the individual tokens on the target side and removing the underscores in post-processing, e.g. directieraad|||board_of_management. In v07 I am able to "force" the translation of multi-word expressions in a pre-processing run, e.g. Directie Toezicht Energie|||Office_for_Energy_Supervision and get that target translation in the OpenNMT output, but this doesn't work in v9 and I haven't had time to investigate why.
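For completeness, the post-processing step is only a few lines; a sketch (the function name is made up), restricted to the known joined targets so legitimate underscores elsewhere are left alone:

    def restore_multiword_terms(translation, joined_targets):
        """Turn underscore-joined phrase-table targets back into normal
        multi-word phrases after translation."""
        for joined in joined_targets:
            translation = translation.replace(joined, joined.replace("_", " "))
        return translation

    print(restore_multiword_terms(
        "the board_of_management meets today .",
        ["board_of_management", "Office_for_Energy_Supervision"]))
    # the board of management meets today .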

Many thanks Terence.
I had actually seen your previous comments in the forum for NL-EN. Very interesting.
From my side, I have managed to get some one-to-many substitutions with the use of the phrase table, but as this option is biased by the training, I'd rather have it forced, either in a pre-processing phase (as you mentioned) or in a post-processing phase (after getting the raw NMT output, that is).

While reading https://arxiv.org/pdf/1704.07138.pdf and seeing https://github.com/chrishokamp/constrained_decoding, I feel that lexically constrained GBS can handle terminology even when a model is trained with BPE. Is that right?

However, that sounds as if the model should already have learnt the constraints, and the constraint will only give a higher score to a hypothesis containing the constrained term. If none of the proposed translations (for a given n_best amount of hypotheses) contain the constraint, then it won't work (and "DoNotTranslate" will be translated as, say, "NePasTraduire"). Right?
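For what it's worth, my reading of the paper is that the constraints are not merely rescored: the grid forces them into the hypotheses. Here is a toy sketch of the idea for a single constraint (my own simplification, not the constrained_decoding code; the vocabulary, dummy scoring function and constraint are made up): hypotheses are bucketed by how many constraint tokens they have already covered, and a constraint token can always be emitted regardless of its model score, so every hypothesis in the last row contains the full term.

    def toy_grid_beam_search(score_fn, vocab, constraint, max_len=7, beam_size=3):
        """Toy grid beam search with a single multi-token constraint.
        grid[(t, c)] holds (log_prob, tokens) hypotheses of length t that have
        covered c tokens of the constraint."""
        C = len(constraint)
        grid = {(0, 0): [(0.0, [])]}
        for t in range(1, max_len + 1):
            for c in range(0, min(t, C) + 1):
                cands = []
                # "open" expansion with any vocabulary token; only allowed when
                # we are not in the middle of the constraint (c == 0 or c == C)
                if c == 0 or c == C:
                    for lp, toks in grid.get((t - 1, c), []):
                        for w in vocab:
                            cands.append((lp + score_fn(toks, w), toks + [w]))
                # "constrained" expansion: emit the next constraint token and
                # move up one row, no matter how unlikely the model thinks it is
                if c > 0:
                    w = constraint[c - 1]
                    for lp, toks in grid.get((t - 1, c - 1), []):
                        cands.append((lp + score_fn(toks, w), toks + [w]))
                if cands:
                    grid[(t, c)] = sorted(cands, key=lambda x: -x[0])[:beam_size]
        # only hypotheses that covered the whole constraint are eligible outputs
        finished = grid.get((max_len, C), [])
        return max(finished, key=lambda x: x[0], default=None)

    def score(prefix, word):
        # dummy "model" that slightly prefers short tokens; a real system
        # would use the decoder's log-probabilities here
        return -0.1 * len(word)

    best = toy_grid_beam_search(score,
                                ["the", "committee", "meets", "today", "."],
                                ["board", "of", "management"])
    print(best)  # the winning hypothesis always contains "board of management",
                 # even though the dummy model would never propose those tokens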

So, what is the current state of affairs to handle DNT/terminology with BPE-based ONMT (Lua) models?

Hi,

Is there any such feature for OpenNMT-py and BPE? I can only find constrained lexical decoding with grid beam search in the Lua version.

Thanks

I've met with good results using this paper's instructions.

Basically you will need to do some preprocessing to your corpus to let it learn how to copy and use the terminology in the target sentence.
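For reference, this is roughly the kind of corpus preprocessing meant here, as a sketch (the tag tokens, the annotation scheme and the example entries are my own choice and not necessarily exactly what the paper does; the tags must be protected from BPE / added to the vocabulary): wherever a source sentence contains a term from the terminology list, the desired target term is injected next to it between special tokens, both at training time and at inference, so the model learns to copy the injected translation into its output.

    # Hypothetical special tokens marking an injected terminology translation.
    TERM_START, TERM_SEP, TERM_END = "<term>", "<trans>", "</term>"

    def annotate_terms(src_sentence, terminology):
        """Inject the desired target term after each matching source term."""
        out = []
        for tok in src_sentence.split():
            if tok in terminology:
                out += [TERM_START, tok, TERM_SEP, terminology[tok], TERM_END]
            else:
                out.append(tok)
        return " ".join(out)

    terminology = {"directieraad": "board of management"}
    print(annotate_terms("de directieraad vergadert vandaag .", terminology))
    # de <term> directieraad <trans> board of management </term> vergadert vandaag .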

You know there is a transform for this, right?

Kind of aware, but the documentation isn't very clear, and I've used my own implementation to pretty good success prior to this PR, but I'm indeed looking for something better now.

With regards to the documentation, for example, if I have 3 data corpuses, and each corpus has separate terminology lists (that might have the same source term with different translation targets), what do I specify in the transforms:? Based on the FAQ ("The following options can be added to the main configuration (valid for all datasets using this transform)"), it doesn't seem possible. If I have a thousand data corpuses with a thousand individual terminology lists, am I to create a thousand 'custom' transforms?

There are two aspects:

  1. At training time, we don't really care; we just want the model to learn that when it finds a source term it needs to replace it with the corresponding one in the target. You need sufficient data and examples so that the model learns this mechanic.

  2. At inference, you don't have a choice: it is a one-to-one matching, so if you need to replace srcterm by tgtterm then you need to provide this combination given the context of what you are translating (see the sketch below).
    It is not related to the datasets but to what you are currently translating.
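In other words, at inference time you build the one-to-one mapping yourself from whatever termbase applies to the text you are currently translating, not from the training corpora. A sketch of what I mean (my own illustration, not the transform's actual code; the termbase names and entries are made up):

    # One termbase per client/domain; you pick the right one per translation
    # job, not per training corpus.
    termbases = {
        "client_a": {"directieraad": "board of management"},
        "client_b": {"directieraad": "executive board"},
    }

    def terms_for_sentence(src_sentence, termbase):
        """Keep only the entries that actually occur in the sentence being
        translated: this is the one-to-one src/tgt combination to provide."""
        return {s: t for s, t in termbase.items() if s in src_sentence}

    sentence = "de directieraad vergadert vandaag ."
    print(terms_for_sentence(sentence, termbases["client_a"]))
    # {'directieraad': 'board of management'}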

hope this helps.
