Terminology handling


(Vincent Nguyen) #1

Today there is this great feature which makes it possible to COPY an unknown word, or to look it up in a phrase table.

It would be even better to have a similar mechanism for a DO_NOT_TRANSLATE list.

Even though it is not necessarily an issue in a standard workflow, with BPE it happens very often that some words get BPE-tokenized at inference, then translated in a funny way and reassembled in an even funnier way.

Especially for abbreviations or specific terminology, it would be great to protect some words (from a list or a phrase table) so that they are copied to the target.
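
As an illustration of what such protection could look like, here is a minimal pre-/post-processing sketch in Python. It is only a workaround sketch, not an existing OpenNMT option: the dnt_list, the DNTTERM placeholder convention and the assumption that the placeholder survives BPE and translation unchanged are all hypothetical.

    # Sketch of a DO_NOT_TRANSLATE workaround: swap protected terms for placeholder
    # tokens before BPE/translation, then restore them afterwards.
    # dnt_list and the placeholder convention are illustrative assumptions, and the
    # sketch assumes the placeholder comes back out of the model unchanged.
    dnt_list = ["NASA", "GDPR"]  # terms that must be copied verbatim

    def protect(sentence, dnt_list):
        """Replace each protected term with a numbered placeholder token."""
        mapping = {}
        for i, term in enumerate(t for t in dnt_list if t in sentence):
            placeholder = "DNTTERM%d" % i  # intended to remain a single token
            sentence = sentence.replace(term, placeholder)
            mapping[placeholder] = term
        return sentence, mapping

    def restore(translation, mapping):
        """Put the original terms back after translation."""
        for placeholder, term in mapping.items():
            translation = translation.replace(placeholder, term)
        return translation

    src, mapping = protect("The NASA report cites the GDPR", dnt_list)
    # ... run translate.lua on src here ...
    # out = restore(model_output, mapping)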


(Vincent Vandeghinste) #2

It would also be great to have the -phrase_table feature work with multi-word phrases on both the source and the target side, so we can fix the translation of multi-word terminology entries.


(jean.senellart) #3

The feature is coming through the grid beam search implementation that will be part of 0.9 - stay tuned.


(Anna Samiotou) #4

I guess that the phrase table with multi-word entries exists now, right? It seems that I do get some replacements of more than one word now. However, this has not yet been documented in the Quick start, right? http://opennmt.net/OpenNMT/translation/unknowns/


(Guillaume Klein) #5

Replacing one word with multiple words appears to work in the current implementation when you are displaying the translation. This is because the space character is the token separator.

However, this generates an invalid state where the translator returns a token that actually contains multiple tokens.
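
In illustrative Python (a sketch of the effect described above, not the actual Lua code):

    # Why a multi-word phrase table target "works" only once the tokens are joined
    # for display.
    phrase_table = {"directieraad": "board of management"}  # 1 source token -> 3 words

    output_tokens = ["the", "<unk>", "decided"]  # tokens produced by the model
    # -replace_unk puts the phrase table entry into the single <unk> slot:
    output_tokens[1] = phrase_table["directieraad"]

    print(" ".join(output_tokens))  # reads fine once joined on spaces
    print(len(output_tokens))       # still 3 "tokens", one of which hides two extra words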


(Anna Samiotou) #6

Thank you for the answer.

From the first part of your answer I see why using the phrase table for one-to-multiple word replacement does work when translating (the translate.lua -replace_unk true -model -phrase_table phrase-table.txt … command).
I also plan to test multiple-to-one and multiple-to-multiple word replacements.

Btw, I noticed that some entries from the phrase table are selected but others are not (in which case some words remain unknown even if they do exist in the training data and/or in the phrase table). Does this have to do with the weights calculated during training?

Regarding the second part, do you refer to post-editing?

Thanks


(Guillaume Klein) #7

As you are using -replace_unk, there should not be any <unk> token in the output.

No, I just meant that using the phrase table to inject multiple target tokens is unsupported and not meant to work.


(Anna Samiotou) #8
  1. Yes, perhaps I did not explain it well. What I meant was that I do replace the <unk> tokens with the source words, but I still know that they are unknowns (as the source appears instead).

  2. OK, I understand. So only 1-to-1 word entries are supported in the phrase table, as documented. But again, in our tests we added unknown words to the phrase table (1-to-1) and some are picked up in the translation while others are not, which seems inconsistent. As an example, we added two entries to the phrase table that do not exist in the training data: one was used and the other was not.
    Does the use of phrase table entries by the system depend on what was learned during training, on the translation weights, on the frequency of the words in the training corpus, etc.?
    Perhaps it would be more efficient to translate the source training corpus, add all unknown words to the training data, and then retrain the engine from scratch and/or continue training the pre-trained model.


(Guillaume Klein) #9

It depends on the training for these aspects (a rough sketch of the replacement step follows the list):

  • the model learned to produce a <unk> for this source token (and thus be a candidate for replacement by the phrase table)
  • the model learned to “align” the <unk> with this source token
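
In simplified Python, the replacement step roughly amounts to the following (the function name and signature are made up for illustration; the real logic lives in the Lua translator):

    # Simplified sketch of the documented -replace_unk behaviour: for every <unk>
    # the model emits, take the source token with the highest attention weight,
    # and return its phrase table entry if there is one, otherwise copy the
    # source token itself.
    def replace_unknowns(output_tokens, source_tokens, attention, phrase_table):
        result = []
        for t, token in enumerate(output_tokens):
            if token != "<unk>":
                result.append(token)
                continue
            src_index = max(range(len(source_tokens)), key=lambda s: attention[t][s])
            src_token = source_tokens[src_index]
            result.append(phrase_table.get(src_token, src_token))
        return result

So if the model never emits <unk> for a given term, or attends to the wrong source position at that step, the corresponding phrase table entry is simply never consulted, which would explain the apparent inconsistency.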

(Terence Lewis) #10

Hi, for what it’s worth, I have been able to go from one to many by underscoring the individual tokens on the target side and removing the underscores in post-processing, e.g. directieraad|||board_of_management. In v0.7 I am able to “force” the translation of multi-word expressions in a pre-processing run, e.g. Directie Toezicht Energie|||Office_for_Energy_Supervision, and get that target translation in the OpenNMT output, but this doesn’t work in v0.9 and I haven’t had time to investigate why.
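
A minimal Python sketch of that workaround (the term_map and function names are just illustrations; both steps are plain string substitutions):

    # Pre-processing: glue a multi-word source expression into a single token so a
    # 1-to-1 phrase table entry (or the model itself) can map it to one underscored
    # target token. Post-processing: turn the underscores back into spaces.
    term_map = {"Directie Toezicht Energie": "Directie_Toezicht_Energie"}

    def preprocess(sentence):
        for phrase, glued in term_map.items():
            sentence = sentence.replace(phrase, glued)
        return sentence

    def postprocess(translation):
        # naive: this also touches any legitimate underscores in the output
        return translation.replace("_", " ")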


(Anna Samiotou) #11

Many thanks Terence.
I had actually seen your previous comments in the forum for NL-EN. Very interesting.
From my side, I have managed to get some one-to-many substitutions with the use of the phrase table, but as this option is biased by the training, I’d rather have it forced, either in a pre-processing phase (as you mentioned) or in a post-processing phase (after getting the raw NMT output, that is).


(Wiktor Stribiżew) #12

While reading https://arxiv.org/pdf/1704.07138.pdf and seeing https://github.com/chrishokamp/constrained_decoding, I feel that lexically constrained GBS can handle terminology even when a model is trained with BPE. Is that right?

However, that sounds as if the model should already have learnt the constraints, and the constraint will only give a higher score to a hypothesis containing the constrained term. If none of the proposed translations (for a given n_best number of hypotheses) contains the constraint, then it won’t work (and “DoNotTranslate” will be translated as, say, “NePasTraduire”). Right?

So, what is the current state of affairs for handling DNT/terminology with BPE-based ONMT (Lua) models?