Today there is this great feature which makes it possible to COPY an unknown word, or to look it up in a phrase table.
It would be even better to have a similar mechanism for a DO_NOT_TRANSLATE list.
Even though it is not necessarily an issue in a standard workflow, with BPE it happens very often that some words get BPE-tokenized at inference, then translated in a funny way and reassembled in an even funnier way.
Especially for abbreviations, or for some specific terminology, it would be great to protect some words (from a list or phrase table) so that they are copied verbatim into the target.
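This kind of protection is usually done outside the model, in pre/post-processing. A minimal sketch of the idea (the DNT list, placeholder format, and function names here are my own assumptions, not an OpenNMT feature):

```python
# Hypothetical do-not-translate list; in practice loaded from a file.
# The placeholders must survive tokenization/BPE unsplit (e.g. be added
# to the vocabulary or otherwise protected) for this to work end to end.
DNT = ["GDPR", "OpenNMT"]

def protect(sentence, dnt=DNT):
    """Replace DNT terms with numbered placeholders before tokenization/BPE."""
    mapping = {}
    for i, term in enumerate(dnt):
        placeholder = f"DNT{i}"
        if term in sentence:
            mapping[placeholder] = term
            sentence = sentence.replace(term, placeholder)
    return sentence, mapping

def restore(translation, mapping):
    """Put the original terms back after translation."""
    for placeholder, term in mapping.items():
        translation = translation.replace(placeholder, term)
    return translation

masked, mapping = protect("OpenNMT supports GDPR texts")
# masked == "DNT1 supports DNT0 texts"
restored = restore(masked, mapping)
# restored == "OpenNMT supports GDPR texts"
```

Since the model only ever sees the placeholder token, the protected term can never be BPE-split or "translated in a funny way".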
It would also be great to have the -phrase_table feature work with multiword phrases on both the source and target sides, so we can fix the translation of multiword terminology entries.
I guess that the phrase table with multi-word entries exists now, right? It seems that I do get some replacements of more than one word now. However, this has not yet been documented in the Quick Start, right? http://opennmt.net/OpenNMT/translation/unknowns/
Replacing one word with multiple words appears to work in the current implementation when displaying the translation, because the space character is the token separator.
However, this generates an invalid state where the translator returns a token that actually contains multiple tokens.
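To make that behaviour concrete, here is a rough sketch of how -replace_unk with -phrase_table behaves (simplified and illustrative; the names and structure are mine, not the actual translate.lua code):

```python
# One-to-many target entry: the value contains spaces.
phrase_table = {"directieraad": "board of management"}

def replace_unk(target_tokens, src_tokens, attention):
    """For each <unk> target token, copy or look up the most-attended source token."""
    out = []
    for i, tok in enumerate(target_tokens):
        if tok == "<unk>":
            # pick the source position with the highest attention weight
            src_pos = max(range(len(src_tokens)), key=lambda j: attention[i][j])
            src_tok = src_tokens[src_pos]
            # fall back to copying the source token if it is not in the table
            out.append(phrase_table.get(src_tok, src_tok))
        else:
            out.append(tok)
    # joining with spaces is why a multi-word replacement "works" at display
    # time, even though one list element now holds several tokens
    return " ".join(out)

src = ["de", "directieraad", "besluit"]
tgt = ["the", "<unk>", "decides"]
attn = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
print(replace_unk(tgt, src, attn))  # the board of management decides
```

The invalid state mentioned above is visible here: before joining, the list element "board of management" is a single "token" that actually contains three.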
From the first part of your answer I see why one-to-many phrase-table replacement does work when translating (th translate.lua -replace_unk true -model -phrase_table phrase-table.txt …).
I also plan to test multiple to one/multiple to multiple words.
Btw, I noticed that some entries from the phrase table are selected but others are not (in which case some words remain as unknowns even though they do exist in the training data and/or in the phrase table). Does this have to do with the weights calculated during training?
Yes, perhaps I did not explain it well. What I meant was that I do replace the unknowns with the source word, but I still know that they are unknowns (as the source word appears instead).
OK, I understand. So, only 1-1 words are supported in the phrase table, as documented. But again, in our tests we added unknown words to the phrase table (1-1), and some were picked up in the translation while others were not, which seems inconsistent. As an example, we added two entries to the phrase table that do not exist in the training data: one was used and the other was not.
Does the use of phrase-table entries by the system depend on what was learnt during training, on the weights of translations, on the frequency of the words in the training corpus, etc.?
Perhaps it would be more efficient to translate the source training corpus, add all unknown words to the training data, and then retrain the engine from scratch and/or retrain the pre-trained model.
Hi, for what it's worth, I have been able to go from one to many by underscoring the individual tokens on the target side and removing the underscores in post-processing, e.g. directieraad|||board_of_management. In v0.7 I am able to "force" the translation of multi-word expressions in a pre-processing run, e.g. Directie Toezicht Energie|||Office_for_Energy_Supervision, and get that target translation in the OpenNMT output, but this doesn't work in v0.9 and I haven't had time to investigate why.
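The underscore trick described above can be sketched as a pair of pre/post-processing steps (a sketch under my own assumptions about the entry format, not the actual scripts used):

```python
# Multiword targets are stored as a single underscored token, so the model
# (or a forced pre-processing replacement) only ever handles one token; the
# underscores are removed after translation. Entries are illustrative.
terms = {"Directie Toezicht Energie": "Office_for_Energy_Supervision"}

def preprocess(sentence):
    """Force multiword source terms to their underscored single-token form."""
    for src_phrase, tgt_token in terms.items():
        sentence = sentence.replace(src_phrase, tgt_token)
    return sentence

def postprocess(translation):
    """Turn underscored tokens back into normal multiword phrases."""
    return " ".join(tok.replace("_", " ") for tok in translation.split())

pre = preprocess("De Directie Toezicht Energie publiceert")
# "De Office_for_Energy_Supervision publiceert"
post = postprocess("The Office_for_Energy_Supervision publishes")
# "The Office for Energy Supervision publishes"
```

Note that this naive postprocess strips every underscore, so it assumes underscores never appear legitimately in the target text.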
Many thanks Terence.
I had actually seen your previous comments in the forum for NL-EN. Very interesting.
From my side, I have managed to get some one-to-many substitutions with the phrase table, but as this option is biased by the training, I'd rather have it forced, either in a pre-processing phase (as you mentioned) or in a post-processing phase (after getting the raw NMT output, that is).
However, that sounds as if the model should already have learnt the constraints, and the constraint will only give a higher score to a hypothesis containing the constrained term. If none of the proposed translations (for a given n_best number of hypotheses) contains the constraint, then it won't work (and "DoNotTranslate" will be translated as, say, "NePasTraduire"). Right?
So, what is the current state of affairs to handle DNT/terminology with BPE-based ONMT (Lua) models?
I'm kind of aware, but the documentation isn't very clear. I've used my own implementation to pretty good success prior to this PR, but I'm indeed looking for something better now.
With regards to the documentation, for example: if I have 3 data corpora, and each corpus has a separate terminology list (lists that might have the same source term with different target translations), what do I specify in transforms:? Based on the FAQ ("The following options can be added to the main configuration (valid for all datasets using this transform)"), it doesn't seem possible. If I have a thousand data corpora with a thousand individual terminology lists, am I to create a thousand "custom" transforms?
At training time, we don't really care; we just want the model to learn that when it finds a source term it needs to replace it with the corresponding one in the target. You need sufficient data and examples so that the model learns this mechanic.
At inference, you don't have a choice: it is a one-to-one matching, so if you need to replace srcterm with tgtterm then you need to provide this combination given the context of what you are translating.
It is not related to the datasets but to what you are currently translating.
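To make the one-to-one matching at inference concrete, a minimal sketch of selecting the pairs relevant to the current input (the glossary contents and function name are illustrative, not OpenNMT's API):

```python
# All known terminology pairs, regardless of which training dataset or
# terminology list each one originally came from.
glossary = {
    "srcterm": "tgtterm",
    "directieraad": "board of management",
}

def relevant_terms(sentence, glossary=glossary):
    """One-to-one matching: keep only the entries whose source side occurs
    in the sentence being translated right now. Substring matching is used
    for brevity; a real implementation would match on token boundaries."""
    return {s: t for s, t in glossary.items() if s in sentence}

print(relevant_terms("de directieraad besluit"))
# {'directieraad': 'board of management'}
```

The point of the post above is visible here: the selection depends only on the sentence currently being translated, not on how the glossary is partitioned across datasets.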