It would be nice to have something like the XML feature in Moses to specify required translations for parts of the source sentence – this way we could combine fuzzy matches from a TM and use NMT to only translate the unmatched part, as in Koehn & Senellart (2010 AMTA)
I’ve been experimenting with sending “pretranslations” through NMT. That seems to work OK and they generally pass through “unscathed” IF they are untagged. I’ve found that if you tag them NMT starts to do strange things. I guess we would need to include some tagged sentences in the training material as others have mentioned.
Hi @vincent, this will soon be possible with the lexically constrained beam search implementation that is coming in 0.9. It is not exactly equivalent to the Moses XML tags because you cannot force a specific part of the sentence to be translated in some strict way (since there is no strict source-target alignment), but it should work pretty well in most cases.
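To illustrate the idea (this is just a toy sketch of my own, not the actual OpenNMT implementation, and `constrained_greedy_decode`/`score_fn` are names I made up): a lexical constraint means the decoder is not allowed to finish until the required tokens have been emitted, rather than pinning them to fixed source positions.

```python
# Toy greedy decoder with one lexical constraint: the tokens in
# `constraint` must all appear, in order, before decoding may stop.
# `score_fn(prefix)` stands in for the model and returns the
# unconstrained best next token given the current output prefix.
def constrained_greedy_decode(score_fn, constraint, max_len=20, eos="</s>"):
    output, c_idx = [], 0  # c_idx = how much of the constraint is satisfied
    while len(output) < max_len:
        token = score_fn(output)
        if token == eos:
            if c_idx < len(constraint):
                # Model wants to stop, but the constraint is unfinished:
                # force the next constraint token instead of EOS.
                token = constraint[c_idx]
            else:
                break
        if c_idx < len(constraint) and token == constraint[c_idx]:
            c_idx += 1
        output.append(token)
    return output
```

A real implementation does this inside beam search (tracking constraint coverage per hypothesis), but the principle is the same: constraints gate the end of decoding instead of being tied to source spans.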
Thanks for the interesting article. Any idea when this will be available? What is the approximate release date of 0.9?
Just done!
Any plans to include pointer networks? If you could somehow mark this on the input string, as in Moses, this seems like a way to copy stuff from the input straight to the output.
I found that workaround of mine only worked for me in v0.7. In v0.9 it drives the model crazy and produces unhelpful output. I’ve had much more success with specializing models via retraining/incremental training.
@vincent / @tel34 I have used OpenNMT-tf models in the past, and I’m researching methods to integrate Translation Memories into my models. Any thoughts you can share based on your experience with this?
I recently saw this paper trying to tackle the same problem, and it reported an increase in BLEU scores.
That is a very recent paper that I didn’t know about, thanks. I am currently no longer working on this subject. The most recent paper on this that I know of is https://www.aclweb.org/anthology/P19-1175/
My experience in this matter was with the Lua version of OpenNMT which is now deprecated. I am now working with OpenNMT-tf but have not yet experimented with this subject. Have you considered whether “Guided alignment” would be useful for this?
Great, thanks! I will check it out.
Do you mean the “Pharaoh alignments” option? If so, I have used this for visualizing attention weights from OpenNMT-tf models.
Not sure how it can be used for Translation Memories though.
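For reference, the Pharaoh format is just space-separated `i-j` pairs, where `i` is a source token index and `j` a target token index. A minimal parser (my own sketch; `parse_pharaoh` is a name I made up) that turns such a string into a 0/1 matrix for visualization:

```python
def parse_pharaoh(align_str, src_len, tgt_len):
    """Parse a Pharaoh-format alignment string like "0-0 1-2 2-1"
    into a src_len x tgt_len matrix of 0/1 alignment indicators."""
    matrix = [[0] * tgt_len for _ in range(src_len)]
    for pair in align_str.split():
        s, t = map(int, pair.split("-"))
        matrix[s][t] = 1
    return matrix

# Example: source token 1 aligns to target token 2, etc.
grid = parse_pharaoh("0-0 1-2 2-1", src_len=3, tgt_len=3)
```

The resulting matrix can be plotted as a heatmap next to the attention weights to compare the two.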
Do you think this would be a good PR to include on GitHub? A lot of the translation industry has used TMs in the past, and they seem to provide a good lift in BLEU score.
Are you thinking of doing this “on-the-fly” during inference?
On-the-fly would be great and would make sense, but I’m open to offline training as well.
A hypothetical scenario would be: I have maintained TMs in a separate old system (or database) which is still collecting new translations from old users. I would have to download them (do a data dump) and append them offline to my new neural system.
In that case couldn’t you just apply a script to extract source & target segments from your TM into separate source & target files and then use incremental training? Most commercial TM systems provide a mechanism for filtering out segments on many different criteria, and this kind of training takes much less time than training the baseline model.
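Since most TM systems can export to TMX, the extraction step can be a short script. A minimal sketch (my own; `extract_tmx` and the language codes are assumptions, and it assumes a well-formed TMX file with one `<tuv>` per language in each `<tu>`):

```python
# Extract aligned segment pairs from a TMX export, for writing out
# parallel source/target files for incremental training.
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this namespace key.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_tmx(tmx_path, src_lang="en", tgt_lang="fr"):
    pairs = []
    root = ET.parse(tmx_path).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # Normalize "en-US" -> "en", etc.
            lang = (tuv.get(XML_LANG) or "").split("-")[0].lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                segs[lang] = " ".join(seg.text.split())
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```

The resulting pairs can then be written line-by-line into two plain-text files and used to continue training from the existing checkpoint, as described in the incremental-training docs.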
Yes, incremental training would work best for offline mode. And what would you suggest for on-the-fly training?