Multi-Source Translation

Are there plans to support Multi-Source Neural Translation (paper: http://www.isi.edu/natural-language/mt/multi-source-neural.pdf and implementation: https://github.com/isi-nlp/Zoph_RNN)? The scenario sounds a bit unusual, but I think it fits very well with all the multilingual patterns we already use.

@Bachstelze,
I don’t know if this solves your problem, but take a look at a similar approach using the current architecture:
https://arxiv.org/abs/1702.06135

If I understand the approach correctly, this (“Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages”) is only a simple solution for what OpenNMT is already doing when training the models. I think we have to distinguish between training and the actual translation. In the implementation I posted, the program is able to improve the results when it receives the source sentence to be translated in several different languages.
As I mentioned before, this scenario isn’t very common, because when you start translating you usually have only the source text and no already-translated corpus. This goes in the direction of interactive machine translation, where you reuse human corrections to improve the translation into other languages.

There was discussion of this in our dev channel. It might happen if someone gets interested.

Have you discussed how this fits into the current architecture and which multi-source combination and attention mechanism we want to use?

What we are considering is introducing a generic multi-encoder approach, which will allow natural analysis of parallel sentences (these can be different languages, or a source and a pretranslation). It should behave more nicely than just raw concatenation of the different sentences.

Hi, I wonder what the progress of this feature is. I am interested in it. Do you have any idea? Thanks.

A generic multi-encoder approach would be able to reproduce state-of-the-art models like the Double Path Network (paper: https://arxiv.org/pdf/1806.04856.pdf). The linked paper proposes a cross-attention with gating that consists of two modules to enable four types of information flow from the encoder to the decoder. Closer to the desired kind of encoding that incorporates pretranslated sentences is a neural automatic post-editing system (paper: https://arxiv.org/pdf/1807.00248.pdf). There, the attention between the encoder and the decoder consists of an extra attention layer for each encoder, each producing a context vector. The context vectors are then merged into a single context vector.
The latter approach seems to have far more parameters to learn, but it would be more flexible, clearer in its architecture, and scalable to an arbitrary number of encoders.
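
To make the merging step concrete, here is a minimal sketch of that idea in plain TensorFlow (not the paper’s exact formulation and not OpenNMT-tf code; the class and variable names are made up): one dot-product attention per encoder, with the per-encoder context vectors concatenated and projected back to a single context vector.

    import tensorflow as tf

    class MultiEncoderAttention(tf.keras.layers.Layer):
        """One dot-product attention per encoder; the per-encoder context
        vectors are concatenated and projected back to a single context vector."""

        def __init__(self, units):
            super().__init__()
            self.proj = tf.keras.layers.Dense(units)

        def call(self, decoder_state, encoder_outputs):
            # decoder_state: [batch, units]
            # encoder_outputs: list of [batch, src_len_i, units] tensors, one per encoder.
            contexts = []
            for memory in encoder_outputs:
                # Score the decoder state against every source position.
                scores = tf.matmul(memory, tf.expand_dims(decoder_state, -1))  # [batch, src_len_i, 1]
                weights = tf.nn.softmax(scores, axis=1)
                # Weighted sum over source positions -> one context vector per encoder.
                contexts.append(tf.reduce_sum(weights * memory, axis=1))       # [batch, units]
            # Merge all per-encoder context vectors into a single one.
            return self.proj(tf.concat(contexts, axis=-1))                     # [batch, units]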

What kind of attention mechanism between the encoder and decoder did you consider?

The second one.

It seems like the TensorFlow implementation already offers multiple encoders! But the documentation is very sparse. The proposed ConcatReducer and JoinReducer are rather simplistic. And I can’t find a possible attention layer on top of the multiple encoders.

Could you elaborate on what is missing from the documentation? I would like to complete it but I welcome help.

The main gap I see in the documentation is the completely missing description of the reducer module. The reason for this could be that the implementations are, in my opinion, very simple. I could be wrong, but I can’t tell, because there is no description of why to use one, and not even a link to a section of an earlier paper. Maybe they are just used from the Inputters?

Suppose I want to use an attention module as a final reducer on top of the different encoders (like in the mentioned post-editing system). How could I change the reducer? Do I have to reimplement a reducer module, or can I just use an encoder as a reducer?

The package overview description http://opennmt.net/OpenNMT-tf/package/opennmt.inputters.html is missing the MixedInputter and ParallelInputter. Perhaps it is logical that they are in the opennmt.inputters.inputter module? A description of the module structure would help.
Then, in the specific class descriptions, it is unclear whether they are all generic. Guidance with more than one sentence per class, or an example, would help. And here again: why use a certain reducer? In addition, there are just method names in the description.

The reducers are just objects that implement different ways of merging tensors: concatenation, sum, multiplication, etc. They are used extensively throughout the project to easily parameterize how things are merged, e.g.:

  • word and character-level embeddings
  • forward and backward states of a bidirectional RNN
  • input vector and positional encoding
  • etc.

Maybe there is some confusion about the scope of the reducer module, which is actually fairly small.
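
Conceptually, on two tensors of the same shape, the reducers do little more than the following (an illustrative sketch in plain TensorFlow, not the actual classes, which also take care of sequence lengths):

    import tensorflow as tf

    a = tf.random.normal([2, 5, 64])      # e.g. output of encoder 1: [batch, time, depth]
    b = tf.random.normal([2, 5, 64])      # e.g. output of encoder 2: [batch, time, depth]

    concatenated = tf.concat([a, b], -1)  # what a ConcatReducer does   -> [2, 5, 128]
    summed = tf.add_n([a, b])             # what a SumReducer does      -> [2, 5, 64]
    multiplied = a * b                    # what a MultiplyReducer does -> [2, 5, 64]
    joined = (a, b)                       # what a JoinReducer does: keep the tensors separate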

To implement a custom attention mechanism, one would certainly need to use OpenNMT-tf as a library and implement a custom decoder.

Why should I concatenate, sum, or multiply the outputs of multiple encoders? Is there an example of using such reducers for multiple encoders? Why implement a custom decoder?

Not for multi-source translation but look at this Google paper and their “Hybrid NMT Models”:

https://arxiv.org/abs/1804.09849

It concatenates the outputs of a self-attention encoder and an RNN encoder.
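
For that kind of setup, something along these lines should be close to what is needed. This is an untested sketch written from memory of the OpenNMT-tf 1.x API, so module paths and constructor arguments may differ in other versions:

    import opennmt as onmt

    # Hybrid encoder: a self-attention encoder and a bidirectional RNN encoder
    # running in parallel over the same source.
    encoder = onmt.encoders.ParallelEncoder(
        [
            onmt.encoders.SelfAttentionEncoder(
                num_layers=6, num_units=512, num_heads=8, ffn_inner_dim=2048),
            onmt.encoders.BidirectionalRNNEncoder(num_layers=2, num_units=512),
        ],
        # Concatenate the two encoder outputs along the depth dimension.
        outputs_reducer=onmt.layers.ConcatReducer(axis=-1),
        # Keep the encoder states separate instead of merging them.
        states_reducer=onmt.layers.JoinReducer())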

If you don’t reduce the encoder outputs, that means the decoder has to compute attention on multiple heads, which is currently unsupported.

Is the architecture from this paper supported? https://arxiv.org/abs/1704.06393 Or would this require some custom components?

Or would this require some custom components?

Suppose that you have your desired multi-source system; you would then require an extra SMT system. In the linked paper the authors use Moses, which is the best-documented and most-tested open-source SMT system. It has different models, like a string-to-tree decoder, but for a bigger variety you should also have a look at http://www.phontron.com/travatar/, a forest/tree-to-string SMT system. So you need further components besides OpenNMT, though they should be ready to use for this case.

If you don’t reduce the encoder outputs, that means the decoder has to compute attention on multiple heads, which is currently unsupported.

Couldn’t we use an extra attention layer on top of the heads of the different encoders?

There are several approaches to computing attention over multiple encoders. The post-editing systems in particular use multi-source architectures. I don’t know which is the best option, and I have already lost track of some approaches, so I will just list them (one of the combination strategies is sketched after the list):

http://www.statmt.org/wmt17/pdf/WMT77.pdf

An interesting piece of research on the multi-source transformer:

Input Combination Strategies for Multi-Source Transformer Decoder
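
For example, the serial strategy from the last paper stacks one cross-attention block per source inside each decoder layer. A minimal sketch in plain TensorFlow (not OpenNMT-tf code; residual/normalization details and dropout are simplified, and all names are made up):

    import tensorflow as tf

    class SerialMultiSourceAttention(tf.keras.layers.Layer):
        """Serial strategy: the decoder states first attend to encoder 1,
        and the result then attends to encoder 2."""

        def __init__(self, num_heads=8, key_dim=64):
            super().__init__()
            self.attn_1 = tf.keras.layers.MultiHeadAttention(num_heads, key_dim)
            self.attn_2 = tf.keras.layers.MultiHeadAttention(num_heads, key_dim)
            self.norm_1 = tf.keras.layers.LayerNormalization()
            self.norm_2 = tf.keras.layers.LayerNormalization()

        def call(self, decoder_states, memory_1, memory_2):
            # decoder_states: [batch, tgt_len, depth]
            # memory_1, memory_2: the two encoder outputs, [batch, src_len_i, depth]
            x = self.norm_1(decoder_states + self.attn_1(decoder_states, memory_1))
            x = self.norm_2(x + self.attn_2(x, memory_2))
            return x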

The state of the art in speech recognition uses stream attention for multiple encoders: