Multi-Source Translation

Are there plans to support Multi-Source Neural Translation (paper: http://www.isi.edu/natural-language/mt/multi-source-neural.pdf and implementation: https://github.com/isi-nlp/Zoph_RNN)? The scenario sounds a bit unusual, but I think it fits very well with all the multilingual patterns we already use.

@Bachstelze,
I don’t know if this solves your problem, but take a look at a similar approach using the current architecture:
https://arxiv.org/abs/1702.06135

If I understand the approach correctly, this (“Enabling Multi-Source Neural Machine Translation By Concatenating Source Sentences In Multiple Languages”) is only a simple solution for what OpenNMT is already doing when training the models. I think we have to distinguish between training and the actual translation. In the implementation I posted, the program is able to improve the results when it receives the source sentence to be translated in several different languages.
As I mentioned before, this scenario isn’t very common, because when you start translating you usually have only the source text and no already-translated corpus. This goes in the direction of interactive machine translation, where you reuse human corrections to improve the translation into other languages.

There was discussion of this in our dev channel. It might happen if someone gets interested.

Have you discussed how this fits into the current architecture and which multi-source combination and attention mechanism we want to use?

What we are considering is introducing a generic multi-encoder approach, which will allow natural analysis of parallel sentences (these can be different languages, or a source and a pretranslation). It should behave more nicely than just raw concatenation of the different sentences.

Hi, I wonder what the progress of this feature is. I am interested in it. Do you have any idea? Thanks.

A generic multi-encoder approach would be able to reproduce state-of-the-art models like the Double Path Network (paper: https://arxiv.org/pdf/1806.04856.pdf). The linked paper proposes a cross-attention with gating that consists of two modules to enable four types of information flow from the encoder to the decoder. Closer to the desired kind of encoding that incorporates pretranslated sentences is a neural automatic post-editing system (paper: https://arxiv.org/pdf/1807.00248.pdf). There, the attention between the encoder and the decoder consists of an extra attention layer for each encoder, each producing a context vector. The context vectors are then merged into a single context vector.
The latter approach seems to have far more parameters to learn, but it would be more flexible, clearer in its architecture, and scalable to an arbitrary number of encoders.
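
To make the merging step concrete, here is a minimal sketch of that idea in plain TensorFlow (not the paper’s exact formulation and not OpenNMT-tf code; the class and variable names are made up): one dot-product attention per encoder, with the per-encoder context vectors concatenated and projected back to a single context vector.

    import tensorflow as tf

    class MultiEncoderAttention(tf.keras.layers.Layer):
        """One dot-product attention per encoder; the per-encoder context
        vectors are concatenated and projected back to a single context vector."""

        def __init__(self, units):
            super().__init__()
            self.proj = tf.keras.layers.Dense(units)

        def call(self, decoder_state, encoder_outputs):
            # decoder_state: [batch, units]
            # encoder_outputs: list of [batch, src_len_i, units] tensors, one per encoder.
            contexts = []
            for memory in encoder_outputs:
                # Score the decoder state against every source position.
                scores = tf.matmul(memory, tf.expand_dims(decoder_state, -1))  # [batch, src_len_i, 1]
                weights = tf.nn.softmax(scores, axis=1)
                # Weighted sum over source positions -> one context vector per encoder.
                contexts.append(tf.reduce_sum(weights * memory, axis=1))       # [batch, units]
            # Merge all per-encoder context vectors into a single one.
            return self.proj(tf.concat(contexts, axis=-1))                     # [batch, units]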

What kind of attention mechanism between the encoder and decoder did you consider?

The second one.

It seems like the TensorFlow implementation already offers multiple encoders! But the documentation is very sparse. The proposed ConcatReducer and JoinReducer are rather simplistic. And I can’t find a possible attention layer on top of the multiple encoders.

Could you elaborate on what is missing from the documentation? I would like to complete it but I welcome help.

The main gap I see in the documentation is the completely missing description of the reducer module. The reason for this could be that the implementations are, in my opinion, very simple. I could be wrong, but I can’t tell, because there is no description of why to use one, and not even a link to a section of an earlier paper. Maybe they are just used from the Inputters?

Suppose I want to use an attention module as a final reducer on top of the different encoders (like in the mentioned post-editing system). How could I change the reducer? Do I have to reimplement a reducer module, or can I just use an encoder as a reducer?

The package overview description http://opennmt.net/OpenNMT-tf/package/opennmt.inputters.html is missing the MixedInputter and ParallelInputter. Perhaps it is logical that they are in the opennmt.inputters.inputter module? A description of the module structure would help.
Then, in the specific class descriptions, it is unclear whether they are all generic. Guidance with more than one sentence per class, or an example, would help. And here again: why use a certain reducer? In addition, there are just method names in the description.

The reducers are just objects that implement different ways of merging tensors: concatenation, sum, multiplication, etc. They are used extensively throughout the project to easily parameterize how things are merged, e.g.:

  • word and character-level embeddings
  • forward and backward states of a bidirectional RNN
  • input vector and positional encoding
  • etc.

Maybe there is some confusion about the scope of the reducer module, which is actually fairly small.
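
Conceptually, on two tensors of the same shape, the reducers do little more than the following (an illustrative sketch in plain TensorFlow, not the actual classes, which also take care of sequence lengths):

    import tensorflow as tf

    a = tf.random.normal([2, 5, 64])      # e.g. output of encoder 1: [batch, time, depth]
    b = tf.random.normal([2, 5, 64])      # e.g. output of encoder 2: [batch, time, depth]

    concatenated = tf.concat([a, b], -1)  # what a ConcatReducer does   -> [2, 5, 128]
    summed = tf.add_n([a, b])             # what a SumReducer does      -> [2, 5, 64]
    multiplied = a * b                    # what a MultiplyReducer does -> [2, 5, 64]
    joined = (a, b)                       # what a JoinReducer does: keep the tensors separate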

To implement a custom attention mechanism, one would certainly need to use OpenNMT-tf as a library and implement a custom decoder.

Why should I concatenate, sum, or multiply the outputs of multiple encoders? Is there an example of using such reducers for multiple encoders? Why implement a custom decoder?

Not for multi-source translation but look at this Google paper and their “Hybrid NMT Models”:

https://arxiv.org/abs/1804.09849

It concatenates the outputs of a self-attention encoder and an RNN encoder.
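
For that kind of setup, something along these lines should be close to what is needed. This is an untested sketch written from memory of the OpenNMT-tf 1.x API, so module paths and constructor arguments may differ in other versions:

    import opennmt as onmt

    # Hybrid encoder: a self-attention encoder and a bidirectional RNN encoder
    # running in parallel over the same source.
    encoder = onmt.encoders.ParallelEncoder(
        [
            onmt.encoders.SelfAttentionEncoder(
                num_layers=6, num_units=512, num_heads=8, ffn_inner_dim=2048),
            onmt.encoders.BidirectionalRNNEncoder(num_layers=2, num_units=512),
        ],
        # Concatenate the two encoder outputs along the depth dimension.
        outputs_reducer=onmt.layers.ConcatReducer(axis=-1),
        # Keep the encoder states separate instead of merging them.
        states_reducer=onmt.layers.JoinReducer())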

If you don’t reduce the encoder outputs, that means the decoder has to compute attention on multiple heads, which is currently unsupported.

Is the architecture from this paper supported? https://arxiv.org/abs/1704.06393 Or would this require some custom components?

Or would this require some custom components?

Suppose that you have your desired multi-source system; you would then require an extra SMT system. In the linked paper the authors use Moses, which is the best-documented and most-tested open-source SMT system. It has different models, like a string-to-tree decoder, but for a bigger variety you should also have a look at http://www.phontron.com/travatar/, a forest/tree-to-string SMT system. So you need further components besides OpenNMT, though they should be ready to use for this case.

If you don’t reduce the encoder outputs, that means the decoder has to compute attention on multiple heads, which is currently unsupported.

Couldn’t we use an extra attention layer on top of the heads of the different encoders?

There are several approaches to computing attention over multiple encoders. The post-editing systems in particular use multi-source architectures. I don’t know which is the best option, and I have already lost track of some approaches, so I will just list them (one of the combination strategies is sketched after the list):

http://www.statmt.org/wmt17/pdf/WMT77.pdf

An interesting piece of research on the multi-source transformer:

Input Combination Strategies for Multi-Source Transformer Decoder
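
For example, the serial strategy from the last paper stacks one cross-attention block per source inside each decoder layer. A minimal sketch in plain TensorFlow (not OpenNMT-tf code; residual/normalization details and dropout are simplified, and all names are made up):

    import tensorflow as tf

    class SerialMultiSourceAttention(tf.keras.layers.Layer):
        """Serial strategy: the decoder states first attend to encoder 1,
        and the result then attends to encoder 2."""

        def __init__(self, num_heads=8, key_dim=64):
            super().__init__()
            self.attn_1 = tf.keras.layers.MultiHeadAttention(num_heads, key_dim)
            self.attn_2 = tf.keras.layers.MultiHeadAttention(num_heads, key_dim)
            self.norm_1 = tf.keras.layers.LayerNormalization()
            self.norm_2 = tf.keras.layers.LayerNormalization()

        def call(self, decoder_states, memory_1, memory_2):
            # decoder_states: [batch, tgt_len, depth]
            # memory_1, memory_2: the two encoder outputs, [batch, src_len_i, depth]
            x = self.norm_1(decoder_states + self.attn_1(decoder_states, memory_1))
            x = self.norm_2(x + self.attn_2(x, memory_2))
            return x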

The state of the art in speech recognition uses stream attention for multiple encoders: