Input vector as input and branch encoder

Hi, can I double check:

  1. If I wish to input a vector directly instead of the embedding produced by the embedding layer, can I achieve this via the “loading pretrained embeddings” method, by loading in my own vectors?

  2. I’m interested in branching, e.g. 2 encoders feeding 1 decoder, meaning that the encoded hidden state passed to the decoder could be the concatenation or sum of the multiple encoders’ states. Also, each encoder could have its own set of inputs and be able to load its individual pretrained embeddings or plain input vectors, if point (1) is correct. Just wondering if any work has been done on this.



  1. See here:

  2. No work has been done to implement this, but we could add it in the future. For example, the -brnn option is actually a combination of two independent encoders processing the input in different directions.
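To make the idea in question (2) concrete, here is a minimal sketch (plain numpy, not OpenNMT code; all names are illustrative) of the two ways the final hidden states of two independent encoders could be combined before being passed to a single decoder:

```python
import numpy as np

# Hypothetical final hidden states from two independent encoders
# (batch_size=2, hidden_size=4).
h_enc1 = np.random.rand(2, 4)
h_enc2 = np.random.rand(2, 4)

# Option 1: concatenate along the feature axis -> shape (2, 8).
# The decoder would then need a matching hidden size (or a projection).
h_concat = np.concatenate([h_enc1, h_enc2], axis=-1)

# Option 2: element-wise sum -> shape (2, 4), decoder size unchanged.
h_sum = h_enc1 + h_enc2

print(h_concat.shape)  # (2, 8)
print(h_sum.shape)     # (2, 4)
```

Concatenation preserves information from each encoder separately at the cost of a larger decoder input, while summing keeps dimensions fixed but mixes the two sources.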


Hi, what is the use case you have in mind? We have been discussing internally for a while the possibility of plugging in multiple encoders, and I am interested to know the different use cases before we move ahead with the implementation. As Guillaume mentions, the structure is already available for such an extension, but we would like the design to be as flexible as possible.


Thanks for the fast replies!

I’m trying to extend seq2seq to other applications, not NMT. IMO, the main benefit of multiple encoders is handling different forms of sources. For example, in an IoT case, one encoder could encode temperature information, another could encode sound waves, and so on. It happens that I have a sequential output of severity codes. Tossing these forms of data, from different sources and with different value scales, into a single encoder does not seem ideal compared to concatenating the encoded hidden states. Unfortunately, I am not proficient enough in Lua to contribute technically.

By the way, I’m very interested in the word features capability.
I’d appreciate it if you could help me confirm my understanding:

  1. The word feature embeddings are optimized the same way as a normal word embedding in NMT, through gradient updates. If the word embedding is 100 dims with a feature embedding of 50 dims, it is updated as if it were a 150-dim “word embedding”, except that the update for those 50 dims goes to the feature embedding.
  2. If I turn fixed embeddings on (e.g. fix_word_vecs_enc), does it still update both the word and feature embeddings?


  1. Yes, word feature embeddings are optimized the same way the word embeddings are. All embeddings are then concatenated (by default) and fed to the RNN.

  2. Currently, fixing embeddings only works for word embeddings. Word feature embeddings are still optimized when this flag is enabled.
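The lookup-and-concatenate step described in (1) can be sketched as follows (plain numpy, not OpenNMT code; the table sizes and indices are made up for illustration):

```python
import numpy as np

# Hypothetical embedding tables: 100-dim word embeddings and
# 50-dim feature embeddings over small vocabularies.
word_table = np.random.rand(1000, 100)
feat_table = np.random.rand(10, 50)

# Look up one token's word id and feature id (illustrative indices),
# then concatenate the two vectors.
word_id, feat_id = 42, 3
token_vec = np.concatenate([word_table[word_id], feat_table[feat_id]])

print(token_vec.shape)  # (150,)
```

The RNN sees a single 150-dim input per token, but because the vector is built from two separate tables, gradient updates flow back to the word table and the feature table independently, which is why a flag fixing one table can leave the other trainable.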


Also: target features are time shifted. To avoid this, you need this mod:


Yes, it is also the type of use case we are thinking about. We are currently working on a few other features, but we will keep this request in mind for the near future.


Is this an issue only if I include feature embeddings for the target side?
Currently, I’m interested in only the encoder side.

Does this mean it is fine?

As far as I know, time shifting is only on the decoder side.
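To illustrate the time shifting being discussed (a sketch of the behavior as I understand it, not OpenNMT's exact implementation; the tokens and padding symbol are made up):

```python
# Aligned target words and their features.
words = ["the", "cat", "sat"]
features = ["DT", "NN", "VB"]

# With decoder-side time shifting, the feature stream is offset by one
# position, so each word ends up paired with the previous word's feature.
shifted = ["<pad>"] + features[:-1]

print(list(zip(words, shifted)))
# [('the', '<pad>'), ('cat', 'DT'), ('sat', 'NN')]
```

Since this shift only happens on the decoder side, encoder-side (source) features stay aligned with their words, which is why using features only on the encoder side avoids the issue.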


Also, if this branching is implemented, the ability to create architectures like these will make OpenNMT extremely powerful.

Hi @rahular, thanks for sharing this paper. I am adding it to my list…

Hi jean,

Has this feature been pursued yet? I am also interested in using these types of models, for multimodal speech recognition. Specifically, I would like to combine the pyramidal encoder for speech and the CNN encoder for vision, etc.

Hi Shruti,

Currently, we have no plans to add this feature, at least in OpenNMT Lua. However, we are working on a new project that will support these types of models. Stay tuned!


This is supported in OpenNMT-tf.

See for example the multi-source NMT model. One of the input files can be changed to serialized real vectors.


Thanks Guillaume! OpenNMT-tf is exciting!