Using EmbeddingsSharingLevel in a dual source transformer

BramVanroy · June 10, 2020, 1:09pm

Hi

We are interested in using the dual source transformer for our research. Going through the code, it seems that for the different input sides, a different embedding is created. That is what I expect. However, I would like to have one of these input embeddings shared with the target embedding. So basically having one source text as input (in1), one target text as input (in2), and another target text as output (out1), so that the "target"s (in2, out1) have shared embedding space. Is that possible?

It seems that share_embeddings only takes one value so I don’t think it is possible to force different behaviour for the different inputs.

In addition, I do not quite understand the (very brief) explanation of EmbeddingsSharingLevel. Does SOURCE_TARGET_INPUT mean that the source and target words are in the same embedding space? (One vocabulary to rule them all.) If so, I am not sure what TARGET means (“share target word embeddings and softmax weights”): share with what? What is meant with sharing here? If you could elaborate on the meaning of these levels, I’d be grateful.

Hereby I would also like to request a port of the multi-source transformer to the PyTorch version.

guillaumekln · June 10, 2020, 1:31pm

Hi,

The current share_embeddings argument does not allow sharing the target embedding with a single source embedding. However, this can easily be achieved in the custom model definition by overriding the build method (based on https://github.com/OpenNMT/OpenNMT-tf/blob/master/config/models/multi_source_transformer.py):

"""Defines a dual source Transformer architecture with serial attention layers
and parameter sharing between the encoders.

See for example https://arxiv.org/pdf/1809.00188.pdf.

The YAML configuration file should look like this:

data:
  train_features_file:
    - source_1.txt
    - source_2.txt
  train_labels_file: target.txt
  source_1_vocabulary: source_1_vocab.txt
  source_2_vocabulary: source_2_vocab.txt
  target_vocabulary: target_vocab.txt
"""

import opennmt as onmt

from opennmt.utils import misc


class DualSourceTransformer(onmt.models.Transformer):

  def __init__(self):
    super().__init__(
      source_inputter=onmt.inputters.ParallelInputter([
          onmt.inputters.WordEmbedder(embedding_size=512),
          onmt.inputters.WordEmbedder(embedding_size=512)]),
      target_inputter=onmt.inputters.WordEmbedder(embedding_size=512),
      num_layers=6,
      num_units=512,
      num_heads=8,
      ffn_inner_dim=2048,
      dropout=0.1,
      attention_dropout=0.1,
      ffn_dropout=0.1,
      share_encoders=True)

  def build(self, input_shape):
    super().build(input_shape)
    # Share target embedding with the second source embedding.
    self.labels_inputter.embedding = self.features_inputter.inputters[1].embedding

  def auto_config(self, num_replicas=1):
    config = super().auto_config(num_replicas=num_replicas)
    max_length = config["train"]["maximum_features_length"]
    return misc.merge_dict(config, {
        "train": {
            "maximum_features_length": [max_length, max_length]
        }
    })


model = DualSourceTransformer

Here we simply assign the target embedding with the embedding of the second source. Does that make sense?
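To make the effect of that assignment concrete, here is a minimal sketch in plain Python. ToyEmbedder is a hypothetical stand-in for a word embedder, not the OpenNMT-tf class, and the in-place write stands in for a gradient update. It only illustrates that, after the assignment, both attributes refer to the same weight object.

```python
class ToyEmbedder:
    """Hypothetical stand-in for a word embedder holding one weight matrix."""

    def __init__(self, vocab_size, dim):
        self.embedding = [[0.0] * dim for _ in range(vocab_size)]


source_1 = ToyEmbedder(vocab_size=4, dim=3)
source_2 = ToyEmbedder(vocab_size=4, dim=3)
target = ToyEmbedder(vocab_size=4, dim=3)

# Same idea as the build() override: point the target at the weight object
# of the second source instead of keeping a separate matrix.
target.embedding = source_2.embedding

# Stand-in for a training update applied through the second source.
source_2.embedding[1][0] = 0.5

print(target.embedding is source_2.embedding)  # the two share one object
print(target.embedding[1][0], source_1.embedding[1][0])
```

Sharing one matrix this way only makes sense if both sides index the same vocabulary, since row i must mean the same token for the second source and for the target.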

Regarding SOURCE_TARGET_INPUT: correct. All source and target inputs will use the same embedding weight.

With TARGET, the target embedding is reused as the weight of the last linear transformation, the one that maps the decoder output to the vocabulary size before softmax.
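Reusing the target embedding as the weight of the last linear transformation can be sketched as follows. This is a toy illustration in plain Python with made-up sizes and hypothetical helper names (make_embedding, embed, output_logits), not the library code: the same matrix is used once to look up a token vector and once, row by row, to score every vocabulary entry.

```python
import random


def make_embedding(vocab_size, dim, seed=0):
    """Create a small random embedding matrix of shape (vocab_size, dim)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed(embedding, token_id):
    """Input side: look up the row for a token id."""
    return embedding[token_id]


def output_logits(embedding, decoder_state):
    """Output side: score each vocabulary entry with the same matrix.

    Equivalent to multiplying the decoder state by the transposed embedding,
    so no separate projection weight is needed.
    """
    return [sum(w * h for w, h in zip(row, decoder_state)) for row in embedding]


embedding = make_embedding(vocab_size=5, dim=4)
state = embed(embedding, 2)          # pretend the decoder output equals this vector
logits = output_logits(embedding, state)
print(len(logits))                   # one score per vocabulary entry
```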

Hope this helps.

BramVanroy · June 10, 2020, 1:45pm

Wow, thanks for the swift reply! This is very helpful, indeed.

Looking at the docstring with the YAML config example, is the source_2_vocabulary still required? Or can this be the same as the target_vocabulary or even an empty file, assuming that the “target” embedding will be built based on the target vocabulary? (Or, looking at how you implemented it, the other way around.)

A related question: if I understand correctly, dual source creates two encoders whose output states are then reduced to a single state which serves as the input for the decoder. This seems indeed a good way to handle this and I like the different options that are available. However, I am curious to see if a multiheaded cross attention layer on top of the two encoders would yield interesting results. The intuition being that the model can contrast in1 with in2 and figure out interesting points of attention. I would assume that this requires deeper changes but asking doesn’t hurt, I suppose.

guillaumekln · June 10, 2020, 1:54pm

You should still configure both target_vocabulary and source_2_vocabulary, but in your case they will just point to the same vocabulary file.
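As a rough sketch, the data section could then look like the following, written here as a Python dictionary that mirrors the YAML from the docstring above. The file names are the placeholder names from that docstring; the only change is that source_2_vocabulary and target_vocabulary refer to the same file.

```python
# Placeholder path: one vocabulary file used for both the second source
# and the target, so that the shared embedding indexes the same tokens.
shared_vocab = "target_vocab.txt"

config = {
    "data": {
        "train_features_file": ["source_1.txt", "source_2.txt"],
        "train_labels_file": "target.txt",
        "source_1_vocabulary": "source_1_vocab.txt",
        "source_2_vocabulary": shared_vocab,
        "target_vocabulary": shared_vocab,
    }
}

print(config["data"]["source_2_vocabulary"] == config["data"]["target_vocabulary"])
```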

Actually, in the case of the dual source Transformer that was shared above, the 2 encoder outputs are not reduced and there will be sequential attention layers in the decoder, exactly like this image:

But as you mentioned, there are also ways to reduce the encoder outputs if needed.

BramVanroy · June 11, 2020, 1:06pm

My bad, I hadn’t read the paper - I have now. That is some very interesting work for our use-case. I do not fully understand the motivation for tying encoder states, though, except for saving compute. I assume that with the code example you posted above, the embeddings of in2 are not tied to those of in1. But what about the encoders themselves? Are they tied? And if so, can they be un-tied (with the risk of OOM I suppose)?

Thanks

guillaumekln · June 11, 2020, 1:11pm

See the argument share_encoders in the model example above, which you can disable.

The main concern with this type of model is the memory usage, so sharing encoder weights helps but it does not always make sense.

BramVanroy · June 11, 2020, 1:33pm

I have everything I need for now. Thanks a lot. You are doing great work for the community by replying so thoroughly and quickly!

BramVanroy · June 12, 2020, 3:16pm

Hi @guillaumekln, sorry to bump this.

If I use the dual source transformer as you propose, does the system allow empty inputs? Can for some samples in1 be empty and for others in2?

guillaumekln · June 12, 2020, 3:25pm

If one of the two is empty, the complete example will be filtered out during training.

BramVanroy · June 12, 2020, 3:35pm

Is there any way to make it work as I would expect? Perhaps just putting a single padding token in the “empty side”?

The idea is that you sometimes, but not always, have additional data to learn from, and you still wish to make use of all the data.

guillaumekln · June 12, 2020, 3:53pm

Maybe the easiest is that you preprocess your training data and replace empty lines by a custom token?
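A minimal sketch of such a preprocessing step, in plain Python. The token string "<empty>" and the helper name fill_empty_lines are hypothetical choices for illustration; whatever token is used would also need to be present in the vocabulary of the corresponding input.

```python
EMPTY_TOKEN = "<empty>"  # hypothetical placeholder, must exist in the vocabulary


def fill_empty_lines(lines, token=EMPTY_TOKEN):
    """Replace empty or whitespace-only lines by a placeholder token."""
    return [line if line.strip() else token for line in lines]


# Toy parallel inputs: each side is missing for one of the three examples.
source_1 = ["ein Haus", "", "der Hund"]
source_2 = ["a house", "a cat", "   "]

filled_1 = fill_empty_lines(source_1)
filled_2 = fill_empty_lines(source_2)

for in1, in2 in zip(filled_1, filled_2):
    print(in1, "|||", in2)
```

For real data the same function can be applied to the lines read from each training file before writing them back out, so that no example has an empty side.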

BramVanroy · June 12, 2020, 4:14pm

Indeed. My fear is, though, that this token (in our case) will be part of the target vocabulary in which case it is an (unlikely) candidate in prediction even though it should never be predicted as a target token. My suggestion for the padding token was assuming that it never gets attention because it is filtered out by the attention mask, or at least that would be my hunch. (But then that may lead to zero dimension tensors if a whole sequence consists only of padding tokens that are ignored.)

guillaumekln · June 12, 2020, 4:30pm

The issue I see is that the softmax in the attention layer will return NaN if a sequence is just padding.
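To illustrate, here is a small sketch of a masked softmax in plain Python, assuming masked positions are set to negative infinity before normalization (implementations that use a large finite negative value instead would behave differently). masked_softmax is a hypothetical helper, not a library function.

```python
import math

NEG_INF = float("-inf")


def masked_softmax(scores, mask):
    """Softmax over scores where positions with mask=False are excluded."""
    masked = [s if keep else NEG_INF for s, keep in zip(scores, mask)]
    peak = max(masked)
    # If every position is masked, peak is -inf and (-inf) - (-inf) is NaN.
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


partial = masked_softmax([1.0, 2.0, 3.0], [True, True, False])
all_padding = masked_softmax([1.0, 2.0, 3.0], [False, False, False])

print(partial)      # valid distribution, masked position gets zero weight
print(all_padding)  # every entry is NaN
```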

The custom token should work OK I think. As it would never appear on the target side, the probability to generate it is almost 0. And if it does, the prediction is likely unusable anyway because the model got lost.

BramVanroy · June 12, 2020, 4:42pm

Excellent. Thanks for your input!

BramVanroy · June 12, 2020, 8:16pm

Sorry to keep bringing this topic up but I think it’s all relevant to the same issue.

I implemented the model as you proposed and am currently training two versions with the same data and the same seed. The only difference is share_encoders is True in one and False in the other. At the start of the process I can see the difference in parameters:

Tied:

Number of model parameters: 92,018,294
Number of model weights: 320 (trainable = 320, non trainable = 0)

Not tied:

Number of model parameters: 110,933,622
Number of model weights: 418 (trainable = 418, non trainable = 0)

This difference seems very small for a full second encoder’s parameters. This small difference is also visible in:

memory usage: both using exactly 15773MiB
speed: steps/s = 4.21, target words/s = 12100 (tied) vs steps/s = 4.06, target words/s = 12031 (not tied)

In terms of results (only the first evaluation, after 5000 steps): I don’t care as much that the results are worse for the not tied version (even though I would expect otherwise), but the other differences in particular seem shady.

tied: loss = 2.149598 ; perplexity = 8.581405
not tied: loss = 2.452493 ; perplexity = 11.617277
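As a quick sanity check on these numbers, the reported perplexities are consistent with perplexity = exp(loss), assuming the loss is a mean per-token cross-entropy in nats. A short sketch:

```python
import math


def perplexity(loss):
    """Perplexity for a mean per-token cross-entropy given in nats."""
    return math.exp(loss)


# (loss, reported perplexity) pairs from the two evaluations above.
reported = [(2.149598, 8.581405), (2.452493, 11.617277)]

for loss, ppl in reported:
    print(f"loss={loss:.6f} exp(loss)={perplexity(loss):.6f} reported={ppl:.6f}")
```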

These differences are so small that I wonder whether I made a mistake or whether the weight tying happens differently than I would expect. As I said, I am training two different models. Exactly the same model and config and seed, except for share_encoders. Any thoughts?

guillaumekln · June 12, 2020, 8:27pm

I’m not sure about the exact counts but I think that’s correct.

One decoder layer is almost twice as big as one encoder layer (see the image a few posts above). So it makes sense that removing one encoder removes 1/4 of the weights.
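The counts can be checked with a bit of arithmetic. The sketch below assumes a particular encoder layout: each layer has four dense projections with biases in self-attention (query, key, value, output), a two-layer feed-forward block, and two layer normalizations, plus one final layer normalization on the encoder output. That layout is an assumption, not read from the library source, but with the sizes from the model definition above (6 layers, 512 units, 2048 inner dimension) it reproduces both reported differences.

```python
def dense_params(in_dim, out_dim):
    """Kernel plus bias of one dense layer."""
    return in_dim * out_dim + out_dim


def layer_norm_params(dim):
    """Scale (gamma) plus offset (beta) of one layer normalization."""
    return 2 * dim


def encoder_params(num_layers=6, num_units=512, ffn_inner_dim=2048):
    attention = 4 * dense_params(num_units, num_units)  # query, key, value, output
    ffn = dense_params(num_units, ffn_inner_dim) + dense_params(ffn_inner_dim, num_units)
    norms = 2 * layer_norm_params(num_units)
    per_layer = attention + ffn + norms
    return num_layers * per_layer + layer_norm_params(num_units)  # final layer norm


def encoder_weight_tensors(num_layers=6):
    # 4 attention dense layers and 2 feed-forward dense layers (kernel + bias each),
    # 2 layer norms (gamma + beta each), plus the final layer norm.
    per_layer = 4 * 2 + 2 * 2 + 2 * 2
    return num_layers * per_layer + 2


tied_params, untied_params = 92_018_294, 110_933_622
tied_weights, untied_weights = 320, 418

extra = encoder_params()
print(extra, untied_params - tied_params)                      # both 18915328
print(encoder_weight_tensors(), untied_weights - tied_weights)  # both 98
```

Under these assumptions the second encoder accounts for about 17% of the untied model's parameters and 98 of its 418 weight tensors, so the gap is exactly one extra encoder.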

TensorFlow uses all available GPU memory by default.
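So the identical 15773MiB figure mostly reflects the allocator rather than the models. To compare actual usage, one option is to ask TensorFlow to allocate GPU memory on demand through the TF_FORCE_GPU_ALLOW_GROWTH environment variable. A minimal sketch, where enable_gpu_memory_growth is a hypothetical helper and the variable has to be set before TensorFlow initializes the GPU:

```python
import os


def enable_gpu_memory_growth(env=os.environ):
    """Request on-demand GPU memory allocation from TensorFlow.

    Call this before TensorFlow is imported (or export the variable in the
    shell before launching training), otherwise it has no effect.
    """
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


enable_gpu_memory_growth()
# import tensorflow as tf  # import only after the variable is set
print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```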

BramVanroy · June 12, 2020, 8:30pm

Thanks for the confirmation. Then I can safely let things run. Thanks again for all the continuous support!

BramVanroy · June 14, 2020, 11:36am

When running inference, I am getting the following warning. I think this is related to tying the in2 and target embeddings. Is it harmful in any way?

WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. Either the Trackable object references in the Python program have changed in an incompatible way, or the checkpoint was generated in an incompatible program.

Two checkpoint references resolved to different objects (<tf.Variable 'dual_source_transformer_1/embedding:0' shape=(29302, 512) dtype=float32, numpy=
array([[ 0.00684818,  0.00364716, -0.01369294, ..., -0.0140352 ,
         0.00345916,  0.00921109],
       [-0.0389531 , -0.01416353, -0.02679272, ..., -0.01685427,
        -0.02278348, -0.10971494],
       [ 0.01376689,  0.01407808,  0.01040961, ..., -0.00610482,
         0.00784335, -0.0126047 ],
       ...,
       [ 0.01562562, -0.04909495,  0.02456149, ..., -0.01563725,
        -0.00244068, -0.01827662],
       [-0.04773009, -0.06443136, -0.05259109, ..., -0.02209991,
        -0.00625305, -0.09353954],
       [-0.00408842,  0.00796524, -0.00671746, ...,  0.00455097,
        -0.00271277, -0.00612399]], dtype=float32)> and <tf.Variable 'dual_source_transformer_1/embedding:0' shape=(29302, 512) dtype=float32, numpy=
array([[ 0.00684818,  0.00364716, -0.01369294, ..., -0.0140352 ,
         0.00345916,  0.00921109],
       [-0.0389531 , -0.01416353, -0.02679272, ..., -0.01685427,
        -0.02278348, -0.10971494],
       [ 0.01376689,  0.01407808,  0.01040961, ..., -0.00610482,
         0.00784335, -0.0126047 ],
       ...,
       [ 0.01562562, -0.04909495,  0.02456149, ..., -0.01563725,
        -0.00244068, -0.01827662],
       [-0.04773009, -0.06443136, -0.05259109, ..., -0.02209991,
        -0.00625305, -0.09353954],
       [-0.00408842,  0.00796524, -0.00671746, ...,  0.00455097,
        -0.00271277, -0.00612399]], dtype=float32)>).

guillaumekln · June 14, 2020, 11:44am

I did not see this warning before. I will look to reproduce it.

Is the inference running fine after this warning?

BramVanroy · June 14, 2020, 11:48am

The inference runs fine, but performance is a bit lower than I expected (around 9 BLEU difference between evaluation and test so maybe not that bad). Not sure if it is related to the warning above. (It isn’t clear to me whether the warning indicates that loading the weights for embedding of in2 or target failed.)

Note: the warning occurs both for share_encoders set to False and True.