Using EmbeddingsSharingLevel in a dual source transformer

You should still configure both target_vocabulary and source_2_vocabulary, but in your case they will just point to the same vocabulary file.
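
For example (file names here are just placeholders), the data section of the YAML configuration for such a model could look like this, with source_2_vocabulary and target_vocabulary pointing to the same file:

```yaml
data:
  train_features_file:
    - train.in1.txt            # placeholder file names
    - train.in2.txt
  train_labels_file: train.target.txt
  source_1_vocabulary: in1-vocab.txt
  source_2_vocabulary: shared-vocab.txt   # same file as target_vocabulary
  target_vocabulary: shared-vocab.txt
```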

Actually, in the case of the dual source Transformer that was shared above, the two encoder outputs are not reduced, and the decoder will contain sequential attention layers over them, exactly like in this image:

[image: Transformer decoder with sequential attention layers over the two encoder outputs]

But as you mentioned, there are also ways to reduce the encoder outputs if needed.

My bad, I hadn't read the paper; I have now. That is some very interesting work for our use case. I do not fully understand the motivation for tying encoder states, though, except for saving compute. I assume that with the code example you used above, the embeddings of in2 are not tied to those of in1. But what about the encoders themselves? Are they tied? And if so, can they be untied (with the risk of OOM, I suppose)?

Thanks

See the argument share_encoders in the model example above, which you can disable.
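
For reference, a custom model definition along these lines would look roughly like the sketch below (a sketch only; the exact constructor arguments depend on your OpenNMT-tf version, and the layer sizes are just placeholders):

```python
import opennmt

class DualSourceTransformer(opennmt.models.Transformer):
    def __init__(self):
        super().__init__(
            # Two word inputs (in1 and in2), each encoded by its own encoder
            # unless share_encoders=True.
            source_inputter=opennmt.inputters.ParallelInputter(
                [
                    opennmt.inputters.WordEmbedder(embedding_size=512),
                    opennmt.inputters.WordEmbedder(embedding_size=512),
                ]
            ),
            target_inputter=opennmt.inputters.WordEmbedder(embedding_size=512),
            num_layers=6,
            num_units=512,
            num_heads=8,
            ffn_inner_dim=2048,
            dropout=0.1,
            attention_dropout=0.1,
            ffn_dropout=0.1,
            # Share the embedding matrix between all inputs and the target
            # (this is why all vocabularies can point to the same file).
            share_embeddings=opennmt.models.EmbeddingsSharingLevel.ALL,
            # Set to False to train two separate encoders (more parameters/memory).
            share_encoders=True,
        )

# If used as a custom model file (onmt-main --model ...), expose it as `model`.
model = DualSourceTransformer
```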

The main concern with this type of model is the memory usage, so sharing encoder weights helps but it does not always make sense.

I have everything I need for now. Thanks a lot. You are doing great work for the community by replying so thoroughly and quickly!

Hi @guillaumekln, sorry to bump this.

If I use the dual source transformer as you propose, does the system allow empty inputs? Can in1 be empty for some samples and in2 for others?

If one of the two is empty, the complete example will be filtered out during training.

Is there any way to make it work as I would expect? Perhaps just putting a single padding token in the “empty side”?

The idea is that you sometimes have additional data to learn from, but not always, and you still wish to make use of all the data.

Maybe the easiest is to preprocess your training data and replace empty lines with a custom token?
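
For example, a minimal preprocessing sketch (the file names and the <empty> token are arbitrary placeholders; the token just has to end up in the vocabulary):

```python
# Replace empty lines in the second input with a placeholder token so that
# the example is not filtered out during training.
EMPTY_TOKEN = "<empty>"  # arbitrary name, must be present in the vocabulary

with open("train.in2.txt") as src, open("train.in2.filled.txt", "w") as out:
    for line in src:
        line = line.rstrip("\n")
        out.write((line if line.strip() else EMPTY_TOKEN) + "\n")
```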

Indeed. My fear, though, is that this token will (in our case) be part of the target vocabulary, in which case it is an (unlikely) candidate in prediction even though it should never be predicted as a target token. My suggestion of a padding token was assuming that it never gets attention because it is filtered out by the attention mask, or at least that would be my hunch. (But then that may lead to zero-dimension tensors if a whole sequence consists only of padding tokens that are ignored.)

The issue I see is that the softmax in the attention layer will return NaN if a sequence is just padding.

The custom token should work OK, I think. As it would never appear on the target side, the probability of generating it is almost 0. And if it is generated, the prediction is likely unusable anyway because the model got lost.

Excellent. Thanks for your input!

Sorry to keep bringing this topic up but I think it’s all relevant to the same issue.

I implemented the model as you proposed and am currently training two versions with the same data and the same seed. The only difference is that share_encoders is True in one and False in the other. At the start of the process I can see the difference in parameters:

Tied:

  • Number of model parameters: 92,018,294
  • Number of model weights: 320 (trainable = 320, non trainable = 0)

Not tied:

  • Number of model parameters: 110,933,622
  • Number of model weights: 418 (trainable = 418, non trainable = 0)

This difference seems very small for a full second encoder. The small difference also shows in:

  • memory usage: both using exactly 15773MiB
  • speed: steps/s = 4.21, target words/s = 12100 (tied) vs steps/s = 4.06, target words/s = 12031 (not tied)

In terms of results (only the first evaluation, after 5000 steps): I don't care so much that the results are worse for the not-tied version (even though I would expect otherwise), but the other differences above do seem suspicious.

  • tied: loss = 2.149598 ; perplexity = 8.581405
  • not tied: loss = 2.452493 ; perplexity = 11.617277

These differences are so small that I wonder whether I made a mistake or whether the weight tying happens differently than I would expect. As I said, I am training two different models: exactly the same model, config, and seed, except for share_encoders. Any thoughts?

I’m not sure about the exact counts but I think that’s correct.

One decoder layer is almost twice as big as one encoder layer (see the image a few posts above). So it makes sense that removing one encoder removes 1/4 of the weights.
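
As a rough sanity check (assuming standard Transformer layers with biases and one final layer norm per stack), the parameter difference between your two runs matches one extra 6-layer, 512-unit encoder almost exactly:

```python
# Parameters of one 6-layer, 512-unit Transformer encoder stack.
d_model, d_ff, num_layers = 512, 2048, 6

attention = 4 * (d_model * d_model + d_model)       # Q, K, V, O projections
ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
layer_norms = 2 * 2 * d_model                       # 2 norms per layer (scale + bias)

per_layer = attention + ffn + layer_norms           # 3,152,384
encoder = num_layers * per_layer + 2 * d_model      # + final norm = 18,915,328

print(encoder)                                      # 18,915,328
print(110_933_622 - 92_018_294)                     # 18,915,328 (not tied - tied)
```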

TensorFlow uses all available GPU memory by default.
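
If you want the reported number to reflect what each run actually allocates, you can enable memory growth before any GPU op runs. This is a generic TensorFlow 2.x snippet, not anything specific to OpenNMT-tf:

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of reserving it all
# up front, so tools like nvidia-smi show the actual usage of each run.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```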

Thanks for the confirmation. Then I can safely let things run. Thanks again for all the continuous support!

When running inference, I am getting the following warning. I think this is related to tying the in2 and target embeddings. Is it harmful in any way?

WARNING:tensorflow:Inconsistent references when loading the checkpoint into this object graph. Either the Trackable object references in the Python program have changed in an incompatible way, or the checkpoint was generated in an incompatible program.

Two checkpoint references resolved to different objects (<tf.Variable 'dual_source_transformer_1/embedding:0' shape=(29302, 512) dtype=float32, numpy=
array([[ 0.00684818,  0.00364716, -0.01369294, ..., -0.0140352 ,
         0.00345916,  0.00921109],
       [-0.0389531 , -0.01416353, -0.02679272, ..., -0.01685427,
        -0.02278348, -0.10971494],
       [ 0.01376689,  0.01407808,  0.01040961, ..., -0.00610482,
         0.00784335, -0.0126047 ],
       ...,
       [ 0.01562562, -0.04909495,  0.02456149, ..., -0.01563725,
        -0.00244068, -0.01827662],
       [-0.04773009, -0.06443136, -0.05259109, ..., -0.02209991,
        -0.00625305, -0.09353954],
       [-0.00408842,  0.00796524, -0.00671746, ...,  0.00455097,
        -0.00271277, -0.00612399]], dtype=float32)> and <tf.Variable 'dual_source_transformer_1/embedding:0' shape=(29302, 512) dtype=float32, numpy=
array([[ 0.00684818,  0.00364716, -0.01369294, ..., -0.0140352 ,
         0.00345916,  0.00921109],
       [-0.0389531 , -0.01416353, -0.02679272, ..., -0.01685427,
        -0.02278348, -0.10971494],
       [ 0.01376689,  0.01407808,  0.01040961, ..., -0.00610482,
         0.00784335, -0.0126047 ],
       ...,
       [ 0.01562562, -0.04909495,  0.02456149, ..., -0.01563725,
        -0.00244068, -0.01827662],
       [-0.04773009, -0.06443136, -0.05259109, ..., -0.02209991,
        -0.00625305, -0.09353954],
       [-0.00408842,  0.00796524, -0.00671746, ...,  0.00455097,
        -0.00271277, -0.00612399]], dtype=float32)>).

I did not see this warning before. I will try to reproduce it.

Is the inference running fine after this warning?

The inference runs fine, but performance is a bit lower than I expected (around a 9 BLEU difference between evaluation and test, so maybe not that bad). Not sure if it is related to the warning above. (It isn't clear to me whether the warning means that loading the embedding weights for in2 or for the target failed.)

Note: the warning occurs both for share_encoders set to False and True.

You can ignore this warning. It is related to how we assign the embedding in the build method, which is not exactly the cleanest way to do it, but the embedding weight should have the correct value in the end.

@guillaumekln Hi Guillaume, I bumped into an issue when using the architecture above (with shared embeddings between all inputs and the target). I want to use the replace_unknown_target functionality by adding it to params, but during the first evaluation run it gives me this error. Any thoughts?

TypeError: replace_unknown_target is only defined when the source inputter is a WordEmbedder

Yes, this option is not implemented for multi-source inputs.