Combined embeddings as inputs

Hello all,

I am trying to mix word and document embeddings.

Let’s say:

  • I have pretrained and fixed classic word embeddings (dim 512).
  • My corpus is made of documents so I also have document embeddings for each document (dim 512).
  • For each word, I know which document it appears in.

I’ve heard that I can create a corpus like the one below and tell OpenNMT-tf to combine (let’s say concatenate, in my case) the embedding of each word with its corresponding document embedding during training/inference:

"The|doc1 cat|doc1 is|doc1 …
Fruits|doc2 are|doc2 delicious|doc2 …"

However, I don’t know where to start in order to train my model this way.

Thank you in advance,
Valentin

Hi,

You should define a model with a ParallelInputter as the input layer, for example a custom model that concatenates the input streams on the depth dimension.
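
As a starting point, the model definition could look roughly like this. It is only a minimal sketch assuming a recent OpenNMT-tf 2.x API: the file name, class name, and sizes are placeholders to adapt, and the pretrained word embeddings themselves are configured separately in the YAML data block.

# word_doc_transformer.py: minimal sketch, names and sizes are placeholders.
import opennmt

class WordDocTransformer(opennmt.models.Transformer):
    def __init__(self):
        super().__init__(
            source_inputter=opennmt.inputters.ParallelInputter(
                [
                    # Stream 1: the words themselves.
                    opennmt.inputters.WordEmbedder(embedding_size=512),
                    # Stream 2: one document ID token per word.
                    opennmt.inputters.WordEmbedder(embedding_size=512),
                ],
                # Concatenate both streams on the depth dimension.
                reducer=opennmt.layers.ConcatReducer(),
            ),
            target_inputter=opennmt.inputters.WordEmbedder(embedding_size=512),
            num_layers=6,
            # The encoder width should match the concatenated depth (512 + 512).
            num_units=1024,
            num_heads=8,
            ffn_inner_dim=4096,
        )

# onmt-main expects the file to expose a callable named "model".
model = WordDocTransformer

You would then point onmt-main to this file with the --model option.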

For your document embedding, you can either use a WordEmbedder and configure pretrained embeddings, or directly store the document vectors in a TFRecord file:

http://opennmt.net/OpenNMT-tf/data.html#vectors
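
If you go the TFRecord route, a conversion script along these lines could work. This is a rough sketch assuming the write_sequence_record helper from opennmt.inputters described on that page; the file names and the doc_vectors lookup are placeholders for however you store your data.

# build_doc_records.py: rough sketch, file names are placeholders.
import numpy as np
import tensorflow as tf
import opennmt

# Placeholder: load your real document vectors (doc ID -> 512-d vector) here.
doc_vectors = {"doc1": np.zeros(512, dtype=np.float32)}

with tf.io.TFRecordWriter("train_docs.records") as writer, \
        open("train_file") as sentences, \
        open("train_docids") as doc_ids:
    for sentence, doc_id in zip(sentences, doc_ids):
        num_tokens = len(sentence.split())
        # Repeat the document vector once per token so the time dimension
        # matches the word stream when both inputs are concatenated.
        matrix = np.tile(doc_vectors[doc_id.strip()], (num_tokens, 1))
        opennmt.inputters.write_sequence_record(matrix.astype(np.float32), writer)

On the model side, the second inputter would then be an opennmt.inputters.SequenceRecordInputter instead of a second WordEmbedder.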

Let me know how it goes.


Great answer, I will let you know how it goes

I have only one remaining question:
Is using the pattern “word|doc_number” in my training file the right way to do it?

I guess the first WordEmbedder will correspond to word and the second to doc_number, but is using the separator character the right approach?

Thanks

No, OpenNMT-tf removed this syntax as it can be error-prone. Instead, you should provide two separate and aligned files.
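
For example, if you know the document ID of each sentence, the second file can be generated by repeating that ID once per token. A small sketch (the file names are placeholders):

# make_doc_feature_file.py: small sketch, file names are placeholders.
with open("train_file") as sentences, \
        open("train_docids") as doc_ids, \
        open("train_file_docs", "w") as out:
    for sentence, doc_id in zip(sentences, doc_ids):
        tag = "<{}>".format(doc_id.strip())
        # One document tag per word, so both files stay token-aligned.
        out.write(" ".join([tag] * len(sentence.split())) + "\n")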


If I understand correctly, my yml config will have:

data:
  train_features_file:
    - train_file
    - train_file_docs

With train_file looking like

The cat is here .
I like dogs .

And train_file_docs looking like

<doc1> <doc1> <doc1> <doc1> <doc1>
<doc2> <doc2> <doc2> <doc2>

(Implying these sentences belong to different docs.)

I guess the order of these files in the yml corresponds to the order declared in the ParallelInputter?

Also, what happens if the alignment (word-wise) within a sentence is not correct?

Thanks for your time

Yes.

If the number of tokens is not the same, it will most likely crash at some point.
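
So it is worth checking the two files up front. A quick standalone sanity check (just a sketch, using the file names from your example) could be:

import itertools

with open("train_file") as words, open("train_file_docs") as docs:
    pairs = itertools.zip_longest(words, docs, fillvalue="")
    for line_number, (word_line, doc_line) in enumerate(pairs, start=1):
        if len(word_line.split()) != len(doc_line.split()):
            print("Token count mismatch on line", line_number)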

I was able to test your solution, and it seems to work fine

However, I have a problem related to what I’m trying to do: my document embeddings file is very large (dozens of GB).

When I train my model using this embeddings file, I get:

`ValueError: Cannot create a tensor proto whose content is larger than 2GB.`

I guess it is related to the size of my document embeddings file, since when I try with a small file there is no error.

Maybe this question needs its own thread; I will create one if you think that’s better.