Hello all,
I'm trying to mix word and document embeddings.
Let’s say:
I have pretrained and fixed classic word embeddings (dim 512).
My corpus is made of documents so I also have document embeddings for each document (dim 512).
I know, for each word, in which document it appears.
I've heard I can create a corpus like the one below and tell OpenNMT-tf to combine (say I want to concatenate here) the embedding of each word with its corresponding document embedding during training/inference:
"The|doc1 cat|doc1 is|doc1 …
Fruits|doc2 are|doc2 delicious|doc2 …"
However, I don't know where to start to train my model with this setup.
Thank you in advance,
Valentin
Hi,
You should define a model with a ParallelInputter as the input layer. See for example this custom model that concatenates 3 input streams on the depth dimension:
"""Defines a sequence to sequence model with multiple input features. For
example, this could be words, parts of speech, and lemmas that are embedded in
parallel and concatenated into a single input embedding. The features are
separate data files with separate vocabularies.
"""
import tensorflow as tf
import opennmt as onmt
def model():
return onmt.models.SequenceToSequence(
source_inputter=onmt.inputters.ParallelInputter([
onmt.inputters.WordEmbedder(
vocabulary_file_key="source_words_vocabulary",
embedding_size=512),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_1_vocabulary",
embedding_size=16),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_2_vocabulary",
This file has been truncated. show original
For your document embedding, you can either use a WordEmbedder and configure pretrained embeddings, or directly store the document vectors in a TFRecord file:
http://opennmt.net/OpenNMT-tf/data.html#vectors
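For reference, here is a minimal sketch of the TFRecord option, assuming the write_sequence_record helper described on that page; the doc_vectors dict, file name, and per-sentence token counts are placeholders. Each document vector is repeated once per token so the record stays aligned with the word stream:

import numpy as np
import tensorflow as tf
import opennmt as onmt

# Placeholder data: one 512-dim vector per document, plus the document id
# and token count of each training sentence, in corpus order.
doc_vectors = {"doc1": np.random.rand(512), "doc2": np.random.rand(512)}
sentences = [("doc1", 5), ("doc2", 4)]

writer = tf.python_io.TFRecordWriter("train_file_docs.records")
for doc_id, num_tokens in sentences:
    # Repeat the document vector once per token: shape [time, depth].
    vector = np.tile(doc_vectors[doc_id], (num_tokens, 1)).astype(np.float32)
    onmt.inputters.write_sequence_record(vector, writer)
writer.close()

On the model side, the corresponding WordEmbedder would then be replaced by a SequenceRecordInputter.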
Let me know how it goes.
Great answer, I will let you know how it goes
I have only one remaining question:
Is using the pattern "word|doc_number" in my training file the right way of doing it?
I guess the first WordEmbedder will correspond to word and the second to doc_number, but is using the separator character the right method?
Thanks
No, OpenNMT-tf removed this syntax, which can be error-prone. Instead, you should provide two separate and aligned files.
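If your corpus is currently in the pipe-annotated form, a small script along these lines can split it into the two aligned files (file names are just examples):

# Sketch: split a corpus in the old "word|doc" pipe syntax into two
# aligned files, one token stream per feature.
with open("train_annotated.txt") as annotated, \
     open("train_file", "w") as words, \
     open("train_file_docs", "w") as docs:
    for line in annotated:
        pairs = [token.rsplit("|", 1) for token in line.split()]
        words.write(" ".join(w for w, _ in pairs) + "\n")
        docs.write(" ".join(d for _, d in pairs) + "\n")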
If I understand correctly, my yml config will have:
data:
  train_features_file:
    - train_file
    - train_file_docs
With train_file looking like
The cat is here .
I like dogs .
And train_file_docs looking like
<doc1> <doc1> <doc1> <doc1> <doc1>
<doc2> <doc2> <doc2> <doc2>
(Implying these phrases belong to different docs)
I guess the order of these files in the yml corresponds to the order declared in the ParallelInputter?
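Something like this, if I adapt your example (vocabulary keys and embedding sizes are my guess):

import opennmt as onmt

# My guess at the source side: word embeddings concatenated with a
# document tag embedding (vocabulary keys and sizes are placeholders).
source_inputter = onmt.inputters.ParallelInputter([
    onmt.inputters.WordEmbedder(
        vocabulary_file_key="source_words_vocabulary",
        embedding_size=512),
    onmt.inputters.WordEmbedder(
        vocabulary_file_key="source_docs_vocabulary",
        embedding_size=512)],
    reducer=onmt.layers.ConcatReducer())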
Also, what happens if the token-wise alignment in a sentence is not correct?
Thanks for your time
Yes.
If the number of tokens is not the same, it will most likely crash at some point.
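A quick check along these lines can catch misaligned lines before training starts (file names taken from your config):

# Verify that both feature files have the same token count on every line.
with open("train_file") as words, open("train_file_docs") as docs:
    for i, (w_line, d_line) in enumerate(zip(words, docs), start=1):
        if len(w_line.split()) != len(d_line.split()):
            print("Token count mismatch on line", i)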
I was able to test your solution, and it seems to work fine
However, I have a problem related to what I'm trying to do: my document embeddings file is very large (dozens of GB).
When I train my model using this embeddings file, I get:
`ValueError: Cannot create a tensor proto whose content is larger than 2GB.`
And I guess it is related to the size of my document embedding file, since when I try with a small file the error does not appear.
Maybe this question needs its own thread; I will create one if you think it's better.