Hello all,
I'm trying to mix word and document embeddings.
Let’s say:
I have pretrained and fixed classic word embeddings (dim 512).
My corpus is made of documents so I also have document embeddings for each document (dim 512).
I know, for each word, in which document it appears.
I've heard I can create a corpus like the one below and tell OpenNMT-tf to combine (say I want to concatenate here) the embedding of each word with its corresponding document embedding during training/inference:
"The|doc1 cat|doc1 is|doc1 …
Fruits|doc2 are|doc2 delicious|doc2 …"
However, I don't know where to start to train my model with this setup.
Thank you in advance,
Valentin
Hi,
You should define a model with a ParallelInputter as the input layer. See for example this custom model that concatenates 3 input streams on the depth dimension:
"""Defines a sequence to sequence model with multiple input features. For
example, this could be words, parts of speech, and lemmas that are embedded in
parallel and concatenated into a single input embedding. The features are
separate data files with separate vocabularies.
"""
import tensorflow as tf
import opennmt as onmt
def model():
return onmt.models.SequenceToSequence(
source_inputter=onmt.inputters.ParallelInputter([
onmt.inputters.WordEmbedder(
vocabulary_file_key="source_words_vocabulary",
embedding_size=512),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_1_vocabulary",
embedding_size=16),
onmt.inputters.WordEmbedder(
vocabulary_file_key="feature_2_vocabulary",
This file has been truncated. show original
For your document embedding, you can either use a WordEmbedder and configure pretrained embeddings, or directly store the document vectors in a TFRecord file:
http://opennmt.net/OpenNMT-tf/data.html#vectors
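For reference, here is a minimal sketch of the TFRecord option, assuming the write_sequence_record helper described on that page; the doc_vectors dict, file name, and per-sentence token counts are placeholders. Each document vector is repeated once per token so the record stays aligned with the word stream:

import numpy as np
import tensorflow as tf
import opennmt as onmt

# Placeholder data: one 512-dim vector per document, plus the document id
# and token count of each training sentence, in corpus order.
doc_vectors = {"doc1": np.random.rand(512), "doc2": np.random.rand(512)}
sentences = [("doc1", 5), ("doc2", 4)]

writer = tf.python_io.TFRecordWriter("train_file_docs.records")
for doc_id, num_tokens in sentences:
    # Repeat the document vector once per token: shape [time, depth].
    vector = np.tile(doc_vectors[doc_id], (num_tokens, 1)).astype(np.float32)
    onmt.inputters.write_sequence_record(vector, writer)
writer.close()

On the model side, the corresponding WordEmbedder would then be replaced by a SequenceRecordInputter.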
Let me know how it goes.
Great answer, I will let you know how it goes
I have only one remaining question:
Is using the pattern "word|doc_number" in my training file the right way of doing it?
I guess the first WordEmbedder will correspond to word and the second to doc_number, but is using the separator character the right method?
Thanks
No, OpenNMT-tf removed this syntax, which can be error-prone. Instead, you should provide two separate and aligned files.
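If your corpus is currently in the pipe-annotated form, a small script along these lines can split it into the two aligned files (file names are just examples):

# Sketch: split a corpus in the old "word|doc" pipe syntax into two
# aligned files, one token stream per feature.
with open("train_annotated.txt") as annotated, \
     open("train_file", "w") as words, \
     open("train_file_docs", "w") as docs:
    for line in annotated:
        pairs = [token.rsplit("|", 1) for token in line.split()]
        words.write(" ".join(w for w, _ in pairs) + "\n")
        docs.write(" ".join(d for _, d in pairs) + "\n")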
If I understand correctly, my yml config will have:
data:
  train_features_file:
    - train_file
    - train_file_docs
With train_file looking like
The cat is here .
I like dogs .
And train_file_docs looking like
<doc1> <doc1> <doc1> <doc1> <doc1>
<doc2> <doc2> <doc2> <doc2>
(Implying these phrases belong to different docs)
I guess the order of these files in the yml corresponds to the order declared in the ParallelInputter?
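Something like this, if I adapt your example (vocabulary keys and embedding sizes are my guess):

import opennmt as onmt

# My guess at the source side: word embeddings concatenated with a
# document tag embedding (vocabulary keys and sizes are placeholders).
source_inputter = onmt.inputters.ParallelInputter([
    onmt.inputters.WordEmbedder(
        vocabulary_file_key="source_words_vocabulary",
        embedding_size=512),
    onmt.inputters.WordEmbedder(
        vocabulary_file_key="source_docs_vocabulary",
        embedding_size=512)],
    reducer=onmt.layers.ConcatReducer())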
Also, what happens if the token-wise alignment in a sentence is not correct?
Thanks for your time
Yes.
If the number of tokens is not the same, it will most likely crash at some point.
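A quick check along these lines can catch misaligned lines before training starts (file names taken from your config):

# Verify that both feature files have the same token count on every line.
with open("train_file") as words, open("train_file_docs") as docs:
    for i, (w_line, d_line) in enumerate(zip(words, docs), start=1):
        if len(w_line.split()) != len(d_line.split()):
            print("Token count mismatch on line", i)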
I was able to test your solution, and it seems to work fine
However, I have a problem related to what I'm trying to do: my document embeddings file is very large (dozens of GB).
When I train my model using this embeddings file, I get:
`ValueError: Cannot create a tensor proto whose content is larger than 2GB.`
And I guess it is related to the size of my document embedding file, since when I try with a small file the error does not appear.
Maybe this question needs its own thread; I will create one if you think it's better.