New Feature: Embeddings and Visualization

By request of @Etienne38 we now have a tool for extracting embeddings from the system.

th tools/extract_embeddings.lua -model model.t7

It will produce two files, src_embeddings.txt and tgt_embeddings.txt.

Each file uses the standard format:

<word> <val1> <val2> <val3> ....

One neat aspect of this is that you can use standard visualization tools to view the embeddings. For instance, here is an example where we use TensorBoard to show a 3D t-SNE embedding of all the verbs from the source side of our summarization model.

Here is the TensorFlow code that I used (after pip install tensorflow and running extract_embeddings.lua):

import tensorflow as tf
import numpy as np
import sys, os
import nltk

def read_vecs(filename):
    # Each line has the form: <word> <val1> <val2> ...
    words = []
    values = []
    for l in open(filename):
        t = l.rstrip().split(" ")
        words.append(t[0])
        values.append([float(a) for a in t[1:]])
    return words, np.array(values)

def write_metadata(filename, words):
    # One row per word: the word and the first two letters of its POS tag.
    with open(filename, 'w') as w:
        for word in words:
            w.write(word + "\t" + nltk.pos_tag([word])[0][1][:2] + "\n")

src_words, src_values = read_vecs(sys.argv[1] + "/src_embeddings.txt")
tgt_words, tgt_values = read_vecs(sys.argv[1] + "/tgt_embeddings.txt")
src_embedding_var = tf.Variable(src_values, name="src_embeddings")
tgt_embedding_var = tf.Variable(tgt_values, name="tgt_embeddings")
init = tf.global_variables_initializer()

# Initialize the variables and save them to a checkpoint for TensorBoard.
with tf.Session() as session:
    session.run(init)
    saver = tf.train.Saver()
    saver.save(session, "/tmp/model.ckpt", 1)

write_metadata("/tmp/src_metadata.tsv", src_words)
write_metadata("/tmp/tgt_metadata.tsv", tgt_words)

# Point the embedding projector at the saved variables and their metadata.
from tensorflow.contrib.tensorboard.plugins import projector
summary_writer = tf.summary.FileWriter("/tmp/")
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
embedding.tensor_name = src_embedding_var.name
embedding.metadata_path = '/tmp/src_metadata.tsv'
embedding = config.embeddings.add()
embedding.tensor_name = tgt_embedding_var.name
embedding.metadata_path = '/tmp/tgt_metadata.tsv'
projector.visualize_embeddings(summary_writer, config)
os.system("tensorboard --logdir=/tmp/")

extract_embeddings works well, but when run on a GPU model it is very, very slow. Running release_model first, so that it processes a CPU model, speeds it up.

It would be nice to also extract feature embeddings.

I got TensorBoard working, with a few adjustments:

  1. install tensorflow with “sudo” in order to properly get tensorboard (!?)
    sudo pip install tensorflow
  2. also install nltk
    sudo pip install nltk
  3. add this to the script:
    nltk.download('averaged_perceptron_tagger')


The first line of each metadata TSV file needs to contain the column labels. Without it, all lines are shifted by one, and the viewing tool complains about a difference of 1 entry between the files.

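def write_metadata(filename, words):
    with open(filename, 'w') as w:
        # Header row with the column labels (the label names here are illustrative).
        w.write("word" + "\t" + "pos" + "\n")
        for word in words:
            w.write(word + "\t" + nltk.pos_tag([word])[0][1][:2] + "\n")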
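
A quick way to catch this off-by-one mismatch before opening TensorBoard is to compare line counts. This is only a minimal sketch, assuming one vector per line in the embeddings file and a single header row in the metadata file; count_lines and check_metadata are hypothetical helpers, not part of OpenNMT or TensorFlow.

```python
def count_lines(filename):
    """Count the non-empty lines in a text file."""
    with open(filename) as f:
        return sum(1 for line in f if line.strip())


def check_metadata(embeddings_file, metadata_file, has_header=True):
    """Return True if the metadata has one data row per embedding vector.

    Hypothetical helper: assumes one vector per line in embeddings_file
    and, when has_header is True, one header row in metadata_file.
    """
    n_vectors = count_lines(embeddings_file)
    n_rows = count_lines(metadata_file) - (1 if has_header else 0)
    return n_vectors == n_rows
```

For example, check_metadata("src_embeddings.txt", "/tmp/src_metadata.tsv") should return True once the header line is written.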

Oh good point. I will have it automatically force CPU mode in extract_embeddings.

The nltk step is slow as well… Maybe I will remove that from the visualizer.

Yes, we can do features. Would that be helpful? Just modify my :apply function to pull out that lookup table as well, and use the feature dict to get the names.

Super cool :thumbsup: !

I’m experimenting with multiplexed streams, as done here:

Thus, the feature streams are nearly as important as the main word stream. To understand what the network is doing with them, I need to have a look at the learned embeddings.

I succeeded with the very first step, but after that, nothing seems that simple…

Maybe bother me on I can help out if you show me what you have…

Knowing this…

…having an export of ONMT embeddings is of much less interest to me. I would rather go back to the code I started writing to build my own embeddings, and configure ONMT to use them with the fixed embeddings option.


Not sure I understand. How were you expecting word vectors to work?

I expected them to define a good data topology in input/output spaces, as explained here:


Oh I see. Yeah, embeddings learned in a (bi)RNN will have different properties than word2vec. I am not sure if you can say one is better or worse. One of the powers of word2vec is that it is a mostly linear model, so the relationships are a bit more interpretable.

Word2vec builds a lookup table, and I don’t see why there should be anything linear about it. Did I misunderstand something?

Its learning objective is bilinear in the embeddings, whereas an RNN is highly non-linear.
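
To make the "bilinear" point concrete: in skip-gram style word2vec, the score that the objective is built on for a (word, context) pair is a dot product of two embedding vectors, which is linear in each vector when the other is held fixed. The sketch below checks this numerically in plain Python; the vectors and the score and combine functions are illustrative only and are not taken from any word2vec implementation.

```python
def score(word_vec, context_vec):
    """Dot product of a word vector and a context vector."""
    return sum(w * c for w, c in zip(word_vec, context_vec))


def combine(a, x, b, y):
    """Return the linear combination a*x + b*y of two vectors."""
    return [a * xi + b * yi for xi, yi in zip(x, y)]


# Made-up vectors and coefficients, for illustration only.
v1 = [1.0, 2.0, -1.0]
v2 = [0.5, -3.0, 2.0]
u = [2.0, 0.0, 1.0]
a, b = 3.0, -2.0

# Linearity in the first argument, with the second held fixed.
lhs = score(combine(a, v1, b, v2), u)
rhs = a * score(v1, u) + b * score(v2, u)
```

Here lhs and rhs agree, and the same holds with the roles of the two arguments swapped. An RNN applies non-linearities between the lookup and the output, so no such identity holds for its embeddings.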

I had one problem:

ValueError: Cannot create a tensor proto whose content is larger than 2GB.

Does anyone know how to solve this?

Thank you!
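
One way to sidestep this, assuming the error comes from the full embedding matrix being stored inside the graph as a single large constant, is to visualize only part of the vocabulary. The sketch below is a variant of read_vecs that stops after a fixed number of lines; read_vecs_limited and max_words are hypothetical names, and it assumes the words you care about come first in the file.

```python
def read_vecs_limited(filename, max_words):
    """Read at most max_words embeddings from a '<word> <val1> <val2> ...' file."""
    words = []
    values = []
    with open(filename) as f:
        for line in f:
            if len(words) >= max_words:
                break
            t = line.rstrip().split(" ")
            if len(t) < 2:
                continue  # skip blank or malformed lines
            words.append(t[0])
            values.append([float(a) for a in t[1:]])
    return words, values
```

Converting the result with np.array(values, dtype=np.float32) instead of the default float64 also halves the size of the matrix.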

Is this tool for extracting word embeddings ONLY implemented in the Lua version of OpenNMT? Does the Python version have the same tool? Any clue? Thanks!

(Yuan-Lu Chen) #17

Thank you so much anyway!