New Feature: Embeddings and Visualization

By request of @Etienne38 we now have a tool for extracting embeddings from the system.

th tools/extract_embeddings.lua -model model.t7

It will produce two files, src_embeddings.txt and tgt_embeddings.txt.

Each file uses the standard format:

<word> <val1> <val2> <val3> ....
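For example, a file in this format can be parsed with a few lines of Python (a minimal sketch; the sample words and values below are made up for illustration):

```python
import numpy as np

def parse_embedding_lines(lines):
    """Parse lines of '<word> <val1> <val2> ...' into a word list and a matrix."""
    words, rows = [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        words.append(parts[0])
        rows.append([float(v) for v in parts[1:]])
    return words, np.array(rows)

# Illustrative sample in the same format as src_embeddings.txt
sample = ["the 0.12 -0.53 0.07", "cat 0.31 0.22 -0.10"]
words, vecs = parse_embedding_lines(sample)
print(words)       # ['the', 'cat']
print(vecs.shape)  # (2, 3)
```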

One neat aspect of this is that you can use standard visualization tools to view the embeddings. For instance, here is an example where we use TensorBoard to show a 3D t-SNE embedding of all the verbs from the source side of our summarization model (https://s3.amazonaws.com/opennmt-models/textsum_epoch7_14.69_release.t7).

Here is the TensorFlow code that I used (after pip install tensorflow and running extract_embeddings.lua):

import tensorflow as tf
import numpy as np
import sys, os
import nltk

def read_vecs(filename):
    # Read '<word> <val1> <val2> ...' lines into a word list and a value matrix
    words = []
    values = []
    for l in open(filename):
        t = l.split()
        words.append(t[0])
        values.append([float(a) for a in t[1:]])
    return words, np.array(values)

def write_metadata(filename, words):
    # One word per line, tagged with its coarse POS (first two letters of the tag)
    with open(filename, 'w') as w:
        for word in words:
            w.write(word + "\t" + nltk.pos_tag([word])[0][1][:2] + "\n")

src_words, src_values = read_vecs(sys.argv[1] + "/src_embeddings.txt")
tgt_words, tgt_values = read_vecs(sys.argv[1] + "/tgt_embeddings.txt")

tf.reset_default_graph()
src_embedding_var = tf.Variable(src_values, name="src_embeddings")
tgt_embedding_var = tf.Variable(tgt_values, name="tgt_embeddings")
init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    saver = tf.train.Saver()
    saver.save(session, "/tmp/model.ckpt", 1)

write_metadata("/tmp/src_metadata.tsv", src_words)
write_metadata("/tmp/tgt_metadata.tsv", tgt_words)

from tensorflow.contrib.tensorboard.plugins import projector
summary_writer = tf.summary.FileWriter("/tmp/")

config = projector.ProjectorConfig()

# Register both embedding tensors with the projector, each with its own metadata
embedding = config.embeddings.add()
embedding.tensor_name = src_embedding_var.name
embedding.metadata_path = '/tmp/src_metadata.tsv'

embedding = config.embeddings.add()
embedding.tensor_name = tgt_embedding_var.name
embedding.metadata_path = '/tmp/tgt_metadata.tsv'

projector.visualize_embeddings(summary_writer, config)
os.system("tensorboard --logdir=/tmp/")

extract_embeddings works well. But when done from a GPU model, it's very, very slow. Running release_model first, to get a CPU model, speeds it up.

It would be nice to also extract feature embeddings.

I succeeded in getting TensorBoard working, with some tweaks:

  1. "sudo" the tensorflow installation in order to properly get tensorboard (!?)
    sudo pip install tensorflow
  2. also install nltk
    sudo pip install nltk
  3. add this to the script
    nltk.download('averaged_perceptron_tagger')

:wink:

The first line of each metadata TSV file needs to be the column labels. Without it, all lines are shifted by one, and the viewing tool complains about a difference of 1 entry between the files.

def write_metadata(filename, words):
    with open(filename, 'w') as w:
        w.write("Word\tPOS\n")
        for word in words:
            w.write(word + "\t" + nltk.pos_tag([word])[0][1][:2] + "\n")

Oh good point. I will have it automatically force CPU mode in extract_embeddings.

The nltk step is slow as well… Maybe I will remove that from the visualizer.

Yes, we can do features. Would that be helpful? Just modify my :apply function to pull out that lookup table as well, and use the feature dict to get the names.

Super cool :thumbsup: !

I'm experimenting with multiplexed streams, like done here:

Thus, the feature streams are nearly as important as the main word stream. To understand what the network is doing with them, I need to have a look at the learned embeddings.

I succeeded with the very first step, but then, it all seems not that simple…
:stuck_out_tongue:

Feel free to bother me on https://gitter.im/OpenNMT/openmt. I can help out if you show me what you have…

Knowing this…

…having an export of ONMT embeddings is of much less interest to me. I would rather now get back to the code I started writing to build my own embeddings, and configure ONMT to use them with the fixed embeddings option.

:slight_smile:

Not sure I understand. How were you expecting word vectors to work?

I expected them to define a good data topology in input/output spaces, as explained here:


:wink:

Oh I see. Yeah, embeddings learned in a (bi)RNN will have different properties than word2vec. I am not sure if you can say one is better or worse. One of the powers of word2vec is that it is a mostly linear model, so the relationships are a bit more interpretable.

Word2vec is building a lookup table, and I don't see why it should have anything linear about it. Did I misunderstand something?

Its learning objective is bilinear in the embeddings, whereas an RNN is highly non-linear.
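To illustrate with a toy sketch (the vectors below are made up, and this is only the scoring function, not the full word2vec training objective): skip-gram scores a (word, context) pair with a plain dot product, which is linear in each embedding separately:

```python
import numpy as np

# Made-up word and context embeddings, purely for illustration
u1 = np.array([0.5, -1.0, 2.0])
u2 = np.array([1.0, 0.0, -1.0])
v = np.array([1.5, 0.5, -0.5])

def score(u, v):
    # Skip-gram scores a (word, context) pair by a dot product
    return float(np.dot(u, v))

# Bilinearity: with v fixed, the score is linear in the word embedding
a, b = 2.0, 3.0
lhs = score(a * u1 + b * u2, v)
rhs = a * score(u1, v) + b * score(u2, v)
print(np.isclose(lhs, rhs))  # True
```

An RNN, by contrast, passes the embeddings through non-linear activations at every step, so no such clean additive structure survives.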

I had one problem:

ValueError: Cannot create a tensor proto whose content is larger than 2GB.

Does anyone know how to solve this?

Thank you!
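One common workaround (a sketch, not tested against the script above): the 2 GB limit applies to a single tensor proto baked into the graph, so you can either initialize the variable from a placeholder via feed_dict instead of from the numpy constant, or split the embedding matrix into several smaller variables. The chunking arithmetic looks like this:

```python
import numpy as np

TWO_GB = 2 * 1024 ** 3  # TensorFlow's protobuf limit on a single constant, in bytes

def split_under_limit(values, limit=TWO_GB):
    """Split a (num_words, dim) matrix into row chunks, each below `limit` bytes."""
    bytes_per_row = values.shape[1] * values.dtype.itemsize
    rows_per_chunk = max(1, limit // bytes_per_row)
    n_chunks = int(np.ceil(values.shape[0] / rows_per_chunk))
    return np.array_split(values, n_chunks)

# Tiny demo with an artificially small "limit" of 100 bytes
demo = np.zeros((10, 5), dtype=np.float32)   # 20 bytes per row, 200 bytes total
chunks = split_under_limit(demo, limit=100)
print(len(chunks))                           # 2
print(all(c.nbytes <= 100 for c in chunks))  # True
```

Each chunk can then be wrapped in its own tf.Variable (e.g. hypothetical names like src_embeddings_0, src_embeddings_1) and each registered with the projector.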

Hi,

Is this embedding-extraction tool ONLY implemented in the Lua version of OpenNMT? Does the Python version have the same tool? Any clue? Thanks!

Thanks! This script is indeed in the 'tools' folder ('/OpenNMT-py/tools'). I just don't find it mentioned in the documentation (http://opennmt.net/OpenNMT-py/main.html).

Thank you so much anyway!

Hello. I am currently using OpenNMT-tf. I wonder, is this feature also available in OpenNMT-tf? Thanks in advance.

Yes, embeddings can be visualized in TensorBoard during training. Just start a TensorBoard instance and click on "Projector".

How about the extraction? Can we extract the embedding from the system?