New Feature: Embeddings and Visualization


(srush) #1

At the request of @Etienne38, we now have a tool for extracting embeddings from the system.

th tools/extract_embeddings.lua -model model.t7

It will produce two files, src_embeddings.txt and tgt_embeddings.txt.

Each file uses the standard format:

<word> <val1> <val2> <val3> ....

One neat aspect of this is that you can use standard visualization tools to view the embeddings. For instance, here is an example where we use TensorBoard to show a 3D t-SNE embedding of all the verbs from the source side of our summarization model (https://s3.amazonaws.com/opennmt-models/textsum_epoch7_14.69_release.t7).

Here is the TensorFlow code I used (after pip install tensorflow and running extract_embeddings.lua):

import tensorflow as tf
import numpy as np
import sys, os
import nltk

def read_vecs(filename):
    # Parse "<word> <val1> <val2> ..." lines into a word list and a matrix.
    words = []
    values = []
    for l in open(filename):
        t = l.split(" ")
        words.append(t[0])
        values.append([float(a) for a in t[1:]])
    return words, np.array(values)

def write_metadata(filename, words):
    # Tag each word with its (truncated) POS so TensorBoard can color by it.
    with open(filename, 'w') as w:
        for word in words:
            w.write(word + "\t" + nltk.pos_tag([word])[0][1][:2] + "\n")

src_words, src_values = read_vecs(sys.argv[1] + "/src_embeddings.txt")
tgt_words, tgt_values = read_vecs(sys.argv[1] + "/tgt_embeddings.txt")

# Store the embeddings as TF variables and checkpoint them for the projector.
tf.reset_default_graph()
src_embedding_var = tf.Variable(src_values, name="src_embeddings")
tgt_embedding_var = tf.Variable(tgt_values, name="tgt_embeddings")
init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    saver = tf.train.Saver()
    saver.save(session, "/tmp/model.ckpt", 1)

write_metadata("/tmp/src_metadata.tsv", src_words)
write_metadata("/tmp/tgt_metadata.tsv", tgt_words)

from tensorflow.contrib.tensorboard.plugins import projector
summary_writer = tf.summary.FileWriter("/tmp/")

config = projector.ProjectorConfig()

embedding = config.embeddings.add()
embedding.tensor_name = src_embedding_var.name
embedding.metadata_path = '/tmp/src_metadata.tsv'

embedding = config.embeddings.add()
embedding.tensor_name = tgt_embedding_var.name
embedding.metadata_path = '/tmp/tgt_metadata.tsv'

projector.visualize_embeddings(summary_writer, config)
os.system("tensorboard --logdir=/tmp/")  # the flag is --logdir, not --log
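
Once TensorBoard is up (http://localhost:6006 by default), the vectors appear in the embedding projector tab, where you can switch between PCA and t-SNE projections.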

(Etienne Monneret) #2

extract_embeddings works well. But when run on a GPU model, it is very, very slow. Running release_model first, so that it processes a CPU model, speeds it up considerably.
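
For reference, the two-step workflow would look something like this (the exact flags and the _release output name are my assumptions; check th tools/release_model.lua -h):

th tools/release_model.lua -model model.t7 -gpuid 1
th tools/extract_embeddings.lua -model model_release.t7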

It would be nice to also extract feature embeddings.

I succeeded in making TensorBoard work, with a few adjustments:

  1. “sudo” the TensorFlow installation in order to properly get tensorboard (!?)
    sudo pip install tensorflow
  2. also install nltk:
    sudo pip install nltk
  3. add this to the script (see the snippet below):
    nltk.download('averaged_perceptron_tagger')
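
A minimal sketch of where that call fits, assuming the visualization script from post #1:

import nltk

# One-time download of the POS tagger model used by nltk.pos_tag;
# without it, write_metadata fails with a LookupError.
nltk.download('averaged_perceptron_tagger')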

:wink:


(Etienne Monneret) #3

The first line of each metadata TSV file needs to hold the column labels. Without it, all rows are shifted by one, and the viewer complains about a one-entry difference between the files.

def write_metadata(filename, words):
    with open(filename, 'w') as w:
        # Header row: without it the projector misaligns rows by one.
        w.write("Word\tPOS\n")
        for word in words:
            w.write(word + "\t" + nltk.pos_tag([word])[0][1][:2] + "\n")

(srush) #4

Oh good point. I will have it automatically force CPU mode in extract_embeddings.

The nltk step is slow as well… Maybe I will remove that from the visualizer.

Yes, we can do features. Would that be helpful? Just modify my :apply function to pull out that lookup table as well, and use the feature dict to get the names.


(jean.senellart) #5

Super cool :thumbsup: !


(Etienne Monneret) #6

I’m experimenting with multiplexed streams, as done here:

The feature streams are thus nearly as important as the main word stream. To understand what the network is doing with them, I need to look at the embeddings it builds.

I succeeded with the very first step, but beyond that, things don’t seem so simple…
:stuck_out_tongue:


(srush) #7

Maybe bother me on https://gitter.im/OpenNMT/openmt. I can help out if you show me what you have…


(Etienne Monneret) #8

Knowing this…

…having an export of ONMT embeddings is of much less interest to me. I would rather get back to the code I started writing to build my own embeddings, and configure ONMT to use them with the fixed-embeddings option.

:slight_smile:


(srush) #9

Not sure I understand. How were you expecting word vectors to work?


(Etienne Monneret) #10

I expected them to define a good data topology in input/output spaces, as explained here:


:wink:


(srush) #11

Oh I see. Yeah, embeddings learned in a (bi)RNN will have different properties from word2vec’s. I am not sure you can say one is better or worse. One of the strengths of word2vec is that it is a mostly linear model, so the relationships are a bit more interpretable.


(Etienne Monneret) #12

Word2vec builds a lookup table, and I don’t see why it should have anything linear about it. Did I misunderstand something?


(srush) #13

Its learning objective is bilinear in the embeddings, whereas an RNN is highly non-linear.
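
For concreteness (notation mine, not from the thread), the skip-gram objective scores a center/context pair with a dot product:

$$ p(c \mid w) = \frac{\exp(u_c^\top v_w)}{\sum_{c'} \exp(u_{c'}^\top v_w)} $$

The score $u_c^\top v_w$ is linear in $v_w$ with $u_c$ fixed, and vice versa, hence bilinear; an RNN instead composes embeddings through stacked non-linearities.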


(Max Sobroza) #14

I had one problem:

ValueError: Cannot create a tensor proto whose content is larger than 2GB.

Does anyone know how to solve this?

Thank you!
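
Not an answer from the thread, but the usual cause is that tf.Variable(src_values) serializes the whole numpy array into the graph definition, and TF graph protos are capped at 2 GB. A sketch of the standard workaround, feeding the values through a placeholder at initialization time (the values array here is a hypothetical stand-in for src_values from the script above):

import tensorflow as tf
import numpy as np

# Hypothetical stand-in for src_values loaded by read_vecs.
values = np.random.rand(10000, 500).astype(np.float32)

# Initialize from a cheap zeros op, then assign the real values through a
# placeholder so the array never enters the serialized graph proto.
embedding_var = tf.Variable(
    tf.zeros(values.shape, dtype=tf.float32), name="src_embeddings")
value_ph = tf.placeholder(tf.float32, shape=values.shape)
assign_op = embedding_var.assign(value_ph)

with tf.Session() as session:
    session.run(assign_op, feed_dict={value_ph: values})
    saver = tf.train.Saver([embedding_var])
    saver.save(session, "/tmp/model.ckpt", 1)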


(Yuan-Lu Chen) #15

Hi,

Is this tool for extracting word embeddings ONLY implemented in the Lua version of OpenNMT? Does the Python version have the same tool? Any clue? Thanks!


(srush) #16

(Yuan-Lu Chen) #17

Thanks! This script is indeed in the ‘tools’ folder (‘/OpenNMT-py/tools’). I just didn’t find it mentioned in the documentation (http://opennmt.net/OpenNMT-py/main.html).

Thank you so much anyway!