New Feature: Embeddings and Visualization

srush · February 26, 2017, 6:33pm

By request of @Etienne38 we now have a tool for extracting embeddings from the system.

th tools/extract_embeddings.lua -model model.t7

It will produce two files, src_embeddings.txt and tgt_embeddings.txt.

Each file uses the standard format, one word per line:

<word> <val1> <val2> <val3> ....

One neat aspect of this is that you can use standard visualization tools to view the embeddings. For instance, here is an example where we use TensorBoard to show a 3D t-SNE embedding of all the verbs from the source side of our summarization model (https://s3.amazonaws.com/opennmt-models/textsum_epoch7_14.69_release.t7).

Here is the TensorFlow code that I used (after running pip install tensorflow and extract_embeddings.lua):

import tensorflow as tf
import numpy as np
import sys, os
import nltk

def read_vecs(filename):
    # Each line has the form "<word> <val1> <val2> ...".
    words = []
    values = []
    with open(filename) as f:
        for l in f:
            # Strip the trailing newline/space so float() never sees an empty token.
            t = l.rstrip().split(" ")
            words.append(t[0])
            values.append([float(a) for a in t[1:]])
    return words, np.array(values)

def write_metadata(filename, words):
    # One row per word: the word and the first two letters of its POS tag.
    with open(filename, 'w') as w:
        for word in words:
            w.write(word + "\t" + nltk.pos_tag([word])[0][1][:2] + "\n")

# sys.argv[1] is the directory holding the extracted embedding files.
src_words, src_values = read_vecs(sys.argv[1] + "/src_embeddings.txt")
tgt_words, tgt_values = read_vecs(sys.argv[1] + "/tgt_embeddings.txt")

# Store both embedding matrices as variables in a checkpoint for TensorBoard.
tf.reset_default_graph()
src_embedding_var = tf.Variable(src_values, name="src_embeddings")
tgt_embedding_var = tf.Variable(tgt_values, name="tgt_embeddings")
init = tf.global_variables_initializer()
with tf.Session() as session:
    session.run(init)
    saver = tf.train.Saver()
    saver.save(session, "/tmp/model.ckpt", 1)

write_metadata("/tmp/src_metadata.tsv", src_words)
write_metadata("/tmp/tgt_metadata.tsv", tgt_words)

from tensorflow.contrib.tensorboard.plugins import projector
summary_writer = tf.summary.FileWriter("/tmp/")

# Point the projector at each variable and its metadata file.
config = projector.ProjectorConfig()

embedding = config.embeddings.add()
embedding.tensor_name = src_embedding_var.name
embedding.metadata_path = '/tmp/src_metadata.tsv'

embedding = config.embeddings.add()
embedding.tensor_name = tgt_embedding_var.name
embedding.metadata_path = '/tmp/tgt_metadata.tsv'

projector.visualize_embeddings(summary_writer, config)
os.system("tensorboard --logdir=/tmp/")

Etienne38 · February 27, 2017, 12:54pm

extract_embeddings works well. But when run on a GPU model, it is very, very slow. Running release_model first, so that the extraction works from a CPU model, speeds it up.

It would be nice to also extract feature embeddings.

I got TensorBoard working, with a few adjustments:

Install tensorflow with sudo in order to properly get tensorboard (!?):
sudo pip install tensorflow
Also install nltk:
sudo pip install nltk
Then add this to the script:
nltk.download('averaged_perceptron_tagger')

Etienne38 · February 27, 2017, 3:42pm

The first line of each metadata TSV file needs to hold the column labels. Without it, all lines are shifted, and the viewing tool complains about a difference of 1 entry between the files.

def write_metadata(filename, words):
    with open(filename, 'w') as w:
        w.write("Word\tPOS\n")
        for word in words:
            w.write(word + "\t" + nltk.pos_tag([word])[0][1][:2] + "\n")

srush · February 27, 2017, 3:46pm

Oh good point. I will have it automatically force CPU mode in extract_embeddings.

The nltk step is slow as well… Maybe I will remove that from the visualizer.
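A minimal sketch of what dropping the nltk step could look like, assuming the POS column is simply omitted. The function name write_metadata_words_only is hypothetical and not part of the original script. As far as I know, TensorBoard does not expect a header row when the metadata file has a single column, so none is written here.

```python
def write_metadata_words_only(filename, words):
    """Write one word per line, with no POS column and no nltk call.

    Assumption: a single-column metadata file needs no header row.
    """
    with open(filename, "w", encoding="utf-8") as out:
        for word in words:
            out.write(word + "\n")
```

The trade-off is that the projector can then only label and search points by word, not color them by part of speech.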

Yes, we can do features. Would that be helpful? Just modify my :apply function to pull out that lookup table as well, and use the feature dict to get the names.

jean.senellart · February 27, 2017, 10:48pm

Super cool !

Etienne38 · February 28, 2017, 8:56am

I’m experimenting with multiplexed streams, like done here:

Thus, the feature streams are nearly as important as the main word stream. To understand what the network is doing with them, I need to have a look at the learned embeddings.

I succeeded with the very first step, but after that, nothing seems that simple…

srush · February 28, 2017, 5:15pm

Maybe bother me on https://gitter.im/OpenNMT/openmt I can help out if you show me what you have…

Etienne38 · February 28, 2017, 5:39pm

Knowing this…

…having an export of ONMT embeddings is of much less interest to me. I would rather go back to the code I started writing to build my own embeddings, and configure ONMT to use them with the fixed embeddings option.

srush · February 28, 2017, 9:22pm

Not sure I understand. How were you expecting word vectors to work?

Etienne38 · March 1, 2017, 6:34am

I expected them to define a good data topology in input/output spaces, as explained here:

srush · March 1, 2017, 4:19pm

Oh I see. Yeah, embeddings learned in a (bi)RNN will have different properties than word2vec. I am not sure you can say one is better or worse. One of the strengths of word2vec is that it is a mostly linear model, so the relationships are a bit more interpretable.

Etienne38 · March 1, 2017, 4:59pm

Word2vec builds a lookup table, and I don't see why there should be anything linear about it. Did I misunderstand something?

srush · March 1, 2017, 10:53pm

Its learning objective is bilinear in the embeddings, whereas an RNN is highly non-linear.
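To make the bilinear point concrete, here is a sketch assuming the skip-gram with negative sampling variant of word2vec (the thread does not say which variant is meant, and the notation is my own): v_c is the input embedding of the center word c, u_w is the output embedding of a word w, o is an observed context word, and n_1, ..., n_K are K sampled negative words.

```latex
s(c, w) = u_w^{\top} v_c
\qquad
\ell(c, o) = \log \sigma\big(s(c, o)\big) + \sum_{k=1}^{K} \log \sigma\big(-s(c, n_k)\big)
```

The embeddings enter the objective only through the score s, which is linear in v_c for fixed u_w and linear in u_w for fixed v_c. In an RNN encoder, by contrast, the embeddings pass through stacked nonlinearities before they reach the loss.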

msobroza · March 29, 2017, 3:48pm

I had one problem:

ValueError: Cannot create a tensor proto whose content is larger than 2GB.

Does anyone know how to solve this?

Thank you!
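This error probably comes from tf.Variable(src_values) in the script above: the whole NumPy array is stored as a constant in the graph, and the graph is serialized as a protocol buffer with a 2GB cap. Note that np.array built from Python floats defaults to float64, so casting to float32 halves the size. A minimal sketch for estimating how large an embeddings file becomes in memory; the helper name embedding_bytes is hypothetical and not part of any OpenNMT or TensorFlow API.

```python
def embedding_bytes(filename, bytes_per_value=8):
    """Estimate the in-memory size of a "<word> <val1> <val2> ..." file.

    bytes_per_value=8 corresponds to float64, 4 to float32.
    """
    rows = 0
    dims = 0
    with open(filename, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            rows += 1
            dims = max(dims, len(tokens) - 1)  # first token is the word
    return rows * dims * bytes_per_value


# 2GB protobuf limit, for comparison with the estimate.
PROTO_LIMIT_BYTES = 2 * 1024 ** 3
```

If embedding_bytes(path) exceeds PROTO_LIMIT_BYTES but embedding_bytes(path, 4) does not, casting the array to float32 before creating the variable may be enough. Otherwise, visualizing only a subset of the vocabulary is one option.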

lucien0410 · March 21, 2018, 5:54pm

Hi,

Is this tool for extracting word embeddings ONLY implemented in the Lua version of OpenNMT? Does the Python version have the same tool? Any clue? Thanks!

srush · March 21, 2018, 6:48pm

github.com/OpenNMT/OpenNMT-py/blob/master/tools/extract_embeddings.py

from __future__ import division
import torch
import argparse
import opts
import onmt
import onmt.ModelConstructor
import onmt.io
from onmt.Utils import use_gpu

parser = argparse.ArgumentParser(description='translate.py')

parser.add_argument('-model', required=True,
                    help='Path to model .pt file')
parser.add_argument('-output_dir', default='.',
                    help="""Path to output the embeddings""")
parser.add_argument('-gpu', type=int, default=-1,
                    help="Device to run on")


def write_embeddings(filename, dict, embeddings):

(This file has been truncated.)

lucien0410 · March 21, 2018, 8:13pm

Thanks! This script is indeed in the 'tools' folder ('/OpenNMT-py/tools'). I just couldn't find it mentioned in the documentation (http://opennmt.net/OpenNMT-py/main.html).

Thank you so much anyway!

rrifaldiu · July 27, 2020, 1:22pm

Hello. I am currently using OpenNMT-tf. Is this feature also available in OpenNMT-tf? Thanks in advance.

guillaumekln · July 27, 2020, 1:27pm

Yes, embeddings can be visualized in TensorBoard during the training. Just start a TensorBoard instance and click on “Projector”.

rrifaldiu · July 27, 2020, 1:38pm

How about the extraction? Can we extract the embeddings from the system?