New Feature: Embeddings and Visualization

The first line of each metadata TSV file needs to contain the column labels. Without it, all lines are shifted by one, and the viewing tool complains about a difference of 1 entry between the files.

import nltk

def write_metadata(filename, words):
    with open(filename, 'w') as w:
        # Header line: the projector expects column labels on the first row.
        w.write("Word\tPOS\n")
        for word in words:
            # Tag each word and keep only the coarse POS prefix (e.g. "NN", "VB").
            w.write(word + "\t" + nltk.pos_tag([word])[0][1][:2] + "\n")

Oh good point. I will have it automatically force CPU mode in extract_embeddings.
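For the curious, "forcing CPU mode" amounts to remapping the checkpoint's tensors at load time. A minimal sketch of what that means in PyTorch terms (the checkpoint path is a placeholder):

    import torch

    # map_location="cpu" remaps tensors saved on GPU onto the CPU,
    # so the embeddings can be read without a CUDA device.
    checkpoint = torch.load("model.pt", map_location="cpu")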

The nltk step is slow as well… Maybe I will remove that from the visualizer.

Yes, we can do features. Would that be helpful? Just modify my :apply function to pull out that lookup table as well, and use the feature dict to get the names.

Super cool :thumbsup: !

I’m experimenting with multiplexed streams, as done here:

Thus, the feature streams are nearly as important as the main word stream. To understand what the network is doing with them, I need to have a look at the built embeddings.
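For context, OpenNMT multiplexes these streams by attaching features to each token with the ￨ separator (a special character, not the ASCII pipe), so an annotated stream looks roughly like this (tokens and POS tags made up):

    the￨DT cat￨NN sat￨VB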

I succeeded with the very first step, but after that, it all turned out to be not so simple…
:stuck_out_tongue:

Maybe bother me on https://gitter.im/OpenNMT/openmt. I can help out if you show me what you have…

Knowing this…

…having an export of ONMT embeddings is of much less interest to me. I would rather now get back to the code I started writing to build my own embeddings, and configure ONMT to use them with the fixed embeddings option.

:slight_smile:

Not sure I understand. How were you expecting word vectors to work?

I expected them to define a good data topology in input/output spaces, as explained here:


:wink:

Oh I see. Yeah, embeddings learned in a (bi)RNN will have different properties than word2vec. I am not sure if you can say one is better or worse. One of the powers of word2vec is that it is a mostly linear model, so the relationships are a bit more interpretable.

Word2vec builds a lookup table, and I don’t see why there should be anything linear about it. Did I misunderstand something?

Its learning objective is bilinear in the embeddings, whereas an RNN is highly non-linear.
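Concretely, in skip-gram with negative sampling the score of a (center, context) pair is a plain dot product, so the objective is linear in each embedding with the other held fixed. A sketch of the standard objective, with $v_c$ the center-word vector, $u_o$ the context vector, and $u_k$ the negative samples:

    \log \sigma(u_o^\top v_c) + \sum_{k=1}^{K} \log \sigma(-u_k^\top v_c)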

I had one problem:

ValueError: Cannot create a tensor proto whose content is larger than 2GB.

Does anyone know how I can solve this?

Thank you!
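This error typically means a large embedding matrix is being serialized into the graph as a constant, which hits the 2 GB protobuf limit. A common workaround in TensorFlow 1.x is to initialize the variable through a placeholder and feed the array at init time; a sketch, with the file name and shapes made up:

    import numpy as np
    import tensorflow as tf

    embeddings = np.load("embeddings.npy")  # large (vocab_size, dim) float32 matrix

    # Initializing the Variable from a placeholder keeps the array
    # out of the GraphDef, avoiding the 2GB tensor proto limit.
    emb_ph = tf.placeholder(tf.float32, shape=embeddings.shape)
    emb_var = tf.Variable(emb_ph, trainable=False, name="embeddings")

    with tf.Session() as sess:
        sess.run(emb_var.initializer, feed_dict={emb_ph: embeddings})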

Hi,

Is this tool for extracting word embeddings ONLY implemented in the Lua version of OpenNMT? Does the Python version have the same tool? Any clue? Thanks!

Thanks! This script is indeed in the ‘tools’ folder (‘/OpenNMT-py/tools’). I just don’t find it mentioned in the documentation (http://opennmt.net/OpenNMT-py/main.html).
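For the next person looking: the invocation is roughly the following, though the flag names here are from memory, so check the script’s --help for the exact ones:

    python tools/extract_embeddings.py -model model.pt -output_dir embeddings/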

Thank you so much anyway!

Hello. I am currently using OpenNMT-tf. I wonder, is this feature also available in OpenNMT-tf? Thanks in advance.

Yes, embeddings can be visualized in TensorBoard during training. Just start a TensorBoard instance and click on “Projector”.
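i.e. point TensorBoard at the directory the run writes its summaries to (the directory name here is a placeholder):

    tensorboard --logdir run/model_dir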

How about extraction? Can we extract the embeddings from the system?

See:

Oh, okay thank you so much!