Simple guide for custom pre-trained embeddings

(Netxiao) #1

## Pre-trained embeddings

When training with small amounts of data, performance can be improved
by starting with pretrained embeddings. The arguments
-pre_word_vecs_dec and -pre_word_vecs_enc can be used to specify
these files. The pretrained embeddings must be manually constructed
torch serialized matrices that correspond to the src and tgt
dictionary files. By default these embeddings will be updated during
training, but they can be held fixed using -fix_word_vecs_enc and
-fix_word_vecs_dec.
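
As a rough illustration of what "correspond to the src and tgt dictionary files" means, the sketch below builds one row per dictionary entry, in dictionary order, and fills words that have no pretrained vector with small random values. It is plain Python rather than Torch. The dictionary layout (one `word index` pair per line), the helper name `build_matrix`, and the random range are assumptions for illustration, not the toolkit's actual code.

```python
import random

def build_matrix(dict_lines, vectors, dim, seed=0):
    """Return one row per dictionary entry, in dictionary order."""
    rng = random.Random(seed)
    rows = []
    for line in dict_lines:
        line = line.strip()
        if not line:
            continue
        word = line.split()[0]  # assumed layout: "<word> <index>"
        if word in vectors:
            rows.append(list(vectors[word]))
        else:
            # No pretrained vector: fall back to a small random row (assumption).
            rows.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return rows

# Hypothetical three-entry dictionary and two pretrained vectors.
dict_lines = ["<unk> 1", "hello 2", "world 3"]
vectors = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}
matrix = build_matrix(dict_lines, vectors, dim=2)
print(len(matrix), matrix[1], matrix[2])
```

The point is only the row alignment: row i of the matrix must describe entry i of the dictionary, otherwise the model starts from vectors attached to the wrong words.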

The detailed configuration procedure is as follows:

## Install Google word2vec

download word2vec from
tar -zxf word2vec.tgz
cd word2vec

Train the word vectors and write them out as binary files.

nohup word2vec -train ../train.src.tok  -output vectors.src.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 2 -binary 1 > src.log &

nohup word2vec -train ../train.tgt.tok  -output vectors.tgt.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 2 -binary 1 > tgt.log &
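
For readers who want to inspect the resulting .bin files, here is a minimal Python sketch of a reader for the layout word2vec writes with -binary 1: a text header with the vocabulary size and vector size, then each word followed by a space and its raw 32-bit floats. The function name `read_word2vec_binary` is hypothetical, and little-endian floats are an assumption that holds on typical x86 machines. The sample data is built in memory so the block runs on its own.

```python
import io
import struct

def read_word2vec_binary(stream):
    """Parse the word2vec -binary 1 layout into a {word: [floats]} dict."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline that ends the previous entry
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="replace")
        data = stream.read(4 * dim)
        if len(data) != 4 * dim:
            raise ValueError("file ended in the middle of a vector")
        # Assumption: little-endian 32-bit floats.
        vectors[word] = list(struct.unpack("<%df" % dim, data))
    return vectors

# Tiny in-memory file in the same layout, for demonstration only.
sample = io.BytesIO()
sample.write(b"2 3\n")
for word, vec in [("cat", [1.0, 2.0, 3.0]), ("dog", [0.5, -0.5, 0.25])]:
    sample.write(word.encode("utf-8") + b" " + struct.pack("<3f", *vec) + b"\n")
sample.seek(0)

vectors = read_word2vec_binary(sample)
print(vectors["dog"])
```

To read a real file, open it with `open("vectors.src.bin", "rb")` and pass the file object instead of `sample`.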

Convert the binary files to a Torch-readable file format

--install zlib
sudo apt-get install  zlibc zlib1g zlib1g-dev 
luarocks install lzlib 


--convert file
th tools/embedding_convert.lua -embed_type word2vec -embed_file vectors.src.bin -dict_file dict_src -save_data train/vectors.src.t7

th tools/embedding_convert.lua -embed_type word2vec -embed_file vectors.tgt.bin -dict_file dict_tgt -save_data train/vectors.tgt.t7


(Etienne Monneret) #2

Did I miss something? I couldn’t find how to retrieve this code…(!?)

Remark: I have DL4J Word2vec installed. I will give it a try…

(Etienne Monneret) #3

The SVN link provided on the page seems not to work. But, following the “source” link, this direct archive download link worked:

(Nguyen Tuan Phong) #4

This GitHub repo was automatically exported from


(srush) #5

Here’s a link for many different languages. Would be neat to support this format.

(Etienne Monneret) #6

embedding_convert no longer works with the latest version of ONMT, because this file is missing:

Of course, it still works with my older ONMT installation.

(Sergiu) #7

Based on the original script I have made another one that concatenates two pre-trained word2vec embeddings.
But you can also use it to convert word2vec embeddings like so:
th tools/concat_embedding.lua -dict_file $PATH_TO_DICT -global_embed $PATH_TO_EMBED -save_data $PATH_TO_SAVE

The script is meant to concatenate a global embedding (e.g. one trained on out-of-domain data such as Google News) with a local one trained on the corpus.
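
To make the idea of concatenation concrete, here is a small Python sketch that joins a global and a local vector for each dictionary word. It does not reproduce what concat_embedding.lua actually does. The function name `concat_embeddings` and the choice to zero-fill a word missing from one of the two tables are assumptions for illustration.

```python
def concat_embeddings(words, global_vecs, local_vecs, global_dim, local_dim):
    """Concatenate global and local vectors word by word."""
    combined = {}
    for word in words:
        # Assumption: a word missing from one table gets zeros for that half.
        g = global_vecs.get(word, [0.0] * global_dim)
        l = local_vecs.get(word, [0.0] * local_dim)
        combined[word] = list(g) + list(l)
    return combined

# Hypothetical toy tables: a 3-dimensional global and a 2-dimensional local one.
global_vecs = {"bank": [0.1, 0.2, 0.3]}
local_vecs = {"bank": [0.9, 0.8], "loan": [0.7, 0.6]}
combined = concat_embeddings(["bank", "loan"], global_vecs, local_vecs, 3, 2)
print(combined["bank"])
print(combined["loan"])
```

Note that the resulting vector size is the sum of the two input sizes, so the word embedding size used at training time has to match that sum.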

(srush) #8

Thanks, I filed a bug report.

(Zex Ceedd) #9

Still new to lua.
Having problems with zlib, using the PR convert embedding code :frowning:

error loading module 'zlib' from file '/usr/local/lib/lua/5.1/':
	/usr/local/lib/lua/5.1/ undefined symbol: lua_tointeger
stack traceback:
	[C]: in function 'error'
	/home/user/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
	tools/embedding_convert.lua:4: in main chunk
	[C]: in function 'dofile'
	...olas/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: in ?

(Guillaume Klein) #10

There is an updated version here:

Hopefully we can merge this script someday. :slight_smile:

(Guillaume Klein) #11

The script is finally merged into master. :tada:

You will find the related documentation here and the options summary here.

(Mihael) #12


I am trying to use pretrained embeddings in the training step, but I get an unusual perplexity value: “… Perplexity nan”. Any idea where I could have gone wrong? The train command looks like:

th train.lua -data data/en_de/default/preprocess/en_de_80seq_voc100k_v2-train.t7 -save_model data/en_de/default/model/en_de_80seq_brnn_voc100k_wiki500we_dynamic2 -pre_word_vecs_enc data/en_de/default/preprocess/vectors.wiki_abstracts.en-embeddings-500.t7 -pre_word_vecs_dec data/en_de/default/preprocess/ -brnn -gpuid 1

(Guillaume Klein) #13


Did you generate the word embedding package with the script above?
Does the nan perplexity appear immediately or after a few iterations?

(Mihael) #14

Hi, yes, I used all the scripts from onmt v6. The nan appeared immediately after the first iteration and didn’t change even after several epochs.

(Guillaume Klein) #15

There was actually an issue when reading word2vec files:

Didn’t other people encounter issues when loading word2vec files?
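
One way to narrow down a nan perplexity like this is to check the converted embeddings for non-finite values before training. Below is a minimal sketch in plain Python, assuming the matrix has already been loaded as a list of rows. `find_bad_rows` is a hypothetical helper, not part of ONMT, and a clean result does not rule out other causes.

```python
import math

def find_bad_rows(matrix):
    """Return indices of rows containing NaN or infinite values."""
    return [i for i, row in enumerate(matrix)
            if any(math.isnan(x) or math.isinf(x) for x in row)]

# Toy matrix with one NaN row and one infinite row.
matrix = [[0.1, 0.2], [float("nan"), 0.3], [0.4, float("inf")]]
bad = find_bad_rows(matrix)
print(bad)
```

Any index reported here points at a dictionary entry whose vector was probably misread from the source file.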

(Mihael) #16

Thanks, I will look into it when my GPU is free again.

(SeisQ) #17

Not an immediate answer to your question @ZexCeedd, but somewhat related.

I also ran into a zlib error when running the th tools/embeddings.lua script:
stack traceback:
module ‘zlib’ not found:No LuaRocks module found for zlib
[C]: in function ‘error’
/home/usr/torch/install/share/lua/5.1/trepl/init.lua:389: in function ‘require’
tools/embeddings.lua:5: in main chunk
[C]: in function ‘dofile’
…ooai/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

I ran:
$ luarocks install lua-zlib

and that seemed to have fixed it.

(Micahliu153) #18

luarocks install lzlib

Error: Could not find library file for ZLIB
No file libz.a in /usr/lib
No file in /usr/lib
No file matching* in /usr/lib
You may have to install ZLIB in your system and/or pass ZLIB_DIR or ZLIB_LIBDIR to the luarocks command.
Example: luarocks install lzlib ZLIB_DIR=/usr/local

What’s the problem?

(jean.senellart) #19

@micaliu153, (a bit late) you do need to install zlib on your system first. On Ubuntu, apt-get install zlib1g-dev (as in the first post) will work.