Simple guide for custom pre-trained embeddings

netxiao · February 16, 2017, 2:49pm

http://opennmt.net/Advanced/#pre-trained-embeddings

## Pre-trained embeddings

When training with small amounts of data, performance can be improved
by starting with pretrained embeddings. The arguments
-pre_word_vecs_dec and -pre_word_vecs_enc can be used to specify
these files. The pretrained embeddings must be manually constructed
torch serialized matrices that correspond to the src and tgt
dictionary files. By default these embeddings will be updated during
training, but they can be held fixed using -fix_word_vecs_enc and
-fix_word_vecs_dec.
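
To make "correspond to the src and tgt dictionary files" concrete, here is a minimal sketch of the alignment idea in plain Python. It is an illustration under assumptions, not what OpenNMT itself runs: the function name `align_to_dictionary`, the toy dictionary, and the random initialisation range for words missing from the pre-trained set are all made up for the example, and the real files have to be Torch-serialized tensors, which this sketch does not write.

```python
import random

def align_to_dictionary(dictionary, pretrained, dim, seed=0):
    """Build one vector per dictionary entry, in dictionary order.

    dictionary: list of words, position i is the word with index i.
    pretrained: dict mapping word -> list of floats of length dim.
    Returns (rows, found), where found counts words taken from pretrained.
    """
    rng = random.Random(seed)
    rows = []
    found = 0
    for word in dictionary:
        vec = pretrained.get(word)
        if vec is None:
            # Assumption: words without a pre-trained vector get small
            # random values; the range is arbitrary for this example.
            vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        else:
            found += 1
        rows.append(list(vec))
    return rows, found

# Toy data (hypothetical): row order must follow the dictionary order.
dictionary = ["<blank>", "<unk>", "<s>", "</s>", "the", "cat"]
pretrained = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.5, 0.6],
    "dog": [0.7, 0.8, 0.9],  # not in the dictionary, so it is ignored
}
matrix, found = align_to_dictionary(dictionary, pretrained, dim=3)
print(len(matrix), "rows,", found, "taken from the pre-trained vectors")
```

The point is only that row i of the matrix belongs to dictionary entry i, so the matrix has to be rebuilt whenever the dictionary changes.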

The detailed configuration procedure is as follows:

## Install Google word2vec

Download word2vec from https://code.google.com/archive/p/word2vec/
wget https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip
unzip source-archive.zip
cd word2vec  # or word2vec/trunk, depending on the archive layout
make

Train the word vectors and write them out as binary files:

nohup ./word2vec -train ../train.src.tok -output vectors.src.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 2 -binary 1 > src.log &

nohup ./word2vec -train ../train.tgt.tok -output vectors.tgt.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 2 -binary 1 > tgt.log &
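
For anyone who wants to inspect the `-binary 1` output before converting it, the layout is simple: a text header line with the vocabulary size and the vector size, then for each word the word itself, a space, and the vector as raw 32-bit floats. Below is a minimal Python sketch of a reader. It assumes little-endian floats (what you get on x86) and a newline after each record; `read_word2vec_binary` is a name invented for this example, and the tiny in-memory file is fabricated test data.

```python
import io
import struct

def read_word2vec_binary(stream):
    """Parse the word2vec binary layout from a binary file object."""
    header = stream.readline().decode("ascii").split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("file ended inside a word")
            if ch == b" ":
                break  # the space separates the word from its vector
            if ch != b"\n":  # skip the newline left over from the previous record
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="replace")
        raw = stream.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("file ended inside a vector")
        # Assumption: little-endian 32-bit floats.
        vectors[word] = struct.unpack("<%df" % dim, raw)
    return vectors

# Build a two-word file in memory instead of reading vectors.src.bin.
blob = b"2 3\n"
blob += b"the " + struct.pack("<3f", 0.5, -1.0, 2.0) + b"\n"
blob += b"cat " + struct.pack("<3f", 1.5, 0.25, -0.75) + b"\n"
vectors = read_word2vec_binary(io.BytesIO(blob))
print(sorted(vectors))
```

With a real file you would pass `open("vectors.src.bin", "rb")` instead of the `BytesIO` object.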

Convert the binary files to a Torch-readable format:

--install zlib
sudo apt-get install zlibc zlib1g zlib1g-dev
luarocks install lzlib

--download the conversion script into tools/
wget -P tools https://raw.githubusercontent.com/jroakes/OpenNMT/master/tools/embedding_convert.lua

--convert file
th tools/embedding_convert.lua -embed_type word2vec -embed_file vectors.src.bin -dict_file dict_src -save_data train/vectors.src.t7

th tools/embedding_convert.lua -embed_type word2vec -embed_file vectors.tgt.bin -dict_file dict_tgt -save_data train/vectors.tgt.t7

ref:
https://github.com/jroakes/OpenNMT/tree/master/tools#embedding-conversion
https://github.com/OpenNMT/OpenNMT/issues/23

Etienne38 · February 17, 2017, 2:10pm

Did I miss something? I couldn't find how to retrieve this code.

Note: I have DL4J Word2vec installed. I will give it a try.

Etienne38 · February 17, 2017, 2:46pm

The SVN link provided on the page does not seem to work. However, following the "source" link, this direct archive download link worked:
https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip

phongnt · February 25, 2017, 9:54am

This github repo was automatically exported from code.google.com/p/word2vec:

https://github.com/tmikolov/word2vec

srush · February 28, 2017, 9:23pm

Here's a link to pre-trained vectors for many different languages. It would be neat to support this format: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
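
For reference, the text `.vec` files in that format start with a line giving the number of words and the vector size, followed by one line per word with the word and its values separated by spaces. A rough Python sketch of a reader is below; `read_text_vectors` is a hypothetical helper written for this post, not part of OpenNMT or fastText, and the sample data is invented.

```python
import io

def read_text_vectors(lines):
    """Read 'count dim' header plus 'word v1 ... vN' lines into a dict."""
    it = iter(lines)
    count, dim = (int(x) for x in next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors, count

# Invented sample in the same layout; a real file would be opened with
# open(path, encoding="utf-8") and passed in the same way.
sample = io.StringIO("2 3\nle 0.1 0.2 0.3\nchat -0.4 0.5 0.6 \n")
vectors, declared = read_text_vectors(sample)
print(declared, sorted(vectors))
```

The declared count is returned separately so it can be compared with the number of vectors actually read.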

Etienne38 · March 1, 2017, 1:42pm

embedding_convert no longer works with the latest version of ONMT, because this file is missing:
onmt/utils/Opt.lua

Of course, it still works with my older ONMT installation.

senisioi · March 2, 2017, 9:30am

Based on the original script I have made another one that concatenates two pre-trained word2vec embeddings.
But you can also use it to convert word2vec embeddings like so:
th tools/concat_embedding.lua -dict_file $PATH_TO_DICT -global_embed $PATH_TO_EMBED -save_data $PATH_TO_SAVE

The script is meant to concatenate a global embedding (e.g. one trained on out-of-domain data such as Google News) with a local one trained on the corpus.
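
To illustrate the idea only (this is a Python sketch, not the Lua script itself): each dictionary word gets its global vector followed by its local vector. Filling missing vectors with zeros is an assumption made for this example and may differ from what the script does; `concat_embeddings` and the toy data are hypothetical.

```python
def concat_embeddings(dictionary, global_vecs, local_vecs, global_dim, local_dim):
    """Return one row per dictionary word: global vector + local vector."""
    rows = []
    for word in dictionary:
        # Assumption: a word missing from one source gets zeros for that part.
        g = global_vecs.get(word, [0.0] * global_dim)
        loc = local_vecs.get(word, [0.0] * local_dim)
        rows.append(list(g) + list(loc))
    return rows

dictionary = ["the", "cat", "rare"]
global_vecs = {"the": [1.0, 2.0], "cat": [3.0, 4.0]}     # e.g. out-of-domain
local_vecs = {"the": [0.5], "cat": [0.6], "rare": [0.7]}  # trained on the corpus
rows = concat_embeddings(dictionary, global_vecs, local_vecs, 2, 1)
print(rows)
```

The resulting vector size is the sum of the two input sizes, so the word embedding size used for training has to match that sum.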

srush · March 2, 2017, 1:45pm

Thanks, I filed a bug report.

ZexCeedd · March 19, 2017, 8:32pm

I'm still new to Lua.
I'm having problems with zlib when using the embedding conversion code from the PR:

error loading module 'zlib' from file '/usr/local/lib/lua/5.1/zlib.so':
	/usr/local/lib/lua/5.1/zlib.so: undefined symbol: lua_tointeger
stack traceback:
	[C]: in function 'error'
	/home/user/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
	tools/embedding_convert.lua:4: in main chunk
	[C]: in function 'dofile'
	...olas/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: in ?

guillaumekln · April 6, 2017, 1:27pm

There is an updated version here:

https://github.com/OpenNMT/OpenNMT/pull/191

Hopefully we can merge this script someday.

guillaumekln · April 18, 2017, 1:01pm

The script is finally merged into master.

You will find the related documentation here and the options summary here.

miharc · April 19, 2017, 4:42pm

Hi,

I am trying to use pretrained embeddings in the training step, but I get unusual perplexity values: "... Perplexity nan". Any idea where I might have gone wrong? The train command looks like this:

th train.lua -data data/en_de/default/preprocess/en_de_80seq_voc100k_v2-train.t7 -save_model data/en_de/default/model/en_de_80seq_brnn_voc100k_wiki500we_dynamic2 -pre_word_vecs_enc data/en_de/default/preprocess/vectors.wiki_abstracts.en-embeddings-500.t7 -pre_word_vecs_dec data/en_de/default/preprocess/vectors.wiki_abstracts.de-embeddings-500.t7 -brnn -gpuid 1

guillaumekln · April 19, 2017, 8:52pm

Hi,

Did you generate the word embedding package with the script above?
Does the nan perplexity appear immediately or after a few iterations?

miharc · April 19, 2017, 9:24pm

Hi, yes, I used all the scripts from onmt v6. The nan appeared immediately after the first iteration and didn't change even after several epochs.

guillaumekln · May 2, 2017, 10:05am

There was actually an issue when reading word2vec files:

Has anyone else run into issues when loading word2vec files?
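
If you want to rule out bad values in the vectors themselves, one generic check is to scan the loaded matrix for NaN or infinite entries before training. The Python sketch below is only a diagnostic idea and says nothing about the actual cause of the issue; `find_bad_rows` and the sample matrix are made up for illustration.

```python
import math

def find_bad_rows(matrix):
    """Return indices of rows that contain NaN or infinite values."""
    return [
        i for i, row in enumerate(matrix)
        if any(not math.isfinite(v) for v in row)
    ]

# Invented data: rows 1 and 2 contain non-finite values.
matrix = [
    [0.1, 0.2],
    [float("nan"), 0.3],
    [0.0, float("inf")],
    [1.0, -1.0],
]
bad = find_bad_rows(matrix)
print(bad)
```

An empty result means the values are at least finite, which narrows the search to the training setup rather than the embedding file.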

miharc · May 2, 2017, 5:45pm

Thanks, I will look into it when my GPU is free again.

seisqui · September 15, 2017, 8:39am

Not an immediate answer to your question @ZexCeedd, but somewhat related.

I also ran into a zlib error when running the th tools/embeddings.lua script:
stack traceback:
module 'zlib' not found:No LuaRocks module found for zlib
[C]: in function 'error'
/home/usr/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
tools/embeddings.lua:5: in main chunk
[C]: in function 'dofile'
...ooai/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

I ran:
$ luarocks install lua-zlib

and that seems to have fixed it.

micaliu153 · May 2, 2018, 4:11pm

luarocks install lzlib
Installing https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master/lzlib-0.4.1.53-1.src.rock

Error: Could not find library file for ZLIB
No file libz.a in /usr/lib
No file libz.so in /usr/lib
No file matching libz.so.* in /usr/lib
You may have to install ZLIB in your system and/or pass ZLIB_DIR or ZLIB_LIBDIR to the luarocks command.
Example: luarocks install lzlib ZLIB_DIR=/usr/local

What's the problem?

jean.senellart · May 23, 2018, 12:49pm

@micaliu153, (a bit late) you need to install zlib on your system first. On Ubuntu, apt-get install zlib1g-dev should work.