Simple guide for custom pre-trained embeddings

http://opennmt.net/Advanced/#pre-trained-embeddings

## Pre-trained embeddings

When training with small amounts of data, performance can be improved
by starting with pretrained embeddings. The arguments
-pre_word_vecs_dec and -pre_word_vecs_enc can be used to specify
these files. The pretrained embeddings must be manually constructed
torch serialized matrices that correspond to the src and tgt
dictionary files. By default these embeddings will be updated during
training, but they can be held fixed using -fix_word_vecs_enc and
-fix_word_vecs_dec.

The detailed configuration procedure is as follows:

## Install Google word2vec

Download word2vec from https://code.google.com/archive/p/word2vec/:

wget https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip
unzip source-archive.zip
cd word2vec/trunk
make

Train word vectors and write the binary output files:

nohup ./word2vec -train ../train.src.tok -output vectors.src.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 2 -binary 1 > src.log &

nohup ./word2vec -train ../train.tgt.tok -output vectors.tgt.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 2 -binary 1 > tgt.log &
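With -binary 1, word2vec writes its binary format: a text header line "vocab_size dim", then for each word the token, a space, and dim little-endian float32 values (the original C tool also appends a newline after each vector). A small Python sketch that writes and re-reads a toy file of this shape, just to illustrate the layout the converter has to parse; the tiny vocabulary here is made up:

```python
import io
import struct

def write_word2vec_bin(buf, vectors):
    """Write {word: [float, ...]} in word2vec binary format.

    Note: this sketch omits the per-row trailing newline the original
    C tool emits; a reader for real files should skip that byte.
    """
    dim = len(next(iter(vectors.values())))
    buf.write(f"{len(vectors)} {dim}\n".encode("utf-8"))
    for word, vec in vectors.items():
        buf.write(word.encode("utf-8") + b" ")
        buf.write(struct.pack(f"<{dim}f", *vec))

def read_word2vec_bin(buf):
    """Read the same format back into {word: [float, ...]}."""
    vocab_size, dim = map(int, buf.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # the word is everything up to the next space
        word = bytearray()
        while True:
            ch = buf.read(1)
            if ch == b" ":
                break
            word.extend(ch)
        vec = struct.unpack(f"<{dim}f", buf.read(4 * dim))
        vectors[word.decode("utf-8")] = list(vec)
    return vectors

buf = io.BytesIO()
write_word2vec_bin(buf, {"the": [1.0, 2.0], "cat": [3.0, 4.0]})
buf.seek(0)
print(read_word2vec_bin(buf))
```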

Convert the binary files to a Torch-readable format:

# install zlib
sudo apt-get install zlibc zlib1g zlib1g-dev
luarocks install lzlib


wget https://raw.githubusercontent.com/jroakes/OpenNMT/master/tools/embedding_convert.lua 

# convert the files
th tools/embedding_convert.lua -embed_type word2vec -embed_file vectors.src.bin -dict_file dict_src -save_data train/vectors.src.t7

th tools/embedding_convert.lua -embed_type word2vec -embed_file vectors.tgt.bin -dict_file dict_tgt -save_data train/vectors.tgt.t7
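Conceptually, the conversion builds a matrix with one row per word in the OpenNMT dictionary, in dictionary order: words found in the word2vec file get their pretrained vector, the rest get a small random initialization. The real script is Lua and reads Torch dictionary/tensor formats; this Python sketch of the alignment step uses made-up in-memory structures purely for illustration:

```python
import random

def align_embeddings(dict_words, pretrained, dim, seed=0):
    """Build a row-per-dictionary-word embedding matrix.

    dict_words: list of words in OpenNMT dictionary order.
    pretrained: {word: [float, ...]} loaded from the word2vec file.
    Words missing from `pretrained` get small random values; the
    exact fallback distribution here is an assumption.
    """
    rng = random.Random(seed)
    matrix, hits = [], 0
    for word in dict_words:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
            hits += 1
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    print(f"matched {hits}/{len(dict_words)} dictionary words")
    return matrix

matrix = align_embeddings(["the", "cat", "zzz"],
                          {"the": [0.5, 0.5], "cat": [0.1, 0.9]},
                          dim=2)
```

The row order is what matters: row i of the saved tensor must correspond to index i of the src/tgt dictionary, otherwise training silently uses the wrong vectors.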

ref:
https://github.com/jroakes/OpenNMT/tree/master/tools#embedding-conversion
https://github.com/OpenNMT/OpenNMT/issues/23


Did I miss something? I couldn't find where to retrieve this code… (!?)

Remark: I have DL4J Word2vec installed. I will give it a try…

The SVN link provided on the page seems not to work. But, following the "source" link, this direct archive download link worked:
https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip
:slight_smile:

This github repo was automatically exported from code.google.com/p/word2vec:

https://github.com/tmikolov/word2vec

:grinning:

Here's a link with pre-trained vectors for many different languages. It would be neat to support this format: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

embedding_convert no longer works with the latest version of OpenNMT, because this file has been removed:
onmt/utils/Opt.lua

Of course, it still works with my previous, older OpenNMT installation.
:stuck_out_tongue:


Based on the original script, I have made another one that concatenates two pre-trained word2vec embeddings.
You can also use it to convert a single word2vec embedding like so:
th tools/concat_embedding.lua -dict_file $PATH_TO_DICT -global_embed $PATH_TO_EMBED -save_data $PATH_TO_SAVE

The script is meant to concatenate a global embedding (e.g. trained on out-of-domain data such as Google News) with a local one trained on the corpus.
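The idea is to join, for each dictionary word, the global vector and the local vector end to end, giving an embedding of size dim_global + dim_local. A hedged Python sketch of that step (the zero fallback for words missing from one source is my assumption, not necessarily what the Lua script does):

```python
def concat_embeddings(dict_words, global_vecs, local_vecs):
    """Concatenate a global and a local vector per dictionary word.

    Words missing from either source fall back to zeros of that
    source's dimensionality, so every row has the same width.
    """
    g_dim = len(next(iter(global_vecs.values())))
    l_dim = len(next(iter(local_vecs.values())))
    rows = []
    for word in dict_words:
        g = global_vecs.get(word, [0.0] * g_dim)
        l = local_vecs.get(word, [0.0] * l_dim)
        rows.append(list(g) + list(l))
    return rows

rows = concat_embeddings(["cat"],
                         {"cat": [0.1, 0.2]},   # e.g. Google News vectors
                         {"cat": [0.9]})        # e.g. in-domain vectors
print(rows)  # → [[0.1, 0.2, 0.9]]
```

Note that -word_vec_size (or the preprocessed embedding size) must then match the concatenated width.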


Thanks, I filed a bug report.

Still new to Lua.
Having problems with zlib when using the embedding conversion code from the PR :frowning:

error loading module 'zlib' from file '/usr/local/lib/lua/5.1/zlib.so':
	/usr/local/lib/lua/5.1/zlib.so: undefined symbol: lua_tointeger
stack traceback:
	[C]: in function 'error'
	/home/user/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
	tools/embedding_convert.lua:4: in main chunk
	[C]: in function 'dofile'
	...olas/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: in ?

There is an updated version here:

https://github.com/OpenNMT/OpenNMT/pull/191

Hopefully we can merge this script someday. :slight_smile:


The script is finally merged into master. :tada:

You will find the related documentation here and the options summary here.


Hi,

I am trying to use pretrained embeddings in the training step, but I get unusual perplexity values ("… Perplexity nan"). Any ideas where I could have gone wrong? The train command looks like:

th train.lua -data data/en_de/default/preprocess/en_de_80seq_voc100k_v2-train.t7 -save_model data/en_de/default/model/en_de_80seq_brnn_voc100k_wiki500we_dynamic2 -pre_word_vecs_enc data/en_de/default/preprocess/vectors.wiki_abstracts.en-embeddings-500.t7 -pre_word_vecs_dec data/en_de/default/preprocess/vectors.wiki_abstracts.de-embeddings-500.t7 -brnn -gpuid 1

Hi,

Did you generate the word embedding package with the script above?
Does the nan perplexity appear immediately or after a few iterations?

Hi, yes, I used all the scripts from OpenNMT v6, and the nan appeared immediately after the first iteration; it didn't change even after several epochs.
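Not a fix, but a nan perplexity at the very first iteration often points at bad values coming in from the embedding file rather than training divergence. A quick check is to scan the converted matrix for NaN or infinite entries before training; a minimal pure-Python sketch, assuming the matrix is available as a list of rows:

```python
import math

def find_bad_rows(matrix):
    """Return indices of rows containing NaN or infinite values."""
    return [i for i, row in enumerate(matrix)
            if any(math.isnan(x) or math.isinf(x) for x in row)]

matrix = [[0.1, 0.2], [float("nan"), 0.3], [0.4, float("inf")]]
print(find_bad_rows(matrix))  # → [1, 2]
```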

There was actually an issue when reading word2vec files:

Didn't other people encounter issues when loading word2vec files?

Thanks, I will look into it when my GPU is free again.

Not an immediate answer to your question @ZexCeedd, but somewhat related.

I also ran into a zlib error when running the th tools/embeddings.lua script:

stack traceback:
module 'zlib' not found: No LuaRocks module found for zlib
[C]: in function 'error'
/home/usr/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
tools/embeddings.lua:5: in main chunk
[C]: in function 'dofile'
…ooai/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

I ran:
$ luarocks install lua-zlib

and that seemed to fix it.

luarocks install lzlib
Installing https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master/lzlib-0.4.1.53-1.src.rock

Error: Could not find library file for ZLIB
No file libz.a in /usr/lib
No file libz.so in /usr/lib
No file matching libz.so.* in /usr/lib
You may have to install ZLIB in your system and/or pass ZLIB_DIR or ZLIB_LIBDIR to the luarocks command.
Example: luarocks install lzlib ZLIB_DIR=/usr/local

What's the problem?

@micaliu153, (a bit late) you do need to install zlib on your system first; on Ubuntu, sudo apt-get install zlib1g-dev will work.