When training with small amounts of data, performance can be improved
by starting with pretrained embeddings. The arguments
-pre_word_vecs_dec and -pre_word_vecs_enc can be used to specify
these files. The pretrained embeddings must be manually constructed
torch serialized matrices that correspond to the src and tgt
dictionary files. By default these embeddings will be updated during
training, but they can be held fixed using -fix_word_vecs_enc and
-fix_word_vecs_dec.
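To make the "correspond to the dictionary" requirement concrete, here is a minimal Python sketch of the alignment step only. It assumes each dictionary line has the form "token index" and that words missing from the pretrained set get small random vectors. load_dict and build_matrix are hypothetical helper names, not part of OpenNMT. The result would still have to be saved as a torch serialized matrix, which this sketch does not do.

```python
import random


def load_dict(lines):
    """Return tokens ordered by index, assuming 'token index' per line."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        entries.append((int(parts[1]), parts[0]))
    return [token for _, token in sorted(entries)]


def build_matrix(tokens, vectors, dim, seed=0):
    """Build one row per dictionary token, in dictionary order.

    Tokens without a pretrained vector get a small random vector
    (an assumption made for this sketch, not a documented rule).
    """
    rng = random.Random(seed)
    matrix, missing = [], 0
    for token in tokens:
        vec = vectors.get(token)
        if vec is None:
            vec = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            missing += 1
        elif len(vec) != dim:
            raise ValueError("wrong vector size for %r" % token)
        matrix.append(list(vec))
    return matrix, missing


# Toy data, invented for illustration only.
dict_lines = ["<blank> 1", "<unk> 2", "hello 3", "world 4"]
pretrained = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}

tokens = load_dict(dict_lines)
matrix, missing = build_matrix(tokens, pretrained, dim=2)
print(len(matrix), missing)  # 4 rows, 2 of them randomly initialised
```

The point is that row i of the matrix must belong to the word with index i in the dictionary, and the row count must equal the dictionary size. The encoder and decoder matrices are built separately from the src and tgt dictionaries.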
The detailed configuration procedure is as follows:
## Install Google word2vec
Download word2vec from https://code.google.com/archive/p/word2vec/
wget https://storage.googleapis.com/google-code-archive-source/v2/code.google.com/word2vec/source-archive.zip
unzip source-archive.zip
cd word2vec
make
Train the word vectors and write them out as a binary file.
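For anyone who wants to inspect that binary file before converting it, here is a rough Python sketch of a reader. It assumes the usual word2vec binary layout: a text header "vocab_size dim", then for each word the token, a space, dim 32-bit floats, and a trailing newline. Little-endian floats are also an assumption (the file uses the byte order of the machine that wrote it). The function names are made up for this example.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Read a word2vec binary stream into a {word: [floats]} dict."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        chars = []
        while True:
            ch = stream.read(1)
            if ch == b"":
                raise EOFError("file ended while reading a word")
            if ch == b" ":
                break  # a space terminates the word
            if ch != b"\n":  # skip the newline left by the previous entry
                chars.append(ch)
        word = b"".join(chars).decode("utf-8", errors="replace")
        data = stream.read(4 * dim)
        if len(data) != 4 * dim:
            raise EOFError("file ended while reading a vector")
        vectors[word] = list(struct.unpack("<%df" % dim, data))
    return vectors


def write_word2vec_binary(vectors, dim):
    """Write the same layout to memory, only to exercise the reader."""
    out = io.BytesIO()
    out.write(("%d %d\n" % (len(vectors), dim)).encode("ascii"))
    for word, vec in vectors.items():
        out.write(word.encode("utf-8") + b" ")
        out.write(struct.pack("<%df" % dim, *vec))
        out.write(b"\n")
    out.seek(0)
    return out


# Round trip on toy data (values chosen to be exact in float32).
sample = {"hello": [0.5, -1.0], "world": [0.25, 2.0]}
loaded = read_word2vec_binary(write_word2vec_binary(sample, dim=2))
print(loaded == sample)  # True
```

For a real file you would pass open(path, "rb") instead of the in-memory stream.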
Based on the original script, I wrote another one that concatenates two pretrained word2vec embeddings.
You can also use it to convert word2vec embeddings, like so:
th tools/concat_embedding.lua -dict_file $PATH_TO_DICT -global_embed $PATH_TO_EMBED -save_data $PATH_TO_SAVE
The script is meant to concatenate a global embedding (e.g. one trained on out-of-domain data such as Google News) with a local one trained on the corpus.
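The concatenation idea itself is simple. Here is a small Python sketch of it, not the Lua script's actual code: for every dictionary word, the global vector and the local vector are joined into one longer vector. Filling in zeros when a word is missing from one of the two sets is an assumption made for this example, and concat_embeddings is a hypothetical name.

```python
def concat_embeddings(tokens, global_vecs, global_dim, local_vecs, local_dim):
    """Return one row per token: global vector followed by local vector.

    A word missing from either set gets zeros for that part
    (an assumption of this sketch).
    """
    rows = []
    for token in tokens:
        g = global_vecs.get(token, [0.0] * global_dim)
        l = local_vecs.get(token, [0.0] * local_dim)
        if len(g) != global_dim or len(l) != local_dim:
            raise ValueError("wrong vector size for %r" % token)
        rows.append(list(g) + list(l))
    return rows


# Toy data, invented for illustration only.
tokens = ["hello", "world"]
global_vecs = {"hello": [1.0, 2.0], "world": [3.0, 4.0]}
local_vecs = {"hello": [9.0]}

rows = concat_embeddings(tokens, global_vecs, 2, local_vecs, 1)
print(rows)  # [[1.0, 2.0, 9.0], [3.0, 4.0, 0.0]]
```

The resulting vector size is the sum of the two input sizes, so the word embedding size used for training has to match that sum.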
I'm still new to Lua.
I'm having problems with zlib when using the embedding conversion code from the PR:
error loading module 'zlib' from file '/usr/local/lib/lua/5.1/zlib.so':
/usr/local/lib/lua/5.1/zlib.so: undefined symbol: lua_tointeger
stack traceback:
[C]: in function 'error'
/home/user/torch/install/share/lua/5.2/trepl/init.lua:389: in function 'require'
tools/embedding_convert.lua:4: in main chunk
[C]: in function 'dofile'
...olas/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: in ?
I am trying to use pretrained embeddings in the training step, but I get unusual perplexity values: "... Perplexity nan". Any ideas where I could have gone wrong? The train command looks like:
Not an immediate answer to your question @ZexCeedd, but somewhat related.
I also ran into a zlib error when running the th tools/embeddings.lua script:
stack traceback:
module 'zlib' not found:No LuaRocks module found for zlib
[C]: in function 'error'
/home/usr/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
tools/embeddings.lua:5: in main chunk
[C]: in function 'dofile'
...ooai/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50
Error: Could not find library file for ZLIB
No file libz.a in /usr/lib
No file libz.so in /usr/lib
No file matching libz.so.* in /usr/lib
You may have to install ZLIB in your system and/or pass ZLIB_DIR or ZLIB_LIBDIR to the luarocks command.
Example: luarocks install lzlib ZLIB_DIR=/usr/local