Greetings!
I want to create a summarization seq2seq model with OpenNMT. This tool is amazing! At the moment I am trying to tune the model for my needs, and I would like to exploit the fact that the input and output texts are in the same language.
Q1: Is it possible to use a shared vocabulary for the input and output?
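(If I read the preprocess.lua options right, I could probably build one dictionary myself and pass it to both sides via -src_vocab/-tgt_vocab; the shared.dict file below is hypothetical, e.g. built from the concatenated source and target texts. Is that the intended way?)
th preprocess.lua \
 -train_src corpus/train_sentences.src \
 -train_tgt corpus/train_sentences.tgt \
 -valid_src corpus/val_sentences.src \
 -valid_tgt corpus/val_sentences.tgt \
 -src_vocab data/train/shared.dict \
 -tgt_vocab data/train/shared.dict \
 -save_data data/train/textsum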
Q2: I have seen that the same words get different IDs in the two *.dict files. Is that expected in my case? Is it possible to make them identical?
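(To check this I hacked together a small Lua script; it assumes each line of a *.dict file is "token id", which is what my files look like:)
local function load_dict(path)
  local t = {}
  for line in io.lines(path) do
    local token, id = line:match("^(%S+)%s+(%d+)")
    if token then t[token] = tonumber(id) end
  end
  return t
end

local src = load_dict("data/train/textsum.src.dict")
local tgt = load_dict("data/train/textsum.tgt.dict")

local differing, missing = 0, 0
for token, id in pairs(src) do
  if tgt[token] == nil then
    missing = missing + 1
  elseif tgt[token] ~= id then
    differing = differing + 1
  end
end
print(string.format("%d shared tokens with different IDs, %d source tokens not in target dict", differing, missing))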
Q3: Is it possible to share the weights between the encoder and decoder lookup tables?
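(By "share" I mean something like the plain Torch idiom below; this is only a sketch with made-up sizes, not the actual OpenNMT modules:)
require 'nn'

local vocab_size, dim = 50000, 300  -- made-up numbers
local encLookup = nn.LookupTable(vocab_size, dim)
local decLookup = nn.LookupTable(vocab_size, dim)

-- tie the decoder table to the encoder table so both use and update the same storage
decLookup:share(encLookup, 'weight', 'gradWeight')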
Q4: I am trying to use pretrained word2vec vectors to initialize the encoder and decoder embedding tables. Is the following workflow correct?
Step 1: preprocess the parallel corpus:
th preprocess.lua \
 -train_src corpus/train_sentences.src \
 -train_tgt corpus/train_sentences.tgt \
 -valid_src corpus/val_sentences.src \
 -valid_tgt corpus/val_sentences.tgt \
 -save_data data/train/textsum
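As far as I understand, this produces data/train/textsum.src.dict, data/train/textsum.tgt.dict and data/train/textsum-train.t7, which I use in the steps below.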
Step 2: initialize embeddings for the input side:
th tools/embeddings.lua \
 -embed_type word2vec-bin \
 -embed_file word2vec/gensim_weights.bin \
 -dict_file data/train/textsum.src.dict \
 -save_data embeddings/embeddings_src.bin
Step 3: initialize embeddings for the output side:
th tools/embeddings.lua \
 -embed_type word2vec-bin \
 -embed_file word2vec/gensim_weights.bin \
 -dict_file data/train/textsum.tgt.dict \
 -save_data embeddings/embeddings_tgt.bin
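Both runs leave me with files named like embeddings/embeddings_src.bin-embeddings-300.t7 and embeddings/embeddings_tgt.bin-embeddings-300.t7 (300 being the dimensionality of my word2vec model), and these are what I point train.lua at.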
Step 4: train with the embeddings:
th train.lua \
 -data data/train/textsum-train.t7 \
 -save_model textsum \
 -validation_metric bleu \
 -pre_word_vecs_enc embeddings/embeddings_src.bin-embeddings-300.t7 \
 -src_word_vec_size 300 \
 -pre_word_vecs_dec embeddings/embeddings_tgt.bin-embeddings-300.t7 \
 -tgt_word_vec_size 300 \
 -gpuid 1
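For what it is worth, before launching train.lua I also run a quick sanity check that the serialized embeddings match my dictionary size. My assumption here is that each .t7 file simply holds a vocab_size x 300 tensor; if it is wrapped in a table, the check needs adjusting:
require 'torch'

-- count dictionary entries (one "token id" pair per line)
local function dict_size(path)
  local n = 0
  for _ in io.lines(path) do n = n + 1 end
  return n
end

local emb = torch.load('embeddings/embeddings_src.bin-embeddings-300.t7')
print(emb:size(1), dict_size('data/train/textsum.src.dict'))  -- I expect these two to match
print(emb:size(2))                                            -- I expect 300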
Cheers,
Michael