Shared vocabulary for summarization problem



I want to create a seq2seq summarization model with OpenNMT. This tool is amazing! At the moment I am trying to tune the model for my needs, and I would like to exploit the following fact: the input and output texts use the same language.

Q1. Is it possible to use a shared vocabulary for input/output?
Q2. I have seen that the same words have different IDs in the *.dict files. Is this correct for my case? Is it possible to make them identical?
Q3. Is it possible to share weights between the encoder/decoder lookup tables?
Q4. I am trying to use pretrained word2vec vectors to initialize the encoder and decoder tables. Is the following workflow correct?

Step 1: preprocess the parallel corpus:
th preprocess.lua \
  -train_src corpus/train_sentences.src \
  -train_tgt corpus/train_sentences.tgt \
  -valid_src corpus/val_sentences.src \
  -valid_tgt corpus/val_sentences.tgt \
  -save_data data/train/textsum

Step 2: initialize the embeddings for the input:
th tools/embeddings.lua \
  -embed_type word2vec-bin \
  -embed_file word2vec/gensim_weights.bin \
  -dict_file train/textsum.src.dict \
  -save_data embeddings/embeddings_src.bin

Step 3: initialize the embeddings for the output:
th tools/embeddings.lua \
  -embed_type word2vec-bin \
  -embed_file word2vec/gensim_weights.bin \
  -dict_file train/textsum.tgt.dict \
  -save_data embeddings/embeddings_tgt.bin

Step 4: train with the embeddings (note that -pre_word_vecs_dec must point to the target embeddings produced in step 3, not the source ones):
th train.lua \
  -data data/train/textsum-train.t7 \
  -save_model textsum \
  -validation_metric bleu \
  -pre_word_vecs_enc embeddings/embeddings_src.bin-embeddings-300.t7 \
  -src_word_vec_size 300 \
  -pre_word_vecs_dec embeddings/embeddings_tgt.bin-embeddings-300.t7 \
  -tgt_word_vec_size 300 \
  -gpuid 1


(Guillaume Klein) #2


  1. You could use the tools/build_vocabs.lua script to build a single vocabulary, then make use of the -src_vocab and -tgt_vocab options during preprocessing.
  2. That’s not an issue, actually. With the approach from answer 1, words will have the same ID, but this does not impact the training either way.
  3. It seems people don’t do this for summarization. It could save memory, but you usually also want to let the network learn embeddings specific to the encoder and to the decoder.
  4. Yes, the workflow is correct.
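
The shared-vocabulary workflow from answer 1 might look like the sketch below. The -src_vocab and -tgt_vocab options are mentioned above; the exact build_vocabs.lua options (and the concatenation step) are assumptions here, so check th tools/build_vocabs.lua -h for the option names in your OpenNMT version:

```shell
# Build one vocabulary over both sides of the corpus, since source
# and target share the same language.
cat corpus/train_sentences.src corpus/train_sentences.tgt > corpus/train_all.txt
th tools/build_vocabs.lua \
  -data corpus/train_all.txt \
  -save_vocab data/train/shared        # option names are assumptions

# Reuse the same vocabulary file for both sides during preprocessing,
# so identical words get identical IDs in source and target.
th preprocess.lua \
  -train_src corpus/train_sentences.src \
  -train_tgt corpus/train_sentences.tgt \
  -valid_src corpus/val_sentences.src \
  -valid_tgt corpus/val_sentences.tgt \
  -src_vocab data/train/shared.dict \
  -tgt_vocab data/train/shared.dict \
  -save_data data/train/textsum
```

With a single shared *.dict file, a single embeddings.lua run is enough, and the resulting embedding file can be passed to both -pre_word_vecs_enc and -pre_word_vecs_dec.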