Replacing word embedding with lstm over character embedding for rare words in Embeddings.py

pytorch

(Ratish Puduppully) #1

Hi, I want to replace the word embedding with an LSTM over character embeddings when the word is rare, e.g. when its count in the vocabulary is < 5. This is similar to slide 35 of http://www.phontron.com/slides/emnlp2016-dynet-tutorial-part2.pdf, whose DyNet code is reproduced here:

```python
def word_rep(w, cf_init, cb_init):
    if wc[w] > 5:
        w_index = vw.w2i[w]
        return WORDS_LOOKUP[w_index]
    else:
        char_ids = [vc.w2i[c] for c in w]
        char_embs = [CHARS_LOOKUP[cid] for cid in char_ids]
        fw_exps = cf_init.transduce(char_embs)
        bw_exps = cb_init.transduce(reversed(char_embs))
        return dy.concatenate([fw_exps[-1], bw_exps[-1]])
```

I studied Embeddings.py. There we first construct `embeddings = [nn.Embedding(vocab, dim, padding_idx=pad) for vocab, dim, pad in emb_params]`. This definition is not aware of the input word. How do I change it so that it first inspects each word and its count in the vocabulary, and then decides whether to do a direct lookup or to run an LSTM over its characters?
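To make the question concrete, here is a rough sketch of the kind of module I have in mind — all the names here (`HybridEmbedding`, `rare_set`, `char_lstm`, ...) are made up by me for illustration and are not from Embeddings.py:

```python
import torch
import torch.nn as nn

class HybridEmbedding(nn.Module):
    """Sketch: word-embedding lookup for frequent words, bidirectional
    char-LSTM for rare words (mirroring the DyNet word_rep above)."""

    def __init__(self, word_vocab, char_vocab, dim, char_dim, rare_set):
        super().__init__()
        self.rare_set = rare_set  # set of word indices considered rare
        self.word_emb = nn.Embedding(word_vocab, dim)
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # concatenating the two final hidden states gives `dim` features
        self.char_lstm = nn.LSTM(char_dim, dim // 2, bidirectional=True)

    def embed_word(self, w_index, char_ids):
        if w_index not in self.rare_set:
            # frequent word: plain embedding lookup
            return self.word_emb(torch.tensor([w_index])).squeeze(0)
        # rare word: run a BiLSTM over its character embeddings
        chars = self.char_emb(torch.tensor(char_ids)).unsqueeze(1)  # (T, 1, char_dim)
        _, (h_n, _) = self.char_lstm(chars)  # h_n: (2, 1, dim // 2)
        return torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)  # (dim,)
```

This is word-by-word and therefore slow; presumably a real version would batch the rare and frequent words separately, but it shows the per-word branching I am after.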
Thanks for any leads.
Ratish