How to use GloVe pre-trained embeddings in OpenNMT-py


(Yuan-Lu Chen) #21

I don’t think it’s a Unicode decoding problem. For example, line 52344 of the embedding file ‘glove.840B.300d.txt’ is

‘. . .’ followed by 300 numbers.


When

l_split = l.decode('utf8').strip().split()

is executed, l_split is

['.', '.', '.', '-0.1573', '-0.29517', '0.30453', …

Now the length is wrong (303 instead of 301), and the second and third elements are ‘.’, which cannot be converted to float.

I figure we can find the number of dimensions by counting how many of the elements can be converted to float.

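def get_dimension_size(line):
	l_split = line.decode('utf8').strip().split()
	size = 0
	for i in l_split:
		try:
			float(i) # raises ValueError for non-numeric tokens such as the word
			size += 1
		except ValueError:
			pass
	return size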

Then the size indicates where the boundary between the word and the numbers should be:

def get_embeddings(file):
	embs = dict()
	with open(file, 'rb') as f:
		lines = f.readlines()
	dimension = get_dimension_size(lines[0]) # look at the first line to get the dimension
	for l in lines:
		l_split = l.decode('utf8').strip().split()
		if len(l_split) == 2:
			continue # skip a two-field header line, if the file has one
		word = ' '.join(l_split[:-1*dimension]) # assumed: everything before the numbers is the word
		emb = l_split[-1*dimension:] # use the dimension to mark the boundary
		embs[word] = [float(em) for em in emb]
	print("Got {} embeddings from {}".format(len(embs), file))
	return embs

May not be the most elegant way to solve the problem, but it works …
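For comparison, here is a minimal sketch of another way to find the same boundary: split from the right, at most `dimension` times, on a plain space. This is an alternative and not the code posted above. The name `parse_embedding_line` is hypothetical, and the sketch assumes each line is a word followed by exactly `dimension` space-separated numbers.

```python
def parse_embedding_line(line, dimension):
    """Split one text-format embedding line into (word, vector)."""
    # Split from the right at most `dimension` times on a plain space, so any
    # spaces (or non-breaking spaces) inside the word itself are left alone.
    parts = line.rstrip('\r\n').rsplit(' ', dimension)
    if len(parts) != dimension + 1:
        raise ValueError("expected a word followed by {} numbers".format(dimension))
    return parts[0], [float(x) for x in parts[1:]]


# Toy 3-dimensional line whose "word" contains spaces.
word, vector = parse_embedding_line(". . . -0.1573 -0.29517 0.30453\n", 3)
print(repr(word), vector)
```

The dimension still has to be known in advance (for example 300 for glove.840B.300d.txt), so this replaces only the slicing step and not the dimension detection.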


Actually, I didn’t mean the encoding of the vectors but the Unicode character on which the lines are split. In my case, for instance, I simply changed l_split = l.decode('utf8').strip().split() to l_split = l.decode('utf8').strip().split(' '). That was for fastText embeddings.

(Yuan-Lu Chen) #23

Oh! Your simple solution works!

It seems that Python’s

split()

with no argument treats u'\xa0' (non-breaking space) as a delimiter.
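The difference between the two calls can be checked on a toy line. This is a small sketch, assuming the word contains non-breaking spaces while the fields are separated by ordinary spaces; the sample string is made up for illustration and is not copied from the GloVe file.

```python
# A made-up line: the "word" is three dots joined by non-breaking spaces.
line = ".\xa0.\xa0. -0.1573 -0.29517 0.30453"

# With no argument, str.split() splits on any Unicode whitespace,
# which includes '\xa0', so the word is broken into three tokens.
default_split = line.split()

# With an explicit ' ', only the ordinary space is a delimiter,
# so the word stays in one piece.
space_split = line.split(' ')

print(len(default_split), default_split[:3])  # 6 ['.', '.', '.']
print(len(space_split), repr(space_split[0]))  # 4 '.\xa0.\xa0.'
```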

(dat duong) #24

Hi, I would like to confirm two things:

  1. The embedding is only used for initialization, and its values will be updated during the training phase.
  2. Is it possible to keep the embedding values unchanged during the training phase?

I noticed the option -fix_word_vecs_dec. Will this solve point #2?


(Eva) #25

When you use pre-trained embeddings, they are loaded when the encoder/decoder is initialized, and their values will be updated during training as long as you do not enable the -fix_word_vecs_dec or -fix_word_vecs_enc options.
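To illustrate what initializing from pre-trained vectors versus keeping them fixed means, here is a minimal sketch in plain PyTorch. It is not OpenNMT-py's own code, and it only shows the general mechanism (a weight that does or does not receive gradients). The tensor `pretrained` is random stand-in data, and the variable names are hypothetical.

```python
import torch
import torch.nn as nn

# Stand-in for vectors loaded from an embedding file: 5 words, 3 dimensions.
pretrained = torch.randn(5, 3)

# freeze=True sets requires_grad=False on the weight; freeze=False leaves it trainable.
frozen = nn.Embedding.from_pretrained(pretrained.clone(), freeze=True)
trainable = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

optimizer = torch.optim.SGD(trainable.parameters(), lr=0.1)

# One toy training step using both embedding tables.
ids = torch.tensor([0, 1, 2])
loss = frozen(ids).sum() + trainable(ids).sum()
loss.backward()
optimizer.step()

# The frozen table still equals the pre-trained vectors; the trainable one has moved.
print(torch.equal(frozen.weight, pretrained))     # True
print(torch.equal(trainable.weight, pretrained))  # False
```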