How to use GloVe pre-trained embeddings in OpenNMT-py

pytorch

(Yuan-Lu Chen) #21

@tyahmed
I don’t think it’s the unicode decoding problem. For example, the 52344th line of the embedding of ‘glove.840B.300d.txt’ (http://nlp.stanford.edu/data/glove.840B.300d.zip) is

‘. . .’ followed by 300 numbers.

After

l_split = l.decode('utf8').strip().split()

is executed, l_split is

['.', '.', '.', '-0.1573', '-0.29517', '0.30453', …

Now the length is not right (303 instead of 301), and the second and third elements are '.', which cannot be converted to floats.
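To see why, note that the GloVe token on that line is the sequence '. . .', which itself contains spaces, so a whitespace split cuts it into three separate tokens. A toy illustration with a 3-dimensional line standing in for the real 300-dimensional one:

```python
# Toy version of the problematic GloVe line: the "word" is ". . ." and
# contains spaces, so a plain whitespace split breaks it apart.
line = ". . . -0.1573 -0.29517 0.30453"
tokens = line.split()
print(len(tokens))   # 6 tokens instead of the expected 4 (1 word + 3 numbers)
print(tokens[:3])    # ['.', '.', '.']
```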

@pltrdy
I figure we can find the number of dimensions by counting how many tokens can be parsed as floats.

def get_dimension_size(line):
    # Count how many tokens in the line parse as floats; that count is
    # the embedding dimension, since only the vector entries are numeric.
    size = 0
    l_split = line.decode('utf8').strip().split()
    for token in l_split:
        try:
            float(token)
            size += 1
        except ValueError:
            pass
    return size

Then the size indicates where the boundary between the word and the numbers should be:

def get_embeddings(file):
    embs = dict()
    # Look at the first line to get the dimension (open in 'rb' so that
    # get_dimension_size can decode the bytes).
    first_line = open(file, 'rb').readline()
    dimension = get_dimension_size(first_line)
    for l in open(file, 'rb'):
        l_split = l.decode('utf8').strip().split()
        if len(l_split) == 2:
            continue  # skip a fastText-style "<count> <dim>" header line
        emb = l_split[-dimension:]            # the last `dimension` tokens are the numbers
        word = ''.join(l_split[:-dimension])  # everything before is the word
        embs[word] = [float(em) for em in emb]
    print("Got {} embeddings from {}".format(len(embs), file))
    return embs

May not be the most elegant way to solve the problem, but it works …
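The same idea can be sketched without file I/O; the helper names below (`count_floats`, `parse_line`) are mine, not from the script:

```python
def count_floats(tokens):
    # Count how many tokens parse as floats; mirrors get_dimension_size
    # above, but on an already-split token list.
    n = 0
    for tok in tokens:
        try:
            float(tok)
            n += 1
        except ValueError:
            pass
    return n

def parse_line(line, dim):
    # The last `dim` tokens are the vector; everything before is the word
    # (joined with '', matching the script above).
    tokens = line.strip().split()
    word = ''.join(tokens[:-dim])
    vec = [float(t) for t in tokens[-dim:]]
    return word, vec

line = ". . . -0.1573 -0.29517 0.30453"
dim = count_floats(line.split())  # 3
word, vec = parse_line(line, dim)
print(word)  # "..."
print(vec)   # [-0.1573, -0.29517, 0.30453]
```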


#22

Actually, I didn’t mean the encoding of the vectors but the unicode character by which the lines are split. For my case, for instance, I simply changed l_split = l.decode('utf8').strip().split() to l_split = l.decode('utf8').strip().split(' '). It was for fastText embeddings.


(Yuan-Lu Chen) #23

Oh! Your simple solution works!

It seems that python’s

str.split()

treats u'\xa0' (non-breaking space) as a delimiter.
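A quick demonstration of the difference (the NBSP here stands in for one embedded in a fastText vocabulary entry):

```python
# A vocabulary entry containing a non-breaking space (U+00A0):
line = "foo\xa0bar 0.1 0.2"

# split() with no argument treats NBSP as whitespace and breaks the word:
print(line.split())     # ['foo', 'bar', '0.1', '0.2']

# split(' ') only splits on the ASCII space, keeping the word intact:
print(line.split(' '))  # ['foo\xa0bar', '0.1', '0.2']
```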


(dat duong) #24

Hi, I would like to confirm that

  1. The embedding is only used for initialization, and that its values will be updated during the training phase.
  2. Is it possible to keep the embedding values unchanged during training phase?

I noticed the option -fix_word_vecs_dec; will this solve point #2?

Thanks.


(Eva) #25

When you use pre-trained embeddings, you load them when initializing the encoder/decoder, and their values will be updated during training as long as you do not set the -fix_word_vecs_dec or -fix_word_vecs_enc options to true.


(Matt) #26

When following the above guide, the following error occurs:

IOError: [Errno 2] No such file or directory: '…/onmt_merge/sorted_tokens//train.src.txt'

And indeed there is no file there, even when manually fixing the two forward slashes.

Any suggestions?


(Chenbeh) #27

Hello @pltrdy ,
Translating from language 1 to language 2 presupposes that you use word vectors for language 1 and word vectors for language 2. However, you are using the same GloVe file (containing words of only one language) to get word vectors for the vocabularies of both languages, which does not look correct. Am I missing something?


(Pltrdy) #29

That’s correct: the script should allow passing two embedding files. In fact, I’m working on summarization, where the encoder and decoder share the same language.

Do not hesitate to suggest a pull request if you fix it.

Thanks.


(Chenbeh) #30

Ah, OK, thanks. Now it makes sense.