How to use GloVe pre-trained embeddings in OpenNMT-py

Yes. Tensors do not contain text but integer (long) values, each corresponding to a word’s vocabulary id.
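
As a rough illustration of that point, here is a minimal sketch of mapping tokens to vocabulary ids. The toy vocabulary, the special tokens and the helper name `to_ids` are made up for this example; this is not OpenNMT-py's actual preprocessing code.

```python
# Toy vocabulary; the special tokens and their ids are assumptions for illustration.
vocab = {"<unk>": 0, "<pad>": 1, "the": 2, "cat": 3, "sat": 4}


def to_ids(tokens, vocab):
    """Map each token to its vocabulary id, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


ids = to_ids("the cat sat down".split(), vocab)
print(ids)  # [2, 3, 4, 0]
```

A list of ids like this is what ends up in the long tensor; the text itself is never stored there.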

Hey! I was wondering what happens when one translates from, say, German to English and wants to use GloVe word embeddings. I couldn’t find pre-trained GloVe word embeddings for German.

Thanks

If I only want to use English word embeddings, how do I load them for my custom vocabulary? Say I have a vocabulary of 10000 English words: how do I load the matching weights to initialize the Embedding weight?

@anand there are no pre-trained GloVe embeddings for German as far as I know.

@xiadingZ that’s exactly the purpose of this tutorial. Following it step by step should do it. You can specify the vocabulary size in preprocess by using the flags: see https://github.com/OpenNMT/OpenNMT-py/blob/master/opts.py#L155

preprocess.py requires -train_tgt, -valid_src and so on…
I only have a captions.txt, which contains captions line by line. I want to turn it into a vocab (index-to-word or word-to-index mapping) and the corresponding Embedding weight. How should I process it? Can you give me an example, or tell me which opts I should set?
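
For the single-file case, one possible approach outside of preprocess.py is sketched below. It is only an illustration under assumptions: the helper names `build_vocab` and `build_weights`, the special tokens, and the tiny in-memory stand-ins for captions.txt and for the parsed GloVe vectors are all hypothetical.

```python
import random
from collections import Counter


def build_vocab(lines, max_size=10000, specials=("<unk>", "<pad>")):
    """Return a word -> index mapping: specials first, then the most frequent words."""
    counts = Counter(tok for line in lines for tok in line.split())
    frequent = [w for w, _ in counts.most_common(max_size) if w not in specials]
    return {w: i for i, w in enumerate(list(specials) + frequent)}


def build_weights(vocab, pretrained, dim, seed=0):
    """Build one row per vocabulary id; words without a vector get small random values."""
    rng = random.Random(seed)
    rows = [None] * len(vocab)
    for word, idx in vocab.items():
        if word in pretrained:
            rows[idx] = list(pretrained[word])
        else:
            rows[idx] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return rows


captions = ["a dog runs", "a cat sleeps"]     # stands in for the lines of captions.txt
glove = {"a": [0.1, 0.2], "dog": [0.3, 0.4]}  # stands in for parsed GloVe vectors
vocab = build_vocab(captions)
weights = build_weights(vocab, glove, dim=2)
```

Row `i` of `weights` belongs to the word with id `i`, so the list can be converted to a float tensor and used to initialize an embedding layer (in PyTorch, for example, via `torch.nn.Embedding.from_pretrained`).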

Hi pltrdy. Thank you for your contribution. By the way, what about the translation step? Can we also make use of the embeddings there?

Hmm, I’m not sure I get your point.

The word embedding vectors are part of the model. Using GloVe only at translation time does not really make sense; the model needs to be trained with it.

I am getting the following error, any idea how to fix it please?

'''
(cheny) [cheny@elgato-login OpenNMT-py]$ ./tools/embeddings_to_torch.py -emb_file "/extra/cheny/glove.840B.300d.txt" -dict_file "/extra/cheny/gpu.vocab.pt" -output_file "data/grammar_checker/embeddings"
From: /extra/cheny/gpu.vocab.pt
* source vocab: 50002 words
* target vocab: 50004 words
Traceback (most recent call last):
  File "./tools/embeddings_to_torch.py", line 94, in
    main()
  File "./tools/embeddings_to_torch.py", line 63, in main
    embeddings = get_embeddings(opt.emb_file)
  File "./tools/embeddings_to_torch.py", line 39, in get_embeddings
    embs[l_split[0]] = [float(em) for em in l_split[1:]]
  File "./tools/embeddings_to_torch.py", line 39, in
    embs[l_split[0]] = [float(em) for em in l_split[1:]]
ValueError: could not convert string to float: '.'
'''

I figured out what goes wrong. It is caused by a quirk of the pre-trained word-embedding file.
Let ‘l’ be one line of the file after splitting.
get_embeddings(file) assumes that every element of l[1:] is a numeric string (one that can be converted to float). This is not always true: sometimes l[1] or l[2] is ‘.’.

> def get_embeddings(file):
>     embs = dict()
>     for l in open(file, 'rb').readlines():
>         l_split = l.decode('utf8').strip().split()
>         if len(l_split) == 2:
>             continue
>         embs[l_split[0]] = [float(em) for em in l_split[1:]]
>     print("Got {} embeddings from {}".format(len(embs), file))
> 
>     return embs

What is the best strategy to deal with this error: fix the word-embedding file separately, or add extra steps to get_embeddings(file) that detect and fix (or ignore) the bad vector on the fly?

Hmm interesting. I guess the best is to tweak the script, so that the initial file remains unchanged.

It would be perfect if you opened a PR for this.

Thanks.
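
The "ignore the bad vector on the fly" option from the question above could look roughly like the sketch below. It assumes the vector dimension is known in advance; `parse_vectors` is a hypothetical helper and not part of embeddings_to_torch.py.

```python
def parse_vectors(lines, dim):
    """Parse 'word v1 ... v_dim' lines and return (embeddings, skipped_count)."""
    embs, skipped = {}, 0
    for line in lines:
        parts = line.strip().split()
        try:
            if len(parts) != dim + 1:
                raise ValueError("unexpected number of fields")
            embs[parts[0]] = [float(x) for x in parts[1:]]
        except ValueError:
            skipped += 1  # the file stays untouched; the bad line is just ignored
    return embs, skipped


sample = ["cat 0.1 0.2 0.3", ". . . 0.4 0.5 0.6", "dog 0.7 0.8 0.9"]
embs, skipped = parse_vectors(sample, dim=3)
print(len(embs), skipped)  # 2 1
```

Counting the skipped lines makes it easy to check how many vectors are lost this way before deciding whether a real fix is needed.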

@lucien0410 I solved this by changing the Unicode character used to split each line in l_split = l.decode('utf8').strip().split(). Make sure the embeddings file uses the same character to separate the vector components.

@pltrdy can you check the issue I posted here? I even get good results after changing the provided embeddings_to_torch.py script?

@tyahmed
I don’t think it’s a Unicode decoding problem. For example, line 52344 of the embedding file ‘glove.840B.300d.txt’ (http://nlp.stanford.edu/data/glove.840B.300d.zip) is

‘. . .’ followed by 300 numbers.

After

l_split = l.decode('utf8').strip().split()

is executed, l_split is

['.', '.', '.', '-0.1573', '-0.29517', '0.30453', …

Now the length is wrong (303 instead of 301), and the second and third elements are ‘.’, which cannot be converted to float.

@pltrdy
I figure we can find the number of dimensions by counting how many of the elements can be converted to float.

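def get_dimension_size(line):
    """Count how many whitespace-separated tokens of `line` parse as floats."""
    if isinstance(line, bytes):  # accept raw lines read in 'rb' mode as well as str
        line = line.decode('utf8')
    size = 0
    for token in line.strip().split():
        try:
            float(token)
            size += 1
        except ValueError:  # not a number, so it is part of the word
            pass
    return size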

Then the size indicates where the boundary between the word and the numbers should be:

def get_embeddings(file):
    embs = dict()
    with open(file, 'rb') as f:
        # look at the first line to get the dimension; this assumes the first
        # line is a regular vector line whose word is not itself a number
        dimension = get_dimension_size(f.readline())
        f.seek(0)
        for l in f:
            l_split = l.decode('utf8').strip().split()
            if len(l_split) == 2:
                continue
            emb = l_split[-dimension:]  # use the dimension to mark the boundary
            word = ''.join(l_split[:-dimension])  # re-join a word that was split
            embs[word] = [float(em) for em in emb]
    # report once, after the whole file has been read
    print("Got {} embeddings from {}".format(len(embs), file))
    return embs

May not be the most elegant way to solve the problem, but it works …
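
A variant of the same boundary idea, sketched here under the assumptions that the dimension is already known and that fields are separated by single ASCII spaces, is to split from the right so that the word is never cut apart. `split_vector_line` is a hypothetical helper, not part of the script.

```python
def split_vector_line(line, dim):
    """Split off the last `dim` fields as the vector; whatever remains is the word."""
    parts = line.rstrip("\r\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError("line has fewer than {} vector fields".format(dim))
    word = parts[0]
    values = [float(x) for x in parts[1:]]
    return word, values


word, vec = split_vector_line(". . . -0.1573 -0.29517 0.30453\n", dim=3)
print(repr(word), vec)  # '. . .' [-0.1573, -0.29517, 0.30453]
```

Unlike joining the leading tokens with `''.join(...)`, this keeps the word exactly as it appears in the file, including any spaces inside it.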

Actually, I didn’t mean the encoding of the vectors but the character on which the lines are split. In my case, for instance, I simply changed l_split = l.decode('utf8').strip().split() to l_split = l.decode('utf8').strip().split(' '). It was for fastText embeddings.

Oh! Your simple solution works!

It seems that Python’s str.split(), when called without arguments, treats u'\xa0' (non-breaking space) as a delimiter.
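
A quick way to see the difference, using a made-up line whose token contains a non-breaking space:

```python
# The token "a\xa0b" contains a non-breaking space (U+00A0); the line is made up.
line = "a\xa0b 0.1 0.2"

default_split = line.split()     # splits on any Unicode whitespace, U+00A0 included
space_split = line.split(" ")    # splits on the ASCII space only

print(len(default_split))  # 4: the token was broken in two
print(len(space_split))    # 3: the token was kept whole
```

This is why `split(' ')` keeps such tokens intact while the argument-less `split()` produces extra fields.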

Hi, I would like to confirm two things:

  1. The embedding is only used for initialization, and its values will be updated during the training phase. Is that right?
  2. Is it possible to keep the embedding values unchanged during the training phase?

I noticed the option -fix_word_vecs_dec. Will this solve point #2?

Thanks.

When you use pre-trained embeddings, you load them when initializing the encoder/decoder, and their value will be updated during training as long as you do not set the -fix_word_vecs_dec or -fix_word_vecs_enc options to true.

When following the above guide, the following error occurs:

IOError: [Errno 2] No such file or directory: '…/onmt_merge/sorted_tokens//train.src.txt'

And indeed there is no file there, even when manually fixing the two forward slashes.

Any suggestions?

Hello @pltrdy ,
Translating from language 1 to language 2 supposes that you use word vectors for language 1 and word vectors for language 2. However, you are using the same GloVe file (containing words of only one language) to get word vectors for the vocabularies of both languages, which does not look correct. Am I missing something?

That’s correct, the script should allow passing two embedding files. In my case I’m working on summarization, so the encoder and decoder use the same language.

Do not hesitate to suggest a pull request if you fix it.

Thanks.
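
Until the script supports that, the two-file idea could be sketched as below: each side's vocabulary is matched against its own set of vectors. This is only an illustration; `filter_to_vocab` and the tiny dictionaries standing in for two parsed embedding files are hypothetical.

```python
def filter_to_vocab(pretrained, vocab_words):
    """Keep only the vectors for words in the vocabulary and report coverage."""
    kept = {w: pretrained[w] for w in vocab_words if w in pretrained}
    coverage = len(kept) / len(vocab_words) if vocab_words else 0.0
    return kept, coverage


# Hypothetical stand-ins for vectors parsed from two separate files.
src_vectors = {"haus": [0.1, 0.2], "katze": [0.3, 0.4]}  # e.g. German vectors
tgt_vectors = {"house": [0.5, 0.6], "cat": [0.7, 0.8]}   # e.g. English vectors

src_kept, src_cov = filter_to_vocab(src_vectors, ["haus", "katze", "hund"])
tgt_kept, tgt_cov = filter_to_vocab(tgt_vectors, ["house", "cat"])
```

Checking the coverage per side also makes the original problem visible: matching a source vocabulary against vectors from the wrong language would give a coverage close to zero.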

Ah, OK, thanks. Now it makes sense.