How to use GloVe pre-trained embeddings in OpenNMT-py

pltrdy · January 10, 2018, 11:22am

It’s ok if you received 3 files, as I stated.

Preprocessing constructs the vocabularies (source and target) and creates a numerical representation of the source/target text by mapping each word to its corresponding vocabulary id.

Then train.pt (resp. valid.pt) contains tensors that represent both source and target training (resp. valid) sequences. The vocabularies are built from the training dataset and stored in vocab.pt.

caozhen-alex · January 11, 2018, 3:34am

Thanks for the detailed explanation, pltrdy. But when you say “train.pt (resp. valid.pt) contains tensors that represent both source and target training (resp. valid) sequences”, do the tensors here hold the vocabulary ids?

pltrdy · January 11, 2018, 5:58pm

Yes. The tensors do not contain text but integer values (long) corresponding to each word’s vocabulary id.
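
As a rough illustration of that mapping, here is a plain-Python sketch with a made-up toy vocabulary. It is not the actual preprocess.py code: the special tokens, their ids, and the `numericalize` helper are assumptions for the example, and in OpenNMT-py the ids end up in long tensors rather than Python lists.

```python
# Toy vocabulary: word -> id. The special tokens and their ids are
# illustrative assumptions, not OpenNMT-py's actual values.
vocab = {"<unk>": 0, "<blank>": 1, "the": 2, "cat": 3, "sat": 4}


def numericalize(sentence, vocab):
    """Map each whitespace-separated word to its vocabulary id."""
    unk_id = vocab["<unk>"]
    return [vocab.get(word, unk_id) for word in sentence.split()]


ids = numericalize("the cat sat down", vocab)
print(ids)  # [2, 3, 4, 0]: "down" is out of vocabulary, so it maps to <unk>
```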

anand · January 18, 2018, 7:21pm

Hey! I was wondering what happens when one translates from, say, German to English and wants to use GloVe word embeddings. I couldn’t find pre-trained GloVe word embeddings for the German language.

Thanks

xiadingZ · January 19, 2018, 10:54am

If I only want to use English word embeddings, how do I load them for my custom vocabulary? Say I have a vocabulary of 10000 English words: how do I load the matching weights to initialize the Embedding weight?

pltrdy · January 19, 2018, 1:39pm

@anand there are no GloVe pre-trained embeddings for German as far as I know.

@xiadingZ that’s exactly the purpose of this tutorial. Following it step by step should do it. You can specify the vocabulary size in preprocess by using the flags: see https://github.com/OpenNMT/OpenNMT-py/blob/master/opts.py#L155

xiadingZ · January 20, 2018, 3:12am

preprocess.py requires -train_tgt, -valid_src and so on…
What if I only have a captions.txt, which contains one caption per line? I want to turn it into a vocab (an index-to-word or word-to-index mapping) and the corresponding Embedding weight. How should I process it? Can you give me an example, or tell me which opts I should set?
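
One possible approach outside of preprocess.py, sketched in plain Python under stated assumptions: build the word-to-index mapping from the caption lines, then fill one vector per index from a dictionary of pre-trained vectors, falling back to small random values for missing words. The helper names (`build_vocab`, `build_weights`), the special tokens, and the toy data are all hypothetical and not part of OpenNMT-py.

```python
import random
from collections import Counter


def build_vocab(lines, max_size=10000, specials=("<unk>", "<pad>")):
    """Return a word -> index dict holding the most frequent words."""
    counts = Counter(word for line in lines for word in line.split())
    frequent = [w for w, _ in counts.most_common(max_size) if w not in specials]
    return {word: idx for idx, word in enumerate(list(specials) + frequent)}


def build_weights(vocab, pretrained, dim, seed=0):
    """Return one vector per vocabulary index, as a list of lists."""
    rng = random.Random(seed)
    weights = [None] * len(vocab)
    for word, idx in vocab.items():
        if word in pretrained:
            weights[idx] = list(pretrained[word])
        else:
            # No pre-trained vector: fall back to small random values.
            weights[idx] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return weights


# Toy data standing in for captions.txt and for a parsed GloVe file.
captions = ["a dog runs", "a cat sleeps"]
glove = {"a": [0.1, 0.2], "dog": [0.3, 0.4]}

vocab = build_vocab(captions)
weights = build_weights(vocab, glove, dim=2)
print(vocab["a"], weights[vocab["dog"]])
```

With PyTorch available, the resulting list could then be converted with `torch.tensor(weights)` and used to initialize an embedding layer, for example through `torch.nn.Embedding.from_pretrained`.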

caozhen-alex · February 2, 2018, 9:39am

Hi pltrdy. Thank you for your contribution. By the way, what about the translation step? Can we also make use of the embeddings there?

pltrdy · February 4, 2018, 6:29pm

Hmm, I’m not sure I get your point.

The word embedding vectors are part of the model. Using GloVe only at translation time does not really make sense: the model needs to be trained with it.

lucien0410 · February 17, 2018, 6:44am

I am getting the following error, any idea how to fix it please?

```
(cheny) [cheny@elgato-login OpenNMT-py]$ ./tools/embeddings_to_torch.py -emb_file "/extra/cheny/glove.840B.300d.txt" -dict_file "/extra/cheny/gpu.vocab.pt" -output_file "data/grammar_checker/embeddings"
From: /extra/cheny/gpu.vocab.pt
* source vocab: 50002 words
* target vocab: 50004 words
Traceback (most recent call last):
  File "./tools/embeddings_to_torch.py", line 94, in <module>
    main()
  File "./tools/embeddings_to_torch.py", line 63, in main
    embeddings = get_embeddings(opt.emb_file)
  File "./tools/embeddings_to_torch.py", line 39, in get_embeddings
    embs[l_split[0]] = [float(em) for em in l_split[1:]]
  File "./tools/embeddings_to_torch.py", line 39, in <listcomp>
    embs[l_split[0]] = [float(em) for em in l_split[1:]]
ValueError: could not convert string to float: '.'
```

lucien0410 · February 17, 2018, 8:01pm

I figured out what goes wrong. It is caused by a problem in the pre-trained word-embedding file.
Let l be one line of that file, split into tokens.
get_embeddings(file) assumes that every element of l[1:] is a numeric string (one that can be converted to float). This is not always true: sometimes l[1] or l[2] is '.'.

> def get_embeddings(file):
>     embs = dict()
>     for l in open(file, 'rb').readlines():
>         l_split = l.decode('utf8').strip().split()
>         if len(l_split) == 2:
>             continue
>         embs[l_split[0]] = [float(em) for em in l_split[1:]]
>     print("Got {} embeddings from {}".format(len(embs), file))
> 
>     return embs

What are the best strategies to deal with this error? Fix the word-embedding file separately, or define extra steps in get_embeddings(file) to detect and fix or ignore the error vector on the fly?

pltrdy · February 26, 2018, 10:56am

Hmm, interesting. I guess the best option is to tweak the script, so that the initial file remains unchanged.

It would be perfect if you open a PR for this.

Thanks.

tyahmed · February 28, 2018, 8:54am

@lucien0410 I solved this by changing the character used to split each line in l_split = l.decode('utf8').strip().split(). Make sure the embeddings file uses that same character to separate the vector components.

@pltrdy can you check the issue I posted here? I even get good results after changing the provided embeddings_to_torch.py script?

lucien0410 · February 28, 2018, 11:27pm

@tyahmed
I don’t think it’s a unicode decoding problem. For example, line 52344 of glove.840B.300d.txt (http://nlp.stanford.edu/data/glove.840B.300d.zip) is

'. . .' followed by 300 numbers.

After

l_split = l.decode('utf8').strip().split()

is executed, l_split is

['.', '.', '.', '-0.1573', '-0.29517', '0.30453', …

Now the length is not right (303 instead of 301), and the second and third elements are '.', which cannot be converted to float.

@pltrdy
I figure we can find the number of dimensions by counting how many tokens on a line can be converted to float.

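def get_dimension_size(line):
    # Count the tokens on one line (given as bytes) that parse as floats.
    size = 0
    l_split = line.decode('utf8').strip().split()
    for i in l_split:
        try:
            float(i)
            size += 1
        except ValueError:
            # Not a number, so it is part of the word.
            pass
    return size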

Then the size indicates where the boundary between the word and the numbers should be:

def get_embeddings(file):
    embs = dict()
    # Read the first line as bytes, because get_dimension_size() decodes it.
    # This assumes the first line is a regular vector line, not a header.
    with open(file, 'rb') as f:
        first_line = f.readline()
    dimension = get_dimension_size(first_line)  # look at the first line to get the dimension
    with open(file, 'rb') as f:
        for l in f:
            l_split = l.decode('utf8').strip().split()
            if len(l_split) == 2:
                continue
            emb = l_split[-dimension:]  # use the dimension to mark the boundary
            word = ''.join(l_split[:-dimension])
            embs[word] = [float(em) for em in emb]
    # Report once, after the whole file has been read.
    print("Got {} embeddings from {}".format(len(embs), file))
    return embs

May not be the most elegant way to solve the problem, but it works …

tyahmed · March 1, 2018, 10:59am

Actually, I didn’t mean the encoding of the vectors but the character on which the lines are split. In my case, for instance, I simply changed l_split = l.decode('utf8').strip().split() to l_split = l.decode('utf8').strip().split(' '). It was for fastText embeddings.

lucien0410 · March 1, 2018, 11:57pm

Oh! Your simple solution works!

It seems that Python’s str.split() treats u'\xa0' (non-breaking space) as a delimiter.
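
A small check of that behaviour, as a sketch on a made-up line rather than a real GloVe entry: with no argument, `str.split()` splits on any Unicode whitespace, while `split(' ')` splits only on the ASCII space.

```python
# A made-up embedding line whose "word" contains a non-breaking space (U+00A0).
line = "a\u00a0b 0.1 0.2 0.3"

# With no argument, str.split() splits on any Unicode whitespace,
# including U+00A0, so the word is broken into two tokens.
default_split = line.strip().split()

# With an explicit ' ' separator, only ASCII spaces delimit tokens,
# so the word stays in one piece.
space_split = line.strip().split(' ')

print(default_split)  # ['a', 'b', '0.1', '0.2', '0.3']
print(space_split)    # ['a\xa0b', '0.1', '0.2', '0.3']
```

One caveat with `split(' ')`: consecutive spaces produce empty strings in the result, so it only behaves well on files that use exactly one space between fields.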

datddd · March 6, 2018, 6:12am

Hi, I would like to confirm two points:

1. The embedding is only used for initialization, and its values will be updated during the training phase.
2. Is it possible to keep the embedding values unchanged during the training phase?

I notice the option -fix_word_vecs_dec; will this solve point #2?

Thanks.

emartinezVic · March 6, 2018, 9:56am

When you use pre-trained embeddings, they are loaded when the encoder/decoder is initialized, and their values will be updated during training as long as you do not set the -fix_word_vecs_dec or -fix_word_vecs_enc options to true.

matt · March 24, 2018, 11:26am

When following the above guide, the following error occurs:

IOError: [Errno 2] No such file or directory: '…/onmt_merge/sorted_tokens//train.src.txt'

And indeed there is no file there, even when manually fixing the two forward slashes.

Any suggestions?

Anis · March 28, 2018, 8:25pm

Hello @pltrdy,
Translating from language 1 to language 2 presupposes word vectors for language 1 and word vectors for language 2. However, you are using the same GloVe file (containing words of only one language) to get the word vectors for the vocabularies of both languages, which does not look correct. Am I missing something?