Now the length is not right (303 instead of 301), and the second and third elements are '.', which cannot be converted to float.
@pltrdy
I figure we can find the number of dimensions by counting how many tokens on a line can be converted to float:
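def get_dimension_size(line):
    # Count the tokens on the line that can be parsed as floats.
    # `line` is expected to be bytes (read from a file opened in 'rb' mode).
    size = 0
    l_split = line.decode('utf8').strip().split()
    for i in l_split:
        try:
            _ = float(i)
            size = size + 1
        except ValueError:
            # not a number, so it is part of the word
            pass
    return size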
The size then indicates where the boundary between the word and the numbers lies:
def get_embeddings(file):
    embs = dict()
    # Read as bytes: get_dimension_size() decodes the line itself.
    with open(file, 'rb') as f:
        lines = f.readlines()
    # look at the first line to get the dimension
    # (assumes the first line is a regular entry, not a "count dim" header)
    dimension = get_dimension_size(lines[0])
    for l in lines:
        l_split = l.decode('utf8').strip().split()
        if len(l_split) == 2:
            # skip a "count dim" header line
            continue
        emb = l_split[-1*dimension:]  # use the dimension to mark the boundary
        word = l_split[:-1*dimension]
        word = ''.join(word)
        embs[word] = [float(em) for em in emb]
    print("Got {} embeddings from {}".format(len(embs), file))
    return embs
It may not be the most elegant way to solve the problem, but it works …
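For comparison, here is a minimal sketch of another way to find the same boundary, assuming the dimension is already known and the fields are separated by single ASCII spaces. It uses `str.rsplit` with a maximum split count, so everything left of the last `dimension` values is treated as the word. The function name `split_word_and_vector` is made up for this example and is not part of the script discussed above. Unlike the `''.join(word)` approach, it keeps the spaces inside the word.

```python
def split_word_and_vector(line, dimension):
    """Split one embedding line into (word, vector), keeping spaces inside the word."""
    # Split from the right at most `dimension` times: whatever remains on the
    # left is the word, even if it contains spaces itself.
    parts = line.rstrip("\n").rsplit(" ", dimension)
    if len(parts) != dimension + 1:
        raise ValueError("expected a word followed by {} values".format(dimension))
    word = parts[0]
    vector = [float(x) for x in parts[1:]]
    return word, vector


# A toy 3-dimensional line whose "word" is '. . .' (three tokens separated by spaces).
word, vector = split_word_and_vector(". . . 0.1 0.2 0.3\n", 3)
print(repr(word), vector)
```

A line with fewer than `dimension + 1` fields (such as a header line) raises `ValueError` here, so a caller would need to skip or catch those.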
Actually, I didn't mean the encoding of the vectors but the Unicode character on which the lines are split. In my case, for instance, I simply changed l_split = l.decode('utf8').strip().split() to l_split = l.decode('utf8').strip().split(' '). This was for fastText embeddings.
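To illustrate why this change matters: on a decoded string, `split()` with no argument splits on any Unicode whitespace, while `split(' ')` splits on the ASCII space only. The line below is invented for the example; it contains a no-break space (`\u00a0`) inside the word, which is one of the characters that could appear in such a file.

```python
# Made-up embedding line; \u00a0 is a no-break space inside the word.
line = "New\u00a0York 0.1 0.2 0.3\n"

default_split = line.strip().split()     # splits on any Unicode whitespace
space_split = line.strip().split(" ")    # splits on the ASCII space only

# With split(), the word is broken in two and the line has one token too many.
print(len(default_split), default_split[:2])
# With split(' '), the word stays in one piece.
print(len(space_split), space_split[:1])
```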
When you use pre-trained embeddings, they are loaded when the encoder/decoder is initialized, and their values are updated during training as long as you do not set the -fix_word_vecs_dec or -fix_word_vecs_enc options to true.
Hello @pltrdy,
Translating from language 1 to language 2 presupposes that you use word vectors for language 1 and word vectors for language 2. However, you are using the same GloVe file (containing words of only one language) to get the word vectors for the vocabularies of both languages, which does not look correct. Am I missing something?
That's correct, the script should allow passing two embedding files. In fact, I'm working on summarization, so the encoder and decoder use the same language.
Do not hesitate to suggest a pull request if you fix it.
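As a rough sketch of what such a fix could look like (this is not the script's actual code), each vocabulary can be matched against its own embedding dictionary. The helper `match_embeddings` and the toy dictionaries are hypothetical; in practice the two dictionaries would come from loading two different embedding files, one per language.

```python
def match_embeddings(vocab, embs):
    """Return the vectors found for `vocab` and the number of words without one.

    Assumes `vocab` contains no duplicate words.
    """
    matched = {w: embs[w] for w in vocab if w in embs}
    return matched, len(vocab) - len(matched)


# Toy stand-ins for the dictionaries loaded from two separate embedding files.
src_embs = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}       # source-language vectors
tgt_embs = {"chat": [0.5, 0.6], "chien": [0.7, 0.8]}    # target-language vectors

enc_vocab = ["cat", "dog", "bird"]   # encoder (source) vocabulary
dec_vocab = ["chat", "chien"]        # decoder (target) vocabulary

# Each side is matched against the embeddings of its own language.
enc_matched, enc_missing = match_embeddings(enc_vocab, src_embs)
dec_matched, dec_missing = match_embeddings(dec_vocab, tgt_embs)
print("encoder: {} matched, {} missing".format(len(enc_matched), enc_missing))
print("decoder: {} matched, {} missing".format(len(dec_matched), dec_missing))
```

For summarization, where both sides share a language, the same dictionary could simply be passed for both calls.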