Preprocessing constructs the vocabularies (source and target) and creates a numerical representation of the source and target data by mapping each word to its corresponding vocabulary id.
Then train.pt (resp. valid.pt) contains tensors that represent both the source and target training (resp. validation) sequences. The vocabularies are built from the training dataset and stored in vocab.pt.
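As a rough illustration of what this step does conceptually (this is not OpenNMT-py's actual code; the function names build_vocab and numericalize and the special tokens are assumptions made for the sketch), a vocabulary built from the training data maps each word to an integer id, and each sentence then becomes a list of ids:

```python
from collections import Counter

# Special tokens are assumptions for this sketch; the real names may differ.
UNK, PAD = "<unk>", "<blank>"

def build_vocab(sentences, max_size=50000):
    """Map the most frequent words in the training sentences to integer ids."""
    counts = Counter(word for sent in sentences for word in sent.split())
    itos = [UNK, PAD] + [w for w, _ in counts.most_common(max_size)]
    return {w: i for i, w in enumerate(itos)}

def numericalize(sentence, stoi):
    """Replace each word by its vocabulary id; unseen words map to UNK."""
    return [stoi.get(w, stoi[UNK]) for w in sentence.split()]

# Toy source-side training data (hypothetical).
train_src = ["das ist gut", "das ist ein test"]
src_vocab = build_vocab(train_src)

# "neu" never appears in the training data, so it gets the UNK id.
ids = numericalize("das ist neu", src_vocab)
```

The same two functions would be applied separately to the target side, which is why there are two vocabularies.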
Thanks for the detailed explanation, pltrdy. But when you say “the train.pt (resp valid.pt) contain tensors that represents both source and target training (resp. valid) sequences”, the tensors here hold the vocabulary ids, right?
Hey! I was wondering what happens when one translates from, say, German->English and wants to use GloVe word embeddings. I couldn’t find pretrained GloVe word embeddings for the German language.
If I only want to use English word embeddings, how do I load them for my custom vocabulary? Say I have a vocabulary of 10000 English words: how do I load these weights to initialize the Embedding weight?
preprocess.py requires -train_tgt, -valid_src and so on…
What if I only have a captions.txt that contains captions line by line? I want to turn it into a vocab (an index-to-word or word-to-index mapping) and the corresponding Embedding weight. How should I process it? Can you give me an example, or tell me which opts I should set?
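One possible approach, sketched here without relying on preprocess.py: build the word-to-index mapping directly from the captions, then fill a weight matrix row by row from the pretrained vectors, leaving small random values for words that have no pretrained vector. The helper names (build_vocab_from_lines, init_weights) and the tiny in-memory data are hypothetical. With PyTorch, the resulting matrix could be converted to a tensor and passed to torch.nn.Embedding.from_pretrained.

```python
import random
from collections import Counter

def build_vocab_from_lines(lines, max_size=10000):
    """Return (word_to_index, index_to_word) for the most frequent words."""
    counts = Counter(w for line in lines for w in line.split())
    index_to_word = ["<unk>"] + [w for w, _ in counts.most_common(max_size)]
    word_to_index = {w: i for i, w in enumerate(index_to_word)}
    return word_to_index, index_to_word

def init_weights(index_to_word, pretrained, dim, seed=0):
    """One row per vocabulary word: pretrained vector if known, else random."""
    rng = random.Random(seed)
    weights = []
    for word in index_to_word:
        if word in pretrained:
            weights.append(list(pretrained[word]))
        else:
            weights.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return weights

# Stand-ins for captions.txt and a GloVe-style file (hypothetical data).
captions = ["a dog runs", "a cat sleeps"]
pretrained = {"a": [0.1, 0.2], "dog": [0.3, 0.4]}

word_to_index, index_to_word = build_vocab_from_lines(captions)
weights = init_weights(index_to_word, pretrained, dim=2)
```

In a real run, captions would be the lines read from captions.txt and pretrained would be the dictionary parsed from the embeddings file.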
I am getting the following error, any idea how to fix it please?
(cheny) [cheny@elgato-login OpenNMT-py]$ ./tools/embeddings_to_torch.py -emb_file "/extra/cheny/glove.840B.300d.txt" -dict_file "/extra/cheny/gpu.vocab.pt" -output_file "data/grammar_checker/embeddings"
From: /extra/cheny/gpu.vocab.pt
* source vocab: 50002 words
* target vocab: 50004 words
Traceback (most recent call last):
  File "./tools/embeddings_to_torch.py", line 94, in <module>
    main()
  File "./tools/embeddings_to_torch.py", line 63, in main
    embeddings = get_embeddings(opt.emb_file)
  File "./tools/embeddings_to_torch.py", line 39, in get_embeddings
    embs[l_split[0]] = [float(em) for em in l_split[1:]]
  File "./tools/embeddings_to_torch.py", line 39, in <listcomp>
    embs[l_split[0]] = [float(em) for em in l_split[1:]]
ValueError: could not convert string to float: '.'
I have figured out what goes wrong. It is caused by a quirk in the pre-trained word-embedding file.
Let l be one line of the word-embedding file, split into fields (l_split in the code).
get_embeddings(file) assumes that every element of l[1:] is a numerical string (one that can be converted with float()). This is not always true: sometimes l[1] or l[2] is '.'.
> def get_embeddings(file):
>     embs = dict()
>     for l in open(file, 'rb').readlines():
>         l_split = l.decode('utf8').strip().split()
>         if len(l_split) == 2:
>             continue
>         embs[l_split[0]] = [float(em) for em in l_split[1:]]
>     print("Got {} embeddings from {}".format(len(embs), file))
>
>     return embs
What is the best strategy to deal with this error: fix the word-embedding file separately, or add extra steps to get_embeddings(file) that detect and then fix or ignore the faulty vectors on the fly?
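The "ignore on the fly" option could look roughly like the sketch below. It assumes the vector dimension is known in advance (300 for glove.840B.300d) and simply skips any line that does not consist of exactly one token plus dim numbers, so tokens that contain spaces are dropped rather than repaired. The function name load_embeddings_skip_bad is made up for this example.

```python
import io

def load_embeddings_skip_bad(lines, dim):
    """Parse 'word v1 ... v_dim' lines, skipping lines of any other shape."""
    embs, skipped = {}, 0
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) != dim + 1:
            skipped += 1  # header line, or a token that itself contains spaces
            continue
        try:
            embs[parts[0]] = [float(x) for x in parts[1:]]
        except ValueError:
            skipped += 1  # right length but a non-numeric field
    return embs, skipped

# Tiny in-memory stand-in for an embeddings file with dim=3;
# the second line has the token ". . ." which contains spaces.
sample = io.StringIO("the 0.1 0.2 0.3\n. . . 0.4 0.5 0.6\ncat 0.7 0.8 0.9\n")
embs, skipped = load_embeddings_skip_bad(sample, dim=3)
```

In a real run, an open file handle (opened with encoding='utf8') would be passed instead of the StringIO object. The trade-off is that the skipped tokens get no pretrained vector at all.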
@lucien0410 I solved this by changing the character on which each line is split in l_split = l.decode('utf8').strip().split(). Make sure it is the same Unicode character that the embeddings file uses to separate the vector components.
@pltrdy can you check the issue I posted here? I even get good results after changing the provided embeddings_to_torch.py script?
Now the length is not right (303 instead of 301), and the second and third elements are '.', which cannot be converted to float.
@pltrdy
I figure we may find the number of dimensions by counting how many of the fields can be converted to float.
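def get_dimension_size(line):
    # Count how many whitespace-separated fields of the (bytes) line parse as floats.
    # Note: a word that is itself a number would be counted too.
    size = 0
    l_split = line.decode('utf8').strip().split()
    for i in l_split:
        try:
            float(i)
            size += 1
        except ValueError:
            pass
    return size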
Then the size indicates where the boundary between the word and the numbers should be:
def get_embeddings(file):
    embs = dict()
    # Look at the first line to get the dimension. Read it as bytes, because
    # get_dimension_size() decodes it. This assumes the first line is a
    # regular entry, not a header.
    with open(file, 'rb') as f:
        dimension = get_dimension_size(f.readline())
    with open(file, 'rb') as f:
        for l in f:
            l_split = l.decode('utf8').strip().split()
            if len(l_split) == 2:
                continue
            emb = l_split[-dimension:]  # use the dimension to mark the boundary
            word = ''.join(l_split[:-dimension])  # rejoin the pieces of the word (no spaces)
            embs[word] = [float(em) for em in emb]
    print("Got {} embeddings from {}".format(len(embs), file))
    return embs
May not be the most elegant way to solve the problem, but it works …
Actually, I didn’t mean the encoding of the vectors but the Unicode character on which the lines are split. In my case, for instance, I simply changed l_split = l.decode('utf8').strip().split() to l_split = l.decode('utf8').strip().split(' '). That was for fastText embeddings.
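One plausible reason this change helps: str.split() with no argument splits on any Unicode whitespace, including characters such as the non-breaking space (U+00A0), whereas split(' ') splits only on the ASCII space. A small illustration with a made-up embedding line whose token contains a non-breaking space:

```python
# Made-up line: the token "New\u00a0York" contains a non-breaking space.
line = "New\u00a0York 0.1 0.2 0.3\n"

any_whitespace = line.strip().split()   # splits on U+00A0 as well
ascii_space = line.strip().split(" ")   # splits on ' ' only

print(len(any_whitespace))  # 5: the token was cut in two
print(len(ascii_space))     # 4: token plus three vector components
```

Note that split(' ') produces empty strings when two spaces are adjacent, so it only works cleanly if the file uses exactly one space between fields.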
When you use pre-trained embeddings, they are loaded when the encoder/decoder is initialized, and their values are updated during training unless you set the -fix_word_vecs_dec or -fix_word_vecs_enc options.
Hello @pltrdy ,
Translating from language 1 to language 2 presupposes word vectors for language 1 and word vectors for language 2. However, you are using the same GloVe file (containing words of only one language) to get the word vectors for the vocabularies of both languages, which does not look correct. Am I missing something?