How to use GloVe pre-trained embeddings in OpenNMT-py

I’m referring to a script that is not merged yet; see PR#398.


Using vocabularies from OpenNMT-py preprocessing outputs, embeddings_to_torch.py generates encoder and decoder embeddings initialized with GloVe’s values.

The script is a slightly modified version of ylhsieh’s.

Usage:

usage: embeddings_to_torch.py [-h] -emb_file EMB_FILE -output_file OUTPUT_FILE
                              -dict_file DICT_FILE [-verbose]
  • emb_file: a GloVe-like embedding file, i.e. lines of the form [word] [dim1] ... [dim_d]
  • output_file: a filename prefix for the PyTorch serialized output tensors (the script writes <output_file>.enc.pt and <output_file>.dec.pt)
  • dict_file: the vocabulary file produced by OpenNMT-py preprocessing (e.g. data.vocab.pt)

Example

0) set some variables:

export data="../onmt_merge/sorted_tokens/"
export root="./glove_experiment"
export glove_dir="./glove"

1) get GloVe files:

mkdir "$glove_dir"
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip -d "$glove_dir"

2) prepare data:

  mkdir -p $root
  python preprocess.py \
      -train_src $data/train.src.txt \
      -train_tgt $data/train.tgt.txt \
      -valid_src $data/valid.src.txt \
      -valid_tgt $data/valid.tgt.txt \
      -save_data $root/data

3) prepare embeddings:

  ./tools/embeddings_to_torch.py -emb_file "$glove_dir/glove.6B.100d.txt" \
                                 -dict_file "$root/data.vocab.pt" \
                                 -output_file "$root/embeddings" 
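
(Optional) To sanity-check the result, you can load the serialized tensors and verify that their shapes match your vocabulary sizes and the embedding dimension (here 100 for glove.6B.100d). This assumes the script saved plain tensors with torch.save:

  import torch

  # Load the embedding tensors written in step 3 and print their shapes,
  # which should be (source_vocab_size, 100) and (target_vocab_size, 100).
  enc_emb = torch.load("./glove_experiment/embeddings.enc.pt")
  dec_emb = torch.load("./glove_experiment/embeddings.dec.pt")
  print("encoder embeddings:", enc_emb.size())
  print("decoder embeddings:", dec_emb.size())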

4) train using pre-trained embeddings:

  python train.py -save_model $root/model \
        -batch_size 64 \
        -layers 2 \
        -rnn_size 200 \
        -word_vec_size 100 \
        -pre_word_vecs_enc "$root/embeddings.enc.pt" \
        -pre_word_vecs_dec "$root/embeddings.dec.pt" \
        -data $root/data

Thank you for the tutorial!
I wanted to know: what happens to words that are not found in the pre-trained embeddings? Are they considered OOV/unk during training?

Not really. Words that aren’t found in the pre-trained embeddings just won’t be initialized, i.e. they get a 0-valued tensor.

In fact, I’m not sure what a different initialization would change. You should then let the model train the embeddings (i.e. not fix them) in order to fill the gaps.
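
To make that concrete, here is a rough sketch of the initialization idea (hypothetical helper, not the actual script):

  import torch

  # Sketch: every word starts as a 0-valued row; words found in GloVe get
  # their pre-trained vector copied in, the rest stay zero and are learned.
  def init_embeddings(vocab_words, glove, dim=100):
      weights = torch.zeros(len(vocab_words), dim)
      for i, word in enumerate(vocab_words):
          if word in glove:
              weights[i] = torch.FloatTensor(glove[word])
      return weights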


Thank you for the tutorial. On Step 3, you set the dictionary file as a parameter which ends in a ‘.pt’. After I run the preprocess.py script, I receive two dictionary files as output ending in ‘.dict’. These files have the following format:
you 4
the 5
to 6
a 7
of 8
I 9
that 10

What is the format of the lines within ‘data.vocab.pt’? Is there a way I can derive that file from the two .dict files I have?
Thanks

The vocab.pt file I’m referring to is an output of the pre-processing step (2); see the preprocessing command in the OpenNMT-py Quickstart.
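
If you just want to peek inside it, something like this should work (the exact structure of vocab.pt depends on your OpenNMT-py version, so treat it as a sketch):

  import torch

  # In the versions this thread refers to, vocab.pt holds (side, vocab) pairs,
  # where each vocab has itos (id -> word) and stoi (word -> id) mappings.
  vocabs = dict(torch.load("./glove_experiment/data.vocab.pt"))
  for side, vocab in vocabs.items():
      print(side, len(vocab.itos), "words")
      print(vocab.itos[:10])  # special tokens first, then frequent words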

Hi, why did I receive three .pt files, just as the Readme.md shows?

Hi pltrdy, I am quite interested in the .pt files generated by preprocess.py. What is the content of those files? What does training data mean here?

It’s ok if you received 3 files, as I stated.

Preprocessing constructs vocabularies (source and target) and creates numerical representations of the source/target data by mapping each word to its vocabulary id.

Then train.pt (resp. valid.pt) contains tensors that represent both source and target training (resp. validation) sequences. The vocabularies are built from the training dataset and stored in vocab.pt.

Thanks for the detailed explanation, pltrdy. But when you say “train.pt (resp. valid.pt) contains tensors that represent both source and target training (resp. validation) sequences”, the tensors here contain vocabulary ids, right?

Yes. The tensors do not contain text but integer (long) values corresponding to each word’s vocabulary id.
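
Conceptually, the mapping is just this (a toy illustration, not the actual preprocessing code):

  # Toy word -> id mapping; real vocabularies are built by preprocess.py.
  stoi = {"<unk>": 0, "<blank>": 1, "<s>": 2, "</s>": 3, "the": 4, "cat": 5, "sat": 6}
  sentence = "the cat sat on the mat".split()
  ids = [stoi.get(w, stoi["<unk>"]) for w in sentence]  # unknown words map to <unk>
  print(ids)  # [4, 5, 6, 0, 4, 0]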

Hey! I was wondering what happens when one does translation from, say, German->English and wants to use GloVe word embeddings. I couldn’t find pre-trained GloVe word embeddings for the German language.

Thanks

If I only want to use English word embeddings, how do I load them for my custom vocabulary? Say I have a vocabulary of 10000 English words; how do I load these weights to initialize the Embedding weight?

@anand there are no GloVe pre-trained embeddings for German, as far as I know.

@xiadingZ that’s exactly the purpose of this tutorial; following it step by step should do it. You can specify the vocabulary size in preprocessing by using the vocabulary-size flags: see https://github.com/OpenNMT/OpenNMT-py/blob/master/opts.py#L155
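
If you want to do it by hand outside OpenNMT-py, the core idea for a custom vocabulary is just this (a rough sketch; my_vocab, the file path and load_glove are made up for illustration):

  import torch
  import torch.nn as nn

  def load_glove(path):
      # Read a GloVe text file into a dict: word -> list of floats.
      vectors = {}
      with open(path, encoding="utf8") as f:
          for line in f:
              parts = line.rstrip().split(" ")
              vectors[parts[0]] = [float(x) for x in parts[1:]]
      return vectors

  # In practice this would be your 10000-word mapping (word -> index).
  my_vocab = {"the": 0, "cat": 1, "sat": 2}

  glove = load_glove("./glove/glove.6B.100d.txt")
  dim = 100
  weights = torch.zeros(len(my_vocab), dim)
  for word, idx in my_vocab.items():
      if word in glove:
          weights[idx] = torch.FloatTensor(glove[word])

  # freeze=False so the embeddings keep training and fill the gaps.
  embedding = nn.Embedding.from_pretrained(weights, freeze=False)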

preprocess.py requires -train_tgt, -valid_src and so on…
I only have a captions.txt, which contains captions line by line. I want to process it into a vocab (index-to-word or word-to-index mapping) and a corresponding Embedding weight. How do I process it? Can you give me an example, or tell me which opts I should set?

Hi pltrdy. Thank you for your contribution. By the way, I am wondering about the translation part: can we also make use of the embeddings there?

Hmm, I’m not sure I get your point.

The word embedding vectors are part of the model. Using GloVe at translation time does not really make sense; the model needs to be trained with it.

I am getting the following error; any idea how to fix it, please?

  (cheny) [cheny@elgato-login OpenNMT-py]$ ./tools/embeddings_to_torch.py -emb_file "/extra/cheny/glove.840B.300d.txt" -dict_file "/extra/cheny/gpu.vocab.pt" -output_file "data/grammar_checker/embeddings"
  From: /extra/cheny/gpu.vocab.pt
  * source vocab: 50002 words
  * target vocab: 50004 words
  Traceback (most recent call last):
    File "./tools/embeddings_to_torch.py", line 94, in <module>
      main()
    File "./tools/embeddings_to_torch.py", line 63, in main
      embeddings = get_embeddings(opt.emb_file)
    File "./tools/embeddings_to_torch.py", line 39, in get_embeddings
      embs[l_split[0]] = [float(em) for em in l_split[1:]]
    File "./tools/embeddings_to_torch.py", line 39, in <listcomp>
      embs[l_split[0]] = [float(em) for em in l_split[1:]]
  ValueError: could not convert string to float: '.'

I figured out what goes wrong now. It is caused by problematic lines in the pre-trained word-embedding file.
Let a word-embedding line be ‘l’.
get_embeddings(file) assumes every element of l[1:] is a numerical string (one that can be converted to float). This is not always true: sometimes l[1] or l[2] is ‘.’.

  def get_embeddings(file):
      embs = dict()
      for l in open(file, 'rb').readlines():
          l_split = l.decode('utf8').strip().split()
          if len(l_split) == 2:
              continue
          embs[l_split[0]] = [float(em) for em in l_split[1:]]
      print("Got {} embeddings from {}".format(len(embs), file))

      return embs

What are the best strategies to deal with this error? Fix the word-embedding file separately, or define extra steps in get_embeddings(file) to detect and fix or ignore the error vector on the fly?

Hmm, interesting. I guess the best option is to tweak the script, so that the original file remains unchanged.

It would be perfect if you opened a PR for this.

Thanks.
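
For example, a more tolerant get_embeddings could skip lines that fail to parse instead of crashing (just a sketch of the idea, not the merged fix):

  def get_embeddings(file):
      # Skip malformed lines (e.g. entries whose "word" itself contains spaces,
      # as in glove.840B.300d) instead of raising ValueError, and report the count.
      embs = dict()
      skipped = 0
      for l in open(file, 'rb').readlines():
          l_split = l.decode('utf8').strip().split()
          if len(l_split) == 2:
              continue
          try:
              embs[l_split[0]] = [float(em) for em in l_split[1:]]
          except ValueError:
              skipped += 1
      print("Got {} embeddings from {} ({} lines skipped)".format(len(embs), file, skipped))
      return embs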


@lucien0410 solved this by changing the unicode character used to split the line in l_split = l.decode('utf8').strip().split(); make sure the embeddings file uses the same unicode character to separate the vector components.

@pltrdy, can you check the issue I posted here? I even get good results after changing the provided embeddings_to_torch.py script.