OpenNMT Forum

How to change embeddings into torch serialized tensors (GloVe embeddings)

I am having trouble using GloVe word embeddings. I generated the embeddings by following the authors’ scripts, and I know that they need to be torch serialized tensors if I want to use them for training. I found some tutorials on this forum that use preprocess.py to generate the vocabulary file and then embeddings_to_torch.py to produce the tensors. My problem is that I couldn’t find preprocess.py in the GitHub repository (https://github.com/OpenNMT/OpenNMT-py), and I have no idea how to convert the embeddings into tensors. Can someone help me understand how to do this?

preprocess.py no longer exists as of the 2.0 release.
Did you have a look at the updated entry in the FAQ?

So how can I create the tensors if preprocess.py doesn’t exist anymore?

It’s done on the fly here: https://github.com/OpenNMT/OpenNMT-py/blob/fa34132067aeb50339843d1a08f79e6597da3e32/onmt/bin/train.py#L37

Can you explain to me how to perform this step by step? I have my word embeddings in a .txt file, what do I do from here? Sorry for the stupid question, but I’m very new to this.

The steps are documented in the link I already provided: https://opennmt.net/OpenNMT-py/FAQ.html
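
For readers landing here, a rough sketch of what that FAQ entry amounts to, as far as I recall it: the GloVe .txt file is referenced directly from the training YAML, and the word vector size has to match the dimension of the pretrained vectors. The option names below (both_embeddings, embeddings_type, word_vec_size) are quoted from memory of the 2.x FAQ and should be checked against it. The file name glove_sample.txt and the two helper functions are hypothetical, made up for illustration only.

```python
from pathlib import Path


def glove_dimension(path):
    """Return the vector size of a GloVe-style text file.

    Assumes each line is a token followed by space-separated floats.
    """
    with open(path, encoding="utf-8") as handle:
        first = handle.readline().split()
    # Everything after the token is one vector component.
    return len(first) - 1


def embedding_options(path):
    """Build the embedding-related lines of a training config.

    Option names are assumed from the OpenNMT-py 2.x FAQ; verify them there.
    """
    dim = glove_dimension(path)
    return "\n".join([
        f"both_embeddings: {path}",
        'embeddings_type: "GloVe"',
        f"word_vec_size: {dim}",
    ])


if __name__ == "__main__":
    # Tiny stand-in file so the sketch runs without real GloVe vectors.
    sample = Path("glove_sample.txt")
    sample.write_text("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n", encoding="utf-8")
    print(embedding_options(sample))
```

The printed lines would be pasted into the same YAML file that describes the corpora; nothing has to be converted to tensors by hand.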

I managed to download OpenNMT-py 2.0, but I ran into another problem. This is the trace:

Traceback (most recent call last):
  File "/…/…/envNMT/bin/onmt_train", line 8, in <module>
    sys.exit(main())
  File "/…/…/envNMT/lib/python3.7/site-packages/onmt/bin/train.py", line 169, in main
    train(opt)
  File "/…/…/envNMT/lib/python3.7/site-packages/onmt/bin/train.py", line 103, in train
    checkpoint, fields, transforms_cls = _init_train(opt)
  File "/…/…/envNMT/lib/python3.7/site-packages/onmt/bin/train.py", line 58, in _init_train
    ArgumentParser.validate_prepare_opts(opt)
  File "/…/…/envNMT/lib/python3.7/site-packages/onmt/utils/parse.py", line 131, in validate_prepare_opts
    cls._validate_data(opt)
  File "/…/…/envNMT/lib/python3.7/site-packages/onmt/utils/parse.py", line 29, in _validate_data
    for cname, corpus in corpora.items():
AttributeError: 'str' object has no attribute 'items'

Can someone help me to understand what went wrong?

You need to make a valid yaml configuration, as shown in the docs: https://opennmt.net/OpenNMT-py/quickstart.html#step-1-prepare-the-data
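
The trace fails at corpora.items(), which suggests that the value of data reached the parser as a plain string (a path, for example) rather than a mapping of named corpora. Here is a minimal sketch of that check in plain Python. It is not OpenNMT-py's actual code: check_corpora and the file paths are hypothetical, while the corpus_1 / valid / path_src / path_tgt layout follows the quickstart linked above.

```python
def check_corpora(corpora):
    """Raise a readable error unless `corpora` maps names to path dicts."""
    if not isinstance(corpora, dict):
        raise TypeError(
            f"'data' must be a mapping of corpus names, "
            f"got {type(corpora).__name__}"
        )
    for cname, corpus in corpora.items():
        for key in ("path_src", "path_tgt"):
            if key not in corpus:
                raise KeyError(f"corpus '{cname}' is missing '{key}'")
    return sorted(corpora)


# Shape of the `data` section once a quickstart-style YAML is loaded.
good = {
    "corpus_1": {"path_src": "src-train.txt", "path_tgt": "tgt-train.txt"},
    "valid": {"path_src": "src-val.txt", "path_tgt": "tgt-val.txt"},
}
print(check_corpora(good))

# A bare string has no .items(), which is the failure seen in the trace.
try:
    "data/demo".items()
except AttributeError as err:
    print(err)
```

If the second case looks familiar, the likely fix is to move the corpus paths under a data: block in the YAML file, one named entry per corpus, instead of passing a single path.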