OpenNMT Forum

How to use pre-trained BPEmb subword embeddings with latest versions of OpenNMT and OpenNMT-py?

Here is a link to BPEmb: GitHub - bheinzerling/bpemb: Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)

I tried searching for a solution on the internet, but the results only apply to older versions of OpenNMT and don’t work with the latest versions. Also, I am having trouble understanding the documentation. Concrete examples would be extremely helpful.

Thanks in advance!

I am not very familiar with the BPEmb code, but I guess you could export/convert these embeddings to the GloVe or word2vec format, both of which are supported as pretrained embeddings in OpenNMT-py.
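As a hedged sketch of that conversion: the word2vec *text* format is just a header line `vocab_size dim` followed by one `token v1 … vD` line per entry, so you can write it out directly. The toy vocab and the helper name `write_word2vec` below are my own stand-ins; with the real `bpemb` package you would presumably feed in its subword tokens and their vectors instead (check the BPEmb docs for the exact attribute names).

```python
def write_word2vec(path, words, vectors):
    """Write (token, vector) pairs in word2vec text format:
    header 'vocab_size dim', then one 'token v1 v2 ... vD' line each."""
    dim = len(vectors[0])
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"{len(words)} {dim}\n")  # header: vocab size and dimension
        for word, vec in zip(words, vectors):
            f.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")

# Toy subword vocab standing in for BPEmb output (note the SentencePiece
# "▁" word-boundary marker on word-initial pieces).
words = ["▁the", "▁to", "ing"]
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
write_word2vec("bpemb_subword.vec", words, vectors)
```

The resulting `.vec` file should then be usable wherever OpenNMT-py expects word2vec-format pretrained embeddings.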

Not sure which doc you are talking about, but there actually is a concrete example here:

Thanks for the response!

This is the documentation I am referring to (I should have specified it above). I tried what was described in this doc after converting to word2vec format, but the vocabulary size was two for some reason. Also, BPEmb uses a SentencePiece model for subword tokenization, whereas the example in the doc is based on word-level tokenization. So should I perform subword encoding separately using BPEmb and then look up the embeddings with BPEmb?

In previous versions, the Python script OpenNMT-py/ was used. It is not found in the latest versions.

The preprocessing step is no longer necessary since v2. (OpenNMT-py 2.0 release)
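To illustrate the v2-style workflow (a sketch with placeholder paths, not your exact commands): instead of a separate preprocessing script, you build the vocab and train directly from one config file.

```shell
# Build vocab straight from the config (replaces the old preprocess step);
# -n_sample limits how many lines are read for vocab building.
onmt_build_vocab -config config.yaml -n_sample 10000

# Train with the same config.
onmt_train -config config.yaml
```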
The doc in question was updated to reflect that.

Subword vs word actually doesn’t matter much. If you use subword tokenization and pass subword pretrained embeddings, it will work exactly the same as word tokenization with word pretrained embeddings.
If BPEmb requires a specific SentencePiece model, then you need to use that one. See this entry for on-the-fly tokenization.
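Putting the pieces together, a config along these lines should work (a sketch only: all paths, language pairs, and sizes below are placeholders, and the SentencePiece model files are assumed to be the ones downloaded with BPEmb):

```yaml
# On-the-fly subword tokenization with BPEmb's SentencePiece models
data:
    corpus_1:
        path_src: data/src-train.txt
        path_tgt: data/tgt-train.txt
        transforms: [sentencepiece]

# SentencePiece models shipped with BPEmb (placeholder paths)
src_subword_model: bpemb/en.wiki.bpe.vs10000.model
tgt_subword_model: bpemb/de.wiki.bpe.vs10000.model

# Pretrained subword embeddings converted to word2vec text format
src_embeddings: bpemb/en.vec
tgt_embeddings: bpemb/de.vec
embeddings_type: word2vec
word_vec_size: 100
```

With this, tokenization happens during training, and the pretrained subword vectors are matched against the subword vocab the same way word vectors would be against a word vocab.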

Maybe you should try some easier setup without BPEmb to get started and get your head around how it all works.

Thanks once again! I have figured it out without BPEmb; I think you have made the concepts clear to me. I will try it and let you know.