How to use pre-trained BPEmb subword embeddings with latest versions of OpenNMT and OpenNMT-py?

Here is a link to BPEmb: GitHub - bheinzerling/bpemb: Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)

I tried searching for a solution on the internet, but the ones I found only apply to older versions of OpenNMT and don’t work with the latest versions. Also, I am having trouble understanding the documentation. Concrete examples would be extremely helpful.

Thanks in advance!

I am not very familiar with this BPEmb code, but I guess you could export/convert these embeddings to GloVe or word2vec format, both of which are supported as pretrained embedding formats in OpenNMT-py.
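If it helps, the BPEmb Python package exposes the vectors as a gensim KeyedVectors object, so the export could look roughly like this (untested sketch, assuming the bpemb and gensim packages; the output file name is just an example):

```python
# Rough sketch: load the pretrained French BPEmb vectors (200k vocab, 300 dims)
# and re-save them in the plain-text word2vec format that OpenNMT-py accepts
# via src_embeddings / embeddings_type: word2vec.
from bpemb import BPEmb

bpemb_fr = BPEmb(lang="fr", vs=200000, dim=300)  # downloads model + vectors on first use

# bpemb_fr.emb is a gensim KeyedVectors instance, so it can be written out
# directly in the standard word2vec text format.
bpemb_fr.emb.save_word2vec_format("fr.wiki.bpe.vs200000.d300.w2v.txt", binary=False)
```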

Not sure which doc you are talking about, but there actually is a concrete example here:
https://opennmt.net/OpenNMT-py/FAQ.html#how-do-i-use-pretrained-embeddings-e-g-glove

Thanks for the response!

This is the documentation I was referring to (I should have specified it above). I tried what is described in this doc after converting to word2vec format, but the vocabulary size was two for some reason. Also, BPEmb uses a SentencePiece model to perform subword tokenization (the example in the doc is based on word-level tokenization). So should I perform the subword encoding separately with BPEmb and then use the BPEmb embeddings?

In previous versions, the Python script OpenNMT-py/preprocess.py was used, but it is not found in the latest versions.

The preprocessing step is no longer necessary since v2. (OpenNMT-py 2.0 release)
The doc in question was updated to reflect that.

Subword vs word actually doesn’t matter much. If you use subword tokenization and pass subword pretrained embeddings, it will work exactly the same as word tokenization with word pretrained embeddings.
If BPEmb requires a specific SentencePiece model, then you need to use that one. See this entry for on-the-fly tokenization.
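As a quick sanity check, you could tokenize a sample sentence with the BPEmb SentencePiece model and verify that every subword actually has a pretrained vector, so the vocabulary and the embeddings line up (rough sketch, assuming the sentencepiece package and the BPEmb file names used earlier in this thread):

```python
# Sketch: check that the subwords produced by the BPEmb SentencePiece model
# are all present in the word2vec embeddings file.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="fr.wiki.bpe.vs200000.model")

with open("fr.wiki.bpe.vs200000.d300.w2v.txt", encoding="utf-8") as f:
    next(f)  # skip the "<vocab_size> <dim>" header line
    emb_vocab = {line.split(" ", 1)[0] for line in f}

tokens = sp.encode("Bonjour tout le monde", out_type=str)
print(tokens)
print([token in emb_vocab for token in tokens])  # should be all True
```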

Maybe you should try some easier setup without BPEmb to get started and get your head around how it all works.

Thanks once again! I have figured it out without BPEmb; I think you have made the concepts clear to me. Will try and let you know.

I'm bumping this topic as I can't find any solid help to answer this question.

I'm struggling to get decent results when training with pretrained subword embeddings (BPEmb).
I think my data, although small, is of very good quality, so I might be doing something wrong along the way.
See below my config file.
For the source side I am leveraging the BPEmb subword embeddings truncated to 256 dimensions, along with the original vocab file that was built with the same tokenizer model; I also used that tokenizer model to tokenize my entire src-train.fr corpus and the src-val.fr validation corpus. For the target side, I built a SentencePiece Unigram subword tokenization model and vocab file.
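For reference, the truncation step was done roughly like this (sketch, assuming gensim 4.x; I simply kept the first 256 of the 300 dimensions, and the file names match the config below):

```python
# Sketch of the truncation: keep the first 256 of the 300 BPEmb dimensions
# and save the result back in word2vec text format.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("fr.wiki.bpe.vs200000.d300.w2v.txt", binary=False)

truncated = KeyedVectors(vector_size=256)
truncated.add_vectors(list(kv.index_to_key), kv.vectors[:, :256])
truncated.save_word2vec_format("data/fr.wiki.bpe.vs200000.d300.w2v-256.txt", binary=False)
```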
I also tried to freeze the embeddings on the encoder side but to no avail.
When I run onmt_train -config config-transformer-base-1GPU.yaml -gpu_ranks 0 -freeze_word_vecs_enc, I see that the system is not using the 200,000-entry vocabulary but only 37,000 entries for the source. Where and how does it find those 37,000 entries? All 200,000 vocabulary entries have their embeddings, so why is it only using 37,000?
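To rule out a problem with the files themselves, I can count the entries in the vocab file and in the embeddings file to see where the 37,000 figure might be coming from (quick sketch using the paths from my config below):

```python
# Quick check: compare the number of entries in the vocab file passed as
# src_vocab with the number of vectors in the pretrained embeddings file.
def count_lines(path):
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

print("src_vocab entries:", count_lines("data/fr.wiki.bpe.vs200000.onmt_vocab"))
# the first line of a word2vec text file is the "<vocab_size> <dim>" header
print("embedding vectors:", count_lines("data/fr.wiki.bpe.vs200000.d300.w2v-256.txt") - 1)
```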

The OpenNMT-py documentation is really not user-friendly and lacks case studies with subword pretrained embeddings.

Can you please provide some guidance?

Configuration file for OpenNMT-py training and translation:


# Model architecture configuration

encoder_type: transformer
decoder_type: transformer
position_encoding: true
layers: 6
hidden_size: 256
heads: 8
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.3]
attention_dropout: [0.3]
lora_dropout: 0.3

# Pretrained embeddings configuration for the source language

src_embeddings: data/fr.wiki.bpe.vs200000.d300.w2v-256.txt # Ensure this path is correct
embeddings_type: word2vec # Ensure this matches the format of your embeddings
word_vec_size: 256

# Optimization

optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
normalization: tokens
param_init: 0.0
param_init_glorot: true
#position_encoding: false
#max_relative_positions: 20
model_dtype: "fp16"

# Batching

batch_size: 2048
batch_type: tokens
accum_count: 8
max_generator_batches: 2

# Tokenization options

#src_subword_type: bpe # Specify the tokenization method for the source side
#tgt_subword_type: sentencepiece # Specify the tokenization method for the target side
src_subword_model: data/fr.wiki.bpe.vs200000.model # Path to the BPEmb model
tgt_subword_model: data/tgt_spm.model # Path to the SentencePiece model
src_vocab: data/fr.wiki.bpe.vs200000.onmt_vocab # Path to the source vocabulary
tgt_vocab: data/tgt_spm.onmt_vocab # Path to the target vocabulary

# Training hyperparameters

save_model: run/model
keep_checkpoint: 20
save_checkpoint_steps: 1000
seed: -1
train_steps: 100000
valid_steps: 500
warmup_steps: 8000
report_every: 500
early_stopping: 5
early_stopping_criteria: accuracy

# TensorBoard configuration

tensorboard: true
tensorboard_log_dir: run/logs

# Error handling

on_error: raise

# Train on a single GPU

world_size: 1
gpu_ranks: [0]

# Path for saving data required by pretrained embeddings

save_data: data/processed

# Corpus opts

data:
    corpus_1:
        path_src: src-train.fr
        path_tgt: tgt-train.ty
        transforms: [normalize, sentencepiece, filtertoolong]
        weight: 1
        src_lang: fr
        tgt_lang: ty
        norm_quote_commas: true
        norm_numbers: true
    valid:
        path_src: src-val.fr
        path_tgt: tgt-val.ty
        transforms: [normalize, sentencepiece, filtertoolong]
        src_lang: fr
        tgt_lang: ty
        norm_quote_commas: true
        norm_numbers: true