I'm bumping this topic because I haven't found any solid help to answer this question.
I'm struggling to get decent results when training with pretrained subword embeddings (BPEmb).
Although my dataset is small, I think it is of very good quality, so I suspect I am doing something wrong along the way.
See below my config file.
On the source side I am using the BPEmb subword embeddings, truncated to 256 dimensions, together with the original vocab file that was built with the same tokenizer model; I also used that tokenizer model to tokenize the src-train.fr training corpus and the src-val.fr validation corpus. On the target side, I built a SentencePiece Unigram tokenization model and vocab file.
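For reference, this is roughly what that preprocessing looks like (a sketch using the sentencepiece Python package; the original 300-dim embedding filename, the ".sp" output suffix, and the target vocab_size are illustrative assumptions, not values from my actual setup):

```python
import sentencepiece as spm

# Tokenize the French corpora with the same SentencePiece model BPEmb was trained with.
sp = spm.SentencePieceProcessor(model_file="data/fr.wiki.bpe.vs200000.model")
for fname in ("src-train.fr", "src-val.fr"):
    with open(fname, encoding="utf-8") as fin, \
         open(fname + ".sp", "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")

# Train a Unigram SentencePiece model and vocab for the target side.
spm.SentencePieceTrainer.train(
    input="tgt-train.ty",
    model_prefix="data/tgt_spm",
    model_type="unigram",
    vocab_size=8000,  # illustrative value only
)

# Truncate the 300-dim BPEmb word2vec vectors to their first 256 components
# so they match word_vec_size / hidden_size in the config below.
with open("data/fr.wiki.bpe.vs200000.d300.w2v.txt", encoding="utf-8") as fin, \
     open("data/fr.wiki.bpe.vs200000.d300.w2v-256.txt", "w", encoding="utf-8") as fout:
    n_vecs, _dim = fin.readline().split()
    fout.write(f"{n_vecs} 256\n")
    for line in fin:
        parts = line.rstrip("\n").split(" ")
        fout.write(parts[0] + " " + " ".join(parts[1:257]) + "\n")
```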
I also tried to freeze the embeddings on the encoder side but to no avail.
When I run onmt_train -config config-transformer-base-1GPU.yaml -gpu_ranks 0 -freeze_word_vecs_enc, I see that the system is not using the 200,000-entry source vocabulary but only about 37,000 entries. Where and how does it find those 37,000 entries? All 200,000 vocabulary entries have embeddings, so why is it only using 37,000?
The OpenNMT-py documentation is not very user friendly and lacks case studies that use pretrained subword embeddings.
Can you please provide some guidance?
Configuration file for OpenNMT-py training and translation:
# Model architecture configuration
encoder_type: transformer
decoder_type: transformer
position_encoding: true
layers: 6
hidden_size: 256
heads: 8
transformer_ff: 2048
dropout_steps: [0]
dropout: [0.3]
attention_dropout: [0.3]
lora_dropout: 0.3
# Pretrained embeddings configuration for the source language
src_embeddings: data/fr.wiki.bpe.vs200000.d300.w2v-256.txt # Ensure this path is correct
embeddings_type: word2vec # Ensure this matches the format of your embeddings
word_vec_size: 256
# Optimization
optim: adam
adam_beta1: 0.9
adam_beta2: 0.998
decay_method: noam
learning_rate: 2.0
max_grad_norm: 0.0
normalization: tokens
param_init: 0.0
param_init_glorot: true
#position_encoding: false
#max_relative_positions: 20
model_dtype: "fp16"
# Batching
batch_size: 2048
batch_type: tokens
accum_count: 8
max_generator_batches: 2
# Tokenization options
#src_subword_type: bpe # Specify the tokenization method for the source side
#tgt_subword_type: sentencepiece # Specify the tokenization method for the target side
src_subword_model: data/fr.wiki.bpe.vs200000.model # Path to the BPEmb model
tgt_subword_model: data/tgt_spm.model # Path to the SentencePiece model
src_vocab: data/fr.wiki.bpe.vs200000.onmt_vocab # Path to the source vocabulary
tgt_vocab: data/tgt_spm.onmt_vocab # Path to the target vocabulary
# Training hyperparameters
save_model: run/model
keep_checkpoint: 20
save_checkpoint_steps: 1000
seed: -1
train_steps: 100000
valid_steps: 500
warmup_steps: 8000
report_every: 500
early_stopping: 5
early_stopping_criteria: accuracy
# TensorBoard configuration
tensorboard: true
tensorboard_log_dir: run/logs
# Error handling
on_error: raise
# Train on a single GPU
world_size: 1
gpu_ranks: [0]
# Path for saving data required by pretrained embeddings
save_data: data/processed
# Corpus opts:
data:
    corpus_1:
        path_src: src-train.fr
        path_tgt: tgt-train.ty
        transforms: [normalize, sentencepiece, filtertoolong]
        weight: 1
        src_lang: fr
        tgt_lang: ty
        norm_quote_commas: true
        norm_numbers: true
    valid:
        path_src: src-val.fr
        path_tgt: tgt-val.ty
        transforms: [normalize, sentencepiece, filtertoolong]
        src_lang: fr
        tgt_lang: ty
        norm_quote_commas: true
        norm_numbers: true
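In case it matters, the *.onmt_vocab files referenced above can be produced from a SentencePiece .vocab file (token and log-probability separated by a tab) with something like the following. This is only a sketch: the helper name is hypothetical, it assumes OpenNMT-py's plain token-tab-count vocabulary format, and it writes a dummy count of 1.

```python
def spm_vocab_to_onmt(spm_vocab_path, onmt_vocab_path):
    # SentencePiece writes "token<TAB>log-prob"; keep only the token,
    # skip SentencePiece's special tokens, and add a dummy count.
    specials = {"<unk>", "<s>", "</s>"}
    with open(spm_vocab_path, encoding="utf-8") as fin, \
         open(onmt_vocab_path, "w", encoding="utf-8") as fout:
        for line in fin:
            token = line.split("\t")[0]
            if token not in specials:
                fout.write(f"{token}\t1\n")

# Example for the target vocab trained above (path is an assumption).
spm_vocab_to_onmt("data/tgt_spm.vocab", "data/tgt_spm.onmt_vocab")
```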