Hi, I'm building a bilingual translation model (Transformer) with SentencePiece subword tokenization for both the source and target data, and with pretrained subword embeddings on the source side.
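For reference, here is roughly what my config.yaml looks like; the corpus paths, the embeddings file name, and the vector size are specific to my setup, so treat those as placeholders:

```yaml
# config.yaml (excerpt) -- file names below reflect my local layout
data:
    corpus_1:
        path_src: data/train.src
        path_tgt: data/train.tgt
    valid:
        path_src: data/valid.src
        path_tgt: data/valid.tgt

# SentencePiece models/vocabs trained beforehand, applied on the fly
src_subword_model: data/src_spm.model
tgt_subword_model: data/tgt_spm.model
src_vocab: data/src_spm.vocab
tgt_vocab: data/tgt_spm.vocab
transforms: [sentencepiece]

# pretrained subword embeddings for the source side
src_embeddings: data/src_embeddings.vec
embeddings_type: word2vec
word_vec_size: 300
```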
The system is not happy with a simple command like onmt_train -config config.yaml; it throws this error:

onmt_train: error: the following arguments are required: -src_vocab/--src_vocab

even though the config file correctly points at the vocab files generated by SentencePiece.
So I tried the command onmt_train -config config.yaml -src_vocab data/src_spm.vocab -tgt_vocab data/tgt_spm.vocab -gpu_ranks 0, but then I got this error:

AssertionError: -save_data should be set if use pretrained embeddings
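Reading that assertion, it seems the pretrained-embeddings code path also wants a save_data prefix in the config; I assume it means something like the following, where the data/processed prefix is just my guess:

```yaml
# My guess at what the assertion wants: a prefix under which
# OpenNMT can write its preprocessed artifacts, apparently
# required as soon as pretrained embeddings are configured
save_data: data/processed
```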
So I tried to rebuild the vocabularies with the onmt_build_vocab module: onmt_build_vocab -config config.yaml -n_sample 80000 -save_data data/processed -src_vocab data/src_spm.vocab -tgt_vocab data/tgt_spm.vocab, but again I got stuck, this time with:

raise IOError(f"path {path} exists, stop.")
OSError: path data/src_spm.vocab exists, stop.
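If I understand that IOError, onmt_build_vocab simply refuses to overwrite existing files, so the obvious workaround would be to point it at fresh output paths (the names below are made up), but that would generate brand-new vocabularies, which is exactly what I'm trying to avoid:

```yaml
# Hypothetical workaround: send build_vocab output to fresh paths
# so it doesn't collide with my existing SentencePiece files --
# but this builds new vocabs instead of reusing the SP ones
save_data: data/processed
src_vocab: data/processed.vocab.src
tgt_vocab: data/processed.vocab.tgt
```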
Given that I have already trained my SentencePiece models and vocabs, I should not need to run onmt_build_vocab to create separate vocabulary files again. The SentencePiece models (src_spm.model and tgt_spm.model) and their corresponding vocabularies (src_spm.vocab and tgt_spm.vocab) should suffice for training, right?
Any help welcome!
Thank you,
Tamatoa