Hi, I'm building a bilingual translation model (Transformer) with SentencePiece subword tokenization for both the source and target data, and with pretrained subword embeddings on the source side.
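For reference, here is roughly what my config.yaml looks like; the corpus paths, the embeddings file name, and the vector size are specific to my setup, so treat those as placeholders:

```yaml
# config.yaml (excerpt) -- file names below reflect my local layout
data:
    corpus_1:
        path_src: data/train.src
        path_tgt: data/train.tgt
    valid:
        path_src: data/valid.src
        path_tgt: data/valid.tgt

# SentencePiece models/vocabs trained beforehand, applied on the fly
src_subword_model: data/src_spm.model
tgt_subword_model: data/tgt_spm.model
src_vocab: data/src_spm.vocab
tgt_vocab: data/tgt_spm.vocab
transforms: [sentencepiece]

# pretrained subword embeddings for the source side
src_embeddings: data/src_embeddings.vec
embeddings_type: word2vec
word_vec_size: 300
```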
The system is not happy with a simple command like onmt_train -config config.yaml; it throws this error:

onmt_train: error: the following arguments are required: -src_vocab/--src_vocab

even though the config file correctly points at the vocab files generated by SentencePiece.
So I tried the command onmt_train -config config.yaml -src_vocab data/src_spm.vocab -tgt_vocab data/tgt_spm.vocab -gpu_ranks 0, but then I got this error:

AssertionError: -save_data should be set if use pretrained embeddings
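Reading that assertion, it seems the pretrained-embeddings code path also wants a save_data prefix in the config; I assume it means something like the following, where the data/processed prefix is just my guess:

```yaml
# My guess at what the assertion wants: a prefix under which
# OpenNMT can write its preprocessed artifacts, apparently
# required as soon as pretrained embeddings are configured
save_data: data/processed
```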
So I tried to rebuild the vocabularies with the onmt_build_vocab module: onmt_build_vocab -config config.yaml -n_sample 80000 -save_data data/processed -src_vocab data/src_spm.vocab -tgt_vocab data/tgt_spm.vocab, but again I got stuck, this time with:

raise IOError(f"path {path} exists, stop.")
OSError: path data/src_spm.vocab exists, stop.
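If I understand that IOError, onmt_build_vocab simply refuses to overwrite existing files, so the obvious workaround would be to point it at fresh output paths (the names below are made up), but that would generate brand-new vocabularies, which is exactly what I'm trying to avoid:

```yaml
# Hypothetical workaround: send build_vocab output to fresh paths
# so it doesn't collide with my existing SentencePiece files --
# but this builds new vocabs instead of reusing the SP ones
save_data: data/processed
src_vocab: data/processed.vocab.src
tgt_vocab: data/processed.vocab.tgt
```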
Given that I have already trained my SentencePiece models and vocabs, I should not need to run onmt_build_vocab to create separate vocabulary files again. The SentencePiece models (src_spm.model and tgt_spm.model) and their corresponding vocabularies (src_spm.vocab and tgt_spm.vocab) should suffice for training, right?
Any help welcome!
Thank you,
Tamatoa