How to use BPE to train on customer data

In general, how should I use BPE to train a Transformer?
My plan is to first learn a BPE model with another tool, then tokenize the training data with that BPE model, then run onmt-build-vocab on my own tokenized data set to build the vocabulary, and finally use this vocabulary with onmt-main to train the model.

Is there anything wrong with this method?

Or do you have an example of how to apply BPE to a customer dataset?

Yes, this approach works.

When you get new data, it is OK to tokenize it with the same BPE model and use the same vocabulary.
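
To make the workflow concrete, here is a minimal toy sketch in plain Python. It is not what any real BPE tool or onmt-build-vocab does internally; the function names (learn_bpe, apply_bpe, build_vocab), the "</w>" end-of-word marker, and the sample sentences are all made up for illustration. It only shows the order of the steps: learn the merges once, tokenize the training data, build the vocabulary from the tokenized data, and reuse the same merges and vocabulary for new data.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(lines, num_merges):
    """Learn an ordered list of merges from whitespace-separated text (toy version)."""
    words = Counter()
    for line in lines:
        for word in line.split():
            # Start from characters plus a hypothetical end-of-word marker.
            words[tuple(word) + ("</w>",)] += 1
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges

def apply_bpe(line, merges):
    """Tokenize one line by replaying the learned merges in order on each word."""
    tokens = []
    for word in line.split():
        symbols = tuple(word) + ("</w>",)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens

def build_vocab(tokenized_lines):
    """List the tokens seen in the tokenized data, most frequent first."""
    counts = Counter(tok for line in tokenized_lines for tok in line)
    return [tok for tok, _ in counts.most_common()]

def detokenize(tokens):
    """Undo the toy tokenization by joining symbols and turning markers into spaces."""
    return "".join(tokens).replace("</w>", " ").strip()

# 1. Learn the BPE merges once, on the training data.
train_lines = ["low lower lowest", "new newer newest", "wide wider widest"]
merges = learn_bpe(train_lines, num_merges=10)

# 2. Tokenize the training data with those merges.
train_tokens = [apply_bpe(line, merges) for line in train_lines]

# 3. Build the vocabulary from the tokenized training data.
vocab = build_vocab(train_tokens)

# 4. New data: reuse the same merges and the same vocabulary.
new_tokens = apply_bpe("lower newest", merges)
unknown = [tok for tok in new_tokens if tok not in vocab]
print(new_tokens, unknown)
```

The point of step 4 is that the merges are learned only once. If you relearned BPE on the new data, the segmentation would change and the tokens would no longer match the vocabulary the model was trained with.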
