BPE question concerning choice of models

(Terence Lewis) #1

-bpe_model (default: ‘’)
Apply Byte Pair Encoding if the BPE model path is given. If the option is used, BPE related options will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.

I ran learn_bpe.lua on both source & target to produce ned_codes.txt & eng_codes.txt respectively. I then tokenized the training & validation data with these models and am currently training with this data. When it comes to inference, which of these models (source or target) do I apply when I choose the option -bpe_model? Or have I strayed from the beaten track?

(Vincent Nguyen) #2

Use the src one with src and the tgt one with tgt.
Usually what we do is create a single model from the SRC and TGT corpora combined, so that there is only one model for both SRC and TGT.

it helps when some common words are shared.
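To make the joint-model idea concrete, here is a toy sketch (not OpenNMT code, and a deliberate simplification of what learn_bpe.lua does) of learning BPE merges from a combined source+target word list, so a single codes file can segment both sides; all names and the word frequencies are hypothetical:

```python
# Toy BPE sketch: learn merge operations from word frequencies, then
# apply the same merges to any word (source- or target-side alike).
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Greedily learn merge operations from a word -> frequency dict."""
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        vocab = {tuple(_merge(list(s), best)): f for s, f in vocab.items()}
    return merges

def apply_bpe(word, merges):
    """Segment a word by replaying previously learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = _merge(symbols, pair)
    return symbols

def _merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out
```

Because the merges come from the combined corpus, a subword shared by Dutch and English (say the prefix in "water"/"waterval") is merged once and reused on both sides, which is the benefit mentioned above.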

(Terence Lewis) #3

src one with src and tgt with tgt one.
Thanks. Yes, I did that prior to training, which is now going on.
But just so it’s clear in my mind: when it comes to inference (translation), should I create one bpe model from source & target and refer to that merged bpe model when using rest_translation_server.lua?

(Vincent Nguyen) #4

If you created 2 bpe models (one from the source corpus, one from the target corpus) for the training, then I don’t think you will be able to use another model made from the concatenated corpus (src + tgt).
You have to use the bpe model made of the src corpus.

But yes, if you build one single model (from src + tgt) and train with it, then at inference, if you pass the model as an option to rest_translation_server, it will be used to tokenize the input.

Detokenization is not model-related; it just joins the pieces (if I am not wrong).
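That point can be shown in a couple of lines: detokenization only joins the marked pieces back together and never consults a BPE model. The "@@ " separator used here is the subword-nmt convention, used purely for illustration (OpenNMT's tokenizer can use a different joiner marker):

```python
# BPE "detokenization" as pure string joining: remove the separator that
# marks where a word was split into subword pieces. No model is needed.
def bpe_detokenize(text, separator="@@"):
    """Undo BPE segmentation by gluing pieces marked with the separator."""
    return text.replace(separator + " ", "")
```

So whichever codes file produced the segmentation, the same joining step recovers the original words.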

(jean.senellart) #5

it is correct!