-bpe_model (default: ‘’)
Apply Byte Pair Encoding if the BPE model path is given. If the option is used, BPE related options will be overridden/set automatically if the BPE model specified by -bpe_model is learnt using learn_bpe.lua.
I ran learn_bpe.lua on both source and target to produce ned_codes.txt and eng_codes.txt respectively. I then tokenized the training and validation data with these models and am currently training on that data. At inference, which of these models (source or target) do I pass to -bpe_model? Or have I strayed from the beaten track?
Use the src model on the src side and the tgt model on the tgt side.
Usually what we do is create a single model from the SRC and TGT corpora combined, so that there is only one model for both SRC and TGT.
It helps when some words are shared between the two languages.
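To make the "one model for both sides" idea concrete, here is a minimal toy sketch in Python of learning BPE merges from the concatenated corpus. It is not learn_bpe.lua: it skips end-of-word markers and every other option of the real script, and the function name `learn_bpe` and the sample sentences are made up for illustration.

```python
from collections import Counter


def learn_bpe(lines, num_merges):
    """Toy BPE learner: return up to num_merges merge operations.

    Simplified on purpose (no end-of-word marker, no frequency cut-off),
    so it only illustrates the idea, not learn_bpe.lua itself.
    """
    # Each word starts as a tuple of single characters, with its frequency.
    vocab = Counter()
    for line in lines:
        for word in line.split():
            vocab[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)

        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Hypothetical sample data: one joint model learnt from src + tgt together.
src_lines = ["de computer is nieuw"]
tgt_lines = ["the computer is new"]
joint_merges = learn_bpe(src_lines + tgt_lines, 10)
```

Because "computer" and "is" occur on both sides, their character pairs are counted twice in the joint corpus, so shared words tend to be merged early and end up segmented the same way in source and target.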
If you created two BPE models (one from the source corpus, one from the target corpus) for training, then I don't think you can switch to another model built from the concatenated corpus (src + tgt) at inference.
You have to use the BPE model built from the src corpus, since that is what the input side was tokenized with during training.
But yes, if you build one single model (from src + tgt) and train with it, then at inference, if you pass that model as an option to rest_translation_server, it will be used to tokenize the input.
Detokenization is not model related; it just joins the pieces (if I am not wrong).
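As a rough illustration of why no BPE model is needed for that step, here is a small Python sketch that glues pieces back together using only a joiner marker. It assumes the pieces are annotated with the "￭" joiner character on the side where they attach to a neighbour; the function name `detokenize` is made up here, and this is not the actual detokenizer code.

```python
JOINER = "\uffed"  # the "￭" character, assumed here to be the joiner marker


def detokenize(tokens, joiner=JOINER):
    """Join tokens into a string, gluing pieces wherever a joiner is present.

    A joiner at the start of a token attaches it to the previous token;
    a joiner at the end attaches it to the next one. No BPE codes are used.
    """
    out = []
    glue_next = False  # True when the previous token ended with a joiner
    for tok in tokens:
        glue_left = tok.startswith(joiner)
        glue_right = tok.endswith(joiner)
        text = tok.strip(joiner)
        # Insert a space unless this piece is glued to the previous one.
        if out and not (glue_left or glue_next):
            out.append(" ")
        out.append(text)
        glue_next = glue_right
    return "".join(out)


pieces = ["in", JOINER + "fer", JOINER + "ence", "works", JOINER + "."]
print(detokenize(pieces))  # prints: inference works.
```

The function never looks at a codes file, which is the point made above: the merge table matters when splitting the input, not when joining the output.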