OpenNMT Forum

Correct pipeline for NMT using BPEmb

I want to use a pretrained BPEmb model. I used SentencePiece to create training.en.vocab, which I use as the vocabulary file.
I downloaded the w2v files and included them as the source and target embedding files.

Is this the correct pipeline for data preprocessing and training?

OS: Windows 10

    model_dir: run/

    data:
      train_features_file: TrainEn.txt
      train_labels_file: trainBn.txt
      eval_features_file: trainDevEn.txt
      eval_labels_file: trainDevBn.txt
      source_vocabulary: training.en.vocab

    train:
      max_step: 5000
      batch_size: 40

    source_embedding:
      with_header: False
      case_insensitive: True

    target_embedding:
      with_header: False

I believe you should use BPEmb to tokenize your data, not SentencePiece.

Also, source_embedding and target_embedding should be placed in the data section, as shown in the documentation.
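For reference, the embedding blocks would be nested under data like this (the path values are placeholders, to be replaced with the actual downloaded w2v files):

```yaml
data:
  train_features_file: TrainEn.txt
  train_labels_file: trainBn.txt
  source_vocabulary: training.en.vocab
  source_embedding:
    path: path/to/en.w2v.txt   # placeholder: your downloaded English BPEmb w2v file
    with_header: False
    case_insensitive: True
  target_embedding:
    path: path/to/bn.w2v.txt   # placeholder: your downloaded Bengali BPEmb w2v file
    with_header: False
```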

How can I use BPEmb to tokenize data?
Actually, I am confused about the pipeline for data preprocessing using BPEmb.

There are examples in their README. If you need more help, you probably want to ask questions on the BPEmb repository.