Correct pipeline for NMT using BPEmb

shan778 · June 17, 2020, 6:03am

I want to use a pretrained BPEmb model. I used SentencePiece to create training.en.vocab and training.bn.vocab, which I use as the vocabulary files.
I also downloaded the W2V files and set them as the source and target embedding files.

Is this the correct pipeline for data preprocessing and training?

OS: Windows 10

model_dir: run/

data:
  train_features_file: TrainEn.txt
  train_labels_file: trainBn.txt
  eval_features_file: trainDevEn.txt
  eval_labels_file: trainDevBn.txt
  source_vocabulary: training.en.vocab
  target_vocabulary: training.bn.vocab

train:
  max_step: 5000
  batch_size: 40

source_embedding:
  path: en.wiki.bpe.vs25000.d300.w2v.txt
  with_header: False
  case_insensitive: True

target_embedding:
  path: bn.wiki.bpe.vs25000.d300.w2v.txt
  with_header: False

guillaumekln · June 17, 2020, 6:53am

I believe you should use BPEmb to tokenize your data, not SentencePiece.
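
A rough way to check whether a vocabulary actually lines up with the pretrained vectors is to count how many vocabulary tokens also appear in the embedding file. The sketch below is only illustrative: the helper names are made up for this example, and it assumes the vocabulary file has one token per line (optionally followed by a tab and a score, as in SentencePiece .vocab files) and that the embedding file is in word2vec text format (token followed by space-separated values, with an optional "count dim" first line).

```python
def read_vocab(path):
    """Return the set of tokens in a vocabulary file.

    Assumption: one token per line, optionally followed by a tab and a score.
    """
    tokens = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line:
                tokens.add(line.split("\t")[0])
    return tokens


def read_embedding_tokens(path):
    """Return the set of tokens in a word2vec-style text embedding file."""
    tokens = set()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            fields = line.rstrip("\n").split(" ")
            # Skip an optional "count dim" header on the first line.
            if i == 0 and len(fields) == 2 and all(x.isdigit() for x in fields):
                continue
            if fields[0]:
                tokens.add(fields[0])
    return tokens


def coverage(vocab_path, embedding_path):
    """Fraction of vocabulary tokens that have a pretrained vector."""
    vocab = read_vocab(vocab_path)
    if not vocab:
        return 0.0
    return len(vocab & read_embedding_tokens(embedding_path)) / len(vocab)
```

For example, `coverage("training.en.vocab", "en.wiki.bpe.vs25000.d300.w2v.txt")` close to 1.0 would suggest the vocabulary and the embeddings share the same subword units; a low value would suggest most tokens have no matching pretrained vector.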

Also source_embedding and target_embedding should be in the data section, as presented in the documentation: https://opennmt.net/OpenNMT-tf/embeddings.html

shan778 · June 18, 2020, 7:09am

How can I use BPEmb to tokenize the data?
Actually, I am confused about the data preprocessing pipeline when using BPEmb.

guillaumekln · June 18, 2020, 8:28am

There are examples in their README. If you need more help, you probably want to ask questions on the BPEmb repository.
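
For reference, a minimal sketch of what tokenizing the training files with BPEmb might look like, based on the `BPEmb(lang=..., vs=..., dim=...)` constructor and `encode` method shown in the BPEmb README. The helper functions and the `.bpe.txt` output file names are hypothetical, chosen only for this example. The vocabulary size and dimension are taken from the embedding file names in the config above.

```python
def tokenize_lines(lines, encode):
    """Split each line into subwords with `encode` and join them with spaces."""
    return [" ".join(encode(line.strip())) for line in lines]


def tokenize_file(src_path, dst_path, encode):
    """Write a space-separated subword version of src_path to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        tokenized = tokenize_lines(src, encode)
    with open(dst_path, "w", encoding="utf-8") as dst:
        for line in tokenized:
            dst.write(line + "\n")


def main():
    # Requires `pip install bpemb`; the models are downloaded on first use.
    from bpemb import BPEmb

    # vs=25000 and dim=300 match en/bn.wiki.bpe.vs25000.d300.w2v.txt.
    bpemb_en = BPEmb(lang="en", vs=25000, dim=300)
    bpemb_bn = BPEmb(lang="bn", vs=25000, dim=300)

    # Output names are hypothetical; use them as the train/eval files in the config.
    tokenize_file("TrainEn.txt", "TrainEn.bpe.txt", bpemb_en.encode)
    tokenize_file("trainBn.txt", "trainBn.bpe.txt", bpemb_bn.encode)
    tokenize_file("trainDevEn.txt", "trainDevEn.bpe.txt", bpemb_en.encode)
    tokenize_file("trainDevBn.txt", "trainDevBn.bpe.txt", bpemb_bn.encode)
```

Calling `main()` would produce tokenized copies of the four data files. Note that BPEmb applies its own preprocessing before segmenting, so the output is not identical to what a separately trained SentencePiece model would give.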