Question about the overview of the NMT pipeline

I am currently a bachelor’s student, trying my hand at NLP and MT.

I need some guidance on the steps necessary to train an NMT model. As far as I know, these are the steps:

  1. Tokenize both the source and target language data.
  2. Apply BPE to the tokenized data (see the sketch after this list).
  3. Preprocess the BPE data (build the vocabulary, word embeddings, etc.).
  4. Train the model (LSTM or Transformer).
  5. Translate and decode the BPE (merge the subwords back).

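For steps 2 and 5, this is roughly how I imagine the subword part would look with SentencePiece (just a sketch; the file names, vocab size, and sample sentence are placeholders I made up):

```python
import sentencepiece as spm

# Learn a BPE model on the training data (file name and vocab size are placeholders).
spm.SentencePieceTrainer.train(
    input="train.en",
    model_prefix="bpe_en",
    vocab_size=8000,
    model_type="bpe",
)

# Step 2: apply the model, turning a sentence into subword pieces.
sp = spm.SentencePieceProcessor()
sp.load("bpe_en.model")
pieces = sp.encode_as_pieces("The quick brown fox jumps over the lazy dog.")

# Step 5: after translation, merge the pieces back into plain text.
text = sp.decode_pieces(pieces)
print(pieces)
print(text)
```

From what I understand, SentencePiece restores the whitespace itself on decode, but if I pre-tokenize with a separate tokenizer first, I would still need to detokenize the final output, which is what my detokenization question below is about.
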
Please correct me if I am wrong.
Another question I have is: do I need to detokenize the translated output?

Many thanks in advance for the help.
Is there any blog that can guide me through the NMT pipeline process?

Hi there,
This post might be helpful to better understand the tokenization/subwords part: Using Sentencepiece/Byte Pair Encoding on Model

@francoishernandez thanks for your reply,
I have one more query: do I need to apply tokenization as well as BPE to the raw text, or is BPE alone enough to preprocess the data?

I think BPE requires some form of pretokenization to work best. If you use OpenNMT/Tokenizer, for instance, both steps can be handled quite easily.
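
Roughly along these lines with the Python wrapper (pyonmttok), if I remember the API correctly; the path, merge count, and sample sentence are just placeholders:

```python
import pyonmttok

# Pretokenization with joiner annotation, so the segmentation can be reversed later.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Learn BPE on top of the pretokenized training data (path and merge count are placeholders).
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
learner.ingest_file("train.en")
bpe_tokenizer = learner.learn("bpe.model")

# Apply pretokenization + BPE in one call.
tokens, _ = bpe_tokenizer.tokenize("The quick brown fox jumps over the lazy dog.")

# Reverse both steps on the translated output.
text = bpe_tokenizer.detokenize(tokens)
```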


@francoishernandez thanks for your time and help.