Question about the overview of the NMT pipeline

I am currently a bachelor’s student, trying my hand at NLP and MT.

I need some guidance on the steps necessary to train an NMT model. As far as I know, these are the steps:

  1. Tokenize both the source and target language data.
  2. Apply BPE to the tokenized data (see the sketch after this list).
  3. Preprocess the BPE data (build the vocabulary, word embeddings, etc.).
  4. Train the model (LSTM or Transformer).
  5. Translate and decode the BPE (merge the subwords back).

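For steps 2 and 5, this is roughly how I imagine the subword part would look with SentencePiece (just a sketch; the file names, vocab size, and sample sentence are placeholders I made up):

```python
import sentencepiece as spm

# Learn a BPE model on the training data (file name and vocab size are placeholders).
spm.SentencePieceTrainer.train(
    input="train.en",
    model_prefix="bpe_en",
    vocab_size=8000,
    model_type="bpe",
)

# Step 2: apply the model, turning a sentence into subword pieces.
sp = spm.SentencePieceProcessor()
sp.load("bpe_en.model")
pieces = sp.encode_as_pieces("The quick brown fox jumps over the lazy dog.")

# Step 5: after translation, merge the pieces back into plain text.
text = sp.decode_pieces(pieces)
print(pieces)
print(text)
```

From what I understand, SentencePiece restores the whitespace itself on decode, but if I pre-tokenize with a separate tokenizer first, I would still need to detokenize the final output, which is what my detokenization question below is about.
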
Please correct me if I am wrong.
Another question I have is: do I need to detokenize the translated output?

Many thanks in advance for the help.
Is there any blog that can guide me through the NMT pipeline process?

Hi there,
This post might be helpful to better understand the tokenization/subwords part: Using Sentencepiece/Byte Pair Encoding on Model

@francoishernandez thanks for your reply,
I have one more query: do I need to apply tokenization as well as BPE to the raw text, or is BPE alone enough to preprocess the data?

I think BPE requires some form of pretokenization to work best. If you use OpenNMT/Tokenizer, for instance, both steps can be handled quite easily.
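
Roughly along these lines with the Python wrapper (pyonmttok), if I remember the API correctly; the path, merge count, and sample sentence are just placeholders:

```python
import pyonmttok

# Pretokenization with joiner annotation, so the segmentation can be reversed later.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

# Learn BPE on top of the pretokenized training data (path and merge count are placeholders).
learner = pyonmttok.BPELearner(tokenizer=tokenizer, symbols=32000)
learner.ingest_file("train.en")
bpe_tokenizer = learner.learn("bpe.model")

# Apply pretokenization + BPE in one call.
tokens, _ = bpe_tokenizer.tokenize("The quick brown fox jumps over the lazy dog.")

# Reverse both steps on the translated output.
text = bpe_tokenizer.detokenize(tokens)
```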


@francoishernandez thanks for your time and help.