I’m preparing my dataset for an Arabic -> English translation model and I have a few questions:
- Should I rearrange the tokens in the source sentences? Arabic text runs right -> left, unlike English, where tokens run left -> right. Should I reverse the list of Arabic tokens to make them align with the English tokens? I’m also planning to get the alignments using
- What’s the proper way to tokenize the text? If I use a SentencePiece model to generate tokens for Arabic and English, does OpenNMT-tf handle this if I just give it the lists of source and target tokens without any rearrangement?
- Are there any research papers or links I can go through to get a better idea of this?