Hi,
I’m preparing my dataset for an Arabic -> English translation model and I have a few questions:
- Should I re-arrange the tokens in the source sentences? Arabic text runs right->left, unlike English, where tokens run left->right. Should I reverse the list of Arabic tokens to make them align with the English tokens? I’m also planning to get the alignments using the fast_align library.
- What’s the proper way to tokenize the text? If I use a SentencePiece model to generate tokens for Arabic and English, does OpenNMT-tf handle this if I just give it the lists of source and target tokens without modifying their order?
- Are there any research papers or links I can go through to get a better idea of this?
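
For context, here is a minimal sketch of how I would write one sentence pair for fast_align, assuming the tokens are kept in the order they are stored in the string and no reversal is applied. The helper name `to_fast_align_line` and the sample sentence pair are only illustrative; the `source ||| target` layout is the input format fast_align expects.

```python
def to_fast_align_line(src_tokens, tgt_tokens):
    """Join one tokenized sentence pair into a 'source ||| target' line."""
    # Tokens are written in the order given; nothing is reversed here.
    return " ".join(src_tokens) + " ||| " + " ".join(tgt_tokens)


# Illustrative pair (hypothetical example data, not from my dataset).
src = ["أنا", "أحب", "القراءة"]
tgt = ["I", "love", "reading"]

line = to_fast_align_line(src, tgt)
print(line)
```

How the line is displayed depends on the terminal or editor, since right-to-left rendering only affects the display and not the order of the tokens in the list.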
Thanks!