Arabic to English translation model

Hi,

I’m preparing my dataset for an Arabic -> English translation model and I have a few questions:

  1. Should I re-arrange the tokens in the source sentences? Arabic text runs right-to-left, whereas English runs left-to-right. Should I reverse the list of Arabic tokens so that they align with the English tokens? I’m also planning to get the alignments using the fast_align library.
  2. What’s the proper way to tokenize the text? If I use a SentencePiece model to generate tokens for both Arabic and English, does OpenNMT-tf handle this if I just give it the lists of source and target tokens without any re-ordering?
  3. Are there any research papers or links I can go through to learn more about this?

Thanks!

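You don’t need to re-arrange the Arabic side. Right-to-left is only how the text is displayed; the characters are stored in reading order, so the first token in the list is already the first word of the sentence.
SentencePiece is a good choice.
You may find this paper interesting: https://workshop2016.iwslt.org/downloads/qcri-machine-translation.pdf
There are others as well; a search will turn them up.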
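
If it helps to see why no reversal is needed, here is a minimal sketch using only the Python standard library. The sentence is just an illustrative example (roughly "I love reading books"), not taken from any dataset. It shows that splitting an Arabic string yields tokens in reading order, and that right-to-left is a property the renderer derives from each character's bidirectional class rather than from how the string is stored.

```python
import unicodedata

# Illustrative sentence (an assumption for this sketch): "I love reading books".
arabic = "أحب قراءة الكتب"

# Whitespace split, standing in for whatever tokenizer is actually used.
tokens = arabic.split()

# tokens[0] is the first word a reader pronounces ("I love"), even though
# a text renderer draws it at the right-hand edge of the line.
for position, token in enumerate(tokens):
    # Bidirectional class of the token's first character; Arabic letters
    # report "AL", which is what tells a renderer to lay them out right-to-left.
    bidi_class = unicodedata.bidirectional(token[0])
    print(position, token, bidi_class)
```

Reversing the list would therefore feed the model the sentence backwards. Depending on your terminal or editor, the printed line may look re-ordered on screen, but the list indices reflect the true order.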

Thank you! That was really helpful.