Arabic to English translation model

spatel · May 2, 2019, 5:45pm

Hi,

I’m preparing my dataset for an Arabic -> English translation model and I have a few questions:

Should I re-arrange the tokens in the source sentences? Arabic text is written right-to-left, unlike English, which is written left-to-right. Should I reverse the list of Arabic tokens so that they align with the English tokens? I’m also planning to get the alignments using the fast_align library.
What’s the proper way to tokenize the text? If I use a SentencePiece model to generate tokens for Arabic and English, does OpenNMT-tf handle this if I just give it the lists of source and target tokens without changing their order?
Are there any research papers or links I can go through to learn more about this?

Thanks!

vince62s · May 3, 2019, 6:29am

You don’t need to re-arrange the Arabic side.
SentencePiece is a good choice.
You may find this paper interesting: https://workshop2016.iwslt.org/downloads/qcri-machine-translation.pdf
There are others as well, just google it.
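
To illustrate why no re-arrangement is needed, here is a minimal sketch using only the Python standard library. Unicode text is stored in logical (reading) order, and right-to-left is a display property, so the first token in the list is already the first word a reader encounters. The sample sentence and the whitespace split are assumptions for illustration only; the split stands in for a real tokenizer such as SentencePiece and is not how OpenNMT-tf or SentencePiece works internally.

```python
import unicodedata

# Hypothetical sample sentence ("as-salamu alaykum"), written with escapes
# so that editor bidi rendering cannot obscure the storage order.
sentence = "\u0627\u0644\u0633\u0644\u0627\u0645 \u0639\u0644\u064a\u0643\u0645"

# Whitespace split is a stand-in for a real subword tokenizer.
tokens = sentence.split()

# Index 0 is the word that is read first, even though a terminal or
# editor draws it on the right-hand side of the line.
first_token = tokens[0]
first_char = sentence[0]

# Arabic letters carry the bidirectional class "AL" (Arabic Letter).
# Renderers use this class to choose the display direction; it does not
# change the order in which characters are stored.
bidi_class = unicodedata.bidirectional(first_char)

print(len(tokens), "tokens")
print(unicodedata.name(first_char), bidi_class)
```

Because the stored order already matches the reading order, reversing the token list would hand the model the sentence backwards rather than aligning it with the English side.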

spatel · May 5, 2019, 2:53pm

Thank you! That was really helpful.