Does OpenNMT support Arabic?

I searched the forum for Arabic but found nothing, so it seems I am the first to ask: does OpenNMT support Arabic? If so, could you provide more detail on the level of support, for example which tokenizer you use?


Hello @mzeid, we have trained some Arabic-English/French models that you can test here:

Tokenization is indeed important. For these models we have been using an in-house Arabic tokenizer. We also tried BPE tokenization, which was not as good, and a CRF-based tokenization, which produced quite good results. Let us know if you want more details.


Thanks Jean for your reply! I am planning to test OpenNMT, hopefully over this weekend. I will report my progress here.

Hi @jean.senellart, could you recommend a tokenizer for CN and JA training with OpenNMT? Thanks!

Hello Jean,

I am not sure if I understand you correctly. Does this mean that OpenNMT doesn’t have a tokenizer for Arabic and this in-house tokenizer is not available in the open-source build? If this is the case, what tokenizer should we use with OpenNMT?

Hello Mohamed, one of our Arabic experts will follow up soon on this thread!


Thanks Jean! I look forward to hearing from him/her. I appreciate it.

Hello Mohamed,

There is no Arabic-specific (linguistic) tokenizer in OpenNMT, but there is a language-independent tokenizer (tokenize.lua). It provides either a basic tokenization (splitting off punctuation and Arabic diacritics) or a BPE tokenization if a BPE model is supplied.

An example of a basic tokenization using tokenize.lua:

Raw text: وقررت كذلك أن تُعقَد الدورة الثامنة والعشرون في الفترة من 21 أيلول/ سبتمبر إلى 2 تشرين الأول/أكتوبر 2009.
Tokenized: وقررت كذلك أن ت ■ُ■ عق ■َ■ د الدورة الثامنة والعشرون في الفترة من 21 أيلول ■/ سبتمبر إلى 2 تشرين الأول ■/■ أكتوبر 2009 ■.
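
To make the pattern in the example above easier to follow, here is a minimal Python sketch of that kind of basic tokenization. It is not tokenize.lua and does not claim to match its behaviour in every case. The function names, the use of Unicode categories to detect punctuation and diacritics, and the joiner character (copied as it is displayed above) are all assumptions made for illustration. The idea is that each punctuation mark or diacritic becomes its own token, and a joiner is attached on every side where the original text had no space, so the raw text can be restored afterwards.

```python
import unicodedata

# Joiner marker as it is displayed in the example above (assumption:
# the real tokenizer may use a different code point for it).
JOINER = "■"


def is_separator(ch):
    """Treat punctuation, symbols and non-spacing marks as split points.

    Arabic short-vowel diacritics such as fatha and damma have the Unicode
    category 'Mn' (non-spacing mark).
    """
    cat = unicodedata.category(ch)
    return cat.startswith("P") or cat.startswith("S") or cat == "Mn"


def basic_tokenize(text):
    """Split on whitespace, then split off separators with joiner marks."""
    tokens = []
    for chunk in text.split():
        word = ""
        for i, ch in enumerate(chunk):
            if not is_separator(ch):
                word += ch
                continue
            if word:
                tokens.append(word)
                word = ""
            # Mark the left side only if the previous character is a word
            # character; a preceding separator already carries a right joiner.
            left = JOINER if i > 0 and not is_separator(chunk[i - 1]) else ""
            # Mark the right side if anything follows inside this chunk.
            right = JOINER if i < len(chunk) - 1 else ""
            tokens.append(left + ch + right)
        if word:
            tokens.append(word)
    return tokens


def detokenize(tokens):
    """Undo basic_tokenize by removing each joiner and its adjacent space."""
    text = " ".join(tokens)
    return text.replace(" " + JOINER, "").replace(JOINER + " ", "")
```

Calling `" ".join(basic_tokenize(raw))` on a sentence gives a space-separated string in the same style as the "Tokenized" line, and `detokenize` reverses it. The sketch assumes the input does not already contain the joiner character.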


Thanks @raoum for your reply. I see. It’s very basic indeed, but thanks for letting me know.

Hi, will the models used in the live system be made openly available?