Does OpenNMT support Arabic?

mzeid · April 5, 2017, 10:21am

I searched the forum for Arabic but found nothing, so I am glad to be the first one asking: does OpenNMT support Arabic? If yes, I hope you can provide more information on the level of support (which tokenizer do you use, etc.).

Thanks,
mzeid

jean.senellart · April 5, 2017, 9:52pm

Hello @mzeid, we have trained some Arabic-English/French models that you can test here: https://demo-pnmt.systran.net/production#/translation.

Tokenization is indeed important. For these models we have been using an in-house Arabic tokenizer. We also tried BPE tokenization, which was not as good, and a CRF-based tokenization, which produced quite good results. Let us know if you want more details.

mzeid · April 6, 2017, 12:08am

Thanks Jean for your reply! I am planning to test OpenNMT, hopefully over this weekend. I will report my progress here.

lifeng · April 7, 2017, 7:40am

Hi @jean.senellart, could you recommend a tokenizer for CN and JA training with OpenNMT? Thanks!

mzeid · April 13, 2017, 9:37pm

Hello Jean,

I am not sure if I understand you correctly. Does this mean that OpenNMT doesn’t have a tokenizer for Arabic and this in-house tokenizer is not available in the open-source build? If this is the case, what tokenizer should we use with OpenNMT?

jean.senellart · April 14, 2017, 6:15am

Hello Mohamed, one of our Arabic experts will follow up on this thread soon!

mzeid · April 14, 2017, 6:48am

Thanks Jean! I look forward to hearing from him/her. I appreciate it.

raoum · April 19, 2017, 11:40am

Hello Mohamed,

There is no Arabic-specific (linguistic) tokenizer in OpenNMT, but there is a language-independent tokenizer (tokenize.lua). It provides basic tokenization (it separates punctuation and Arabic diacritics), or BPE tokenization if a BPE model is used.

An example of a basic tokenization using tokenize.lua:

Raw text: وقررت كذلك أن تُعقَد الدورة الثامنة والعشرون في الفترة من 21 أيلول/ سبتمبر إلى 2 تشرين الأول/أكتوبر 2009.
Tokenized: وقررت كذلك أن ت ￭ُ￭ عق ￭َ￭ د الدورة الثامنة والعشرون في الفترة من 21 أيلول ￭/ سبتمبر إلى 2 تشرين الأول ￭/￭ أكتوبر 2009 ￭.
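
Judging from the example above, the ￭ marker records where a token was attached to its neighbor in the raw text, so the original spacing can be restored afterwards. Here is a minimal Python sketch of that idea. It is only an illustration: `detokenize` is a hypothetical helper, not part of OpenNMT, and it assumes tokens are separated by spaces and that the marker appears only at the start or end of a token.

```python
JOINER = "\uffed"  # the "￭" marker seen in the tokenized example (assumed to be U+FFED)


def detokenize(tokenized: str, joiner: str = JOINER) -> str:
    """Rebuild text from space-separated tokens annotated with joiner markers.

    Hypothetical helper for illustration; not the OpenNMT implementation.
    """
    pieces = []
    attach_next = False  # True when the previous token ended with the marker
    for token in tokenized.split():
        # A leading marker means "no space between me and the previous token".
        attach_prev = token.startswith(joiner)
        if attach_prev:
            token = token[len(joiner):]
        # A trailing marker means "no space between me and the next token".
        ends_with_joiner = token.endswith(joiner)
        if ends_with_joiner:
            token = token[:-len(joiner)]
        # Insert a space only when neither side asked to be glued.
        if pieces and not (attach_prev or attach_next):
            pieces.append(" ")
        pieces.append(token)
        attach_next = ends_with_joiner
    return "".join(pieces)


if __name__ == "__main__":
    sample = f"Hello {JOINER}, world {JOINER}/{JOINER} earth {JOINER}."
    print(detokenize(sample))
```

Applied to the tokenized line above, the same rule would glue the diacritics back onto their letters and reattach "/" and "." where the raw text had no space.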

mzeid · April 20, 2017, 11:01pm

Thanks @raoum for your reply. I see. It’s very basic indeed, but thanks for letting me know.

wangfangfang · February 5, 2018, 6:29am

Hi, will the models used in the live system be made openly available?