Does OpenNMT support Arabic?

(Mohamed Zeid) #1

I searched the forum for Arabic but found nothing, so I am glad to be the first one asking: does OpenNMT support Arabic? If yes, I hope you can provide more info on the level of support (which tokenizer do you use, etc.).


(jean.senellart) #2

Hello @mzeid, we have trained some Arabic-English/French models that you can test here:

Tokenization is indeed important: for these models we have been using an in-house Arabic tokenizer. We have also tried BPE tokenization, which was not as good, and a CRF-based tokenization, which produced quite good results. Let us know if you want more details.

(Mohamed Zeid) #3

Thanks Jean for your reply! I am planning to test OpenNMT, hopefully over this weekend. I will report my progress here.

(Lifeng Dong) #4

Hi @jean.senellart, for CN and JA training with OpenNMT, could you recommend a tokenizer? Thanks!

(Mohamed Zeid) #5

Hello Jean,

I am not sure if I understand you correctly. Does this mean that OpenNMT doesn’t have a tokenizer for Arabic and this in-house tokenizer is not available in the open-source build? If this is the case, what tokenizer should we use with OpenNMT?

(jean.senellart) #6

Hello Mohamed, one of our Arabic experts will follow up soon on this thread!

(Mohamed Zeid) #7

Thanks Jean! I look forward to hearing from him/her. I appreciate it.

(Raoum) #8

Hello Mohamed,

There is no Arabic-specific (linguistic) tokenizer in OpenNMT, but there is a language-independent tokenizer (tokenize.lua). It provides either a basic tokenization (separating punctuation and Arabic diacritics) or a BPE tokenization if a BPE model is used.

An example of a basic tokenization using tokenize.lua:

Raw text: وقررت كذلك أن تُعقَد الدورة الثامنة والعشرون في الفترة من 21 أيلول/ سبتمبر إلى 2 تشرين الأول/أكتوبر 2009.
Tokenized: وقررت كذلك أن ت ■ُ■ عق ■َ■ د الدورة الثامنة والعشرون في الفترة من 21 أيلول ■/ سبتمبر إلى 2 تشرين الأول ■/■ أكتوبر 2009 ■.
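
To make the behaviour in this example concrete, here is a minimal Python sketch of that kind of basic tokenization. It is not the code of tokenize.lua; it only imitates the separation visible above, under two assumptions: punctuation and combining marks (which include the Arabic diacritics) become separate tokens, and the ■ marker is attached on each side where the original text had no space. The names `JOINER`, `is_separated` and `basic_tokenize` are hypothetical.

```python
import unicodedata

# Marker copied from the example above; it flags a side of a token
# that was glued to its neighbour in the raw text (assumption).
JOINER = "■"


def is_separated(ch):
    """True for punctuation (P*) and combining marks (M*), e.g. Arabic diacritics."""
    return unicodedata.category(ch)[0] in ("P", "M")


def basic_tokenize(text):
    """Split punctuation and diacritics into their own tokens, marking joins."""
    tokens = []
    word = ""
    for i, ch in enumerate(text):
        if ch.isspace():
            # Whitespace ends the current word, if any.
            if word:
                tokens.append(word)
                word = ""
        elif is_separated(ch):
            if word:
                tokens.append(word)
                word = ""
            # Add the marker on each side that touched a non-space character.
            glued_left = i > 0 and not text[i - 1].isspace()
            glued_right = i + 1 < len(text) and not text[i + 1].isspace()
            tokens.append(
                (JOINER if glued_left else "") + ch + (JOINER if glued_right else "")
            )
        else:
            word += ch
    if word:
        tokens.append(word)
    return " ".join(tokens)


if __name__ == "__main__":
    print(basic_tokenize("a/b c/ d."))  # a ■/■ b c ■/ d ■.
```

Because the marker records where spaces were missing, the raw text can be restored by removing each marker together with the space next to it. A linguistic tokenizer for Arabic would go further (for example, splitting clitics), which this sketch does not attempt.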

(Mohamed Zeid) #9

Thanks @raoum for your reply. I see. It’s very basic indeed, but thanks for letting me know.

(wangfangfang) #10

Hi, will the models used in the live system be made publicly available?