I searched the forum for Arabic but found nothing, so I may be the first one asking: does OpenNMT support Arabic? If yes, I hope you can provide more info on the level of support (which tokenizer do you use, etc.).
Thanks,
mzeid
Hello @mzeid, we have trained some Arabic-English/French models that you can test here: https://demo-pnmt.systran.net/production#/translation.
Tokenization is indeed important. For these models we have been using an in-house Arabic tokenizer. We also tried BPE tokenization, which was not as good, and a CRF-based tokenization, which produced quite good results. Let us know if you want more details.
Thanks Jean for your reply! I am planning to test OpenNMT, hopefully over this weekend. I will report my progress here.
Hi @jean.senellart, could you recommend a tokenizer for CN and JA training with OpenNMT? Thanks!
Hello Jean,
I am not sure if I understand you correctly. Does this mean that OpenNMT doesn’t have a tokenizer for Arabic and this in-house tokenizer is not available in the open-source build? If this is the case, what tokenizer should we use with OpenNMT?
Hello Mohamed, one of our Arabic experts will follow up on this thread soon!
Thanks Jean! I look forward to hearing from him/her. I appreciate it.
Hello Mohamed,
OpenNMT has no Arabic-specific (linguistic) tokenizer, but it does have a language-independent tokenizer (tokenize.lua). It provides either a basic tokenization (splitting off punctuation and Arabic diacritics) or a BPE tokenization if a BPE model is used.
An example of a basic tokenization using tokenize.lua:
Raw text: وقررت كذلك أن تُعقَد الدورة الثامنة والعشرون في الفترة من 21 أيلول/ سبتمبر إلى 2 تشرين الأول/أكتوبر 2009.
Tokenized: وقررت كذلك أن ت ■ُ■ عق ■َ■ د الدورة الثامنة والعشرون في الفترة من 21 أيلول ■/ سبتمبر إلى 2 تشرين الأول ■/■ أكتوبر 2009 ■.
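To make the pattern in the example above easier to follow, here is a minimal Python sketch that produces the same kind of output. It is not the tokenize.lua implementation, only an approximation based on the example: the function names are hypothetical, the ■ joiner is taken from the tokenized line above, and the rule for deciding what gets split off (anything that is not a Unicode letter or digit) is my assumption.

```python
import unicodedata

# Joiner marker, taken from the tokenized example above.
JOINER = "■"


def is_word_char(ch):
    # Letters and digits stay together. Combining marks (which include
    # the Arabic diacritics) and punctuation are not word characters.
    return unicodedata.category(ch)[0] in ("L", "N")


def split_word(word):
    # Split one whitespace-delimited word into runs of letters/digits
    # and single non-word characters.
    pieces = []
    for ch in word:
        if is_word_char(ch) and pieces and is_word_char(pieces[-1][-1]):
            pieces[-1] += ch
        else:
            pieces.append(ch)
    return pieces


def basic_tokenize(text):
    # Assumed rule: a split-off symbol carries a joiner on each side
    # where it was attached to the neighbouring text in the input.
    tokens = []
    for word in text.split():
        pieces = split_word(word)
        for i, piece in enumerate(pieces):
            if is_word_char(piece[0]):
                tokens.append(piece)
                continue
            left = JOINER if i > 0 else ""
            has_word_after = i + 1 < len(pieces) and is_word_char(pieces[i + 1][0])
            right = JOINER if has_word_after else ""
            tokens.append(left + piece + right)
    return " ".join(tokens)


print(basic_tokenize("1/2 done."))  # prints: 1 ■/■ 2 done ■.
```

The joiners record where whitespace was absent in the raw text, so the original string can be rebuilt by removing each joiner together with the space next to it.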
Hi, will the models used in the live system be made openly available?