OpenNMT Forum

Arabic machine translation

Hello sir,

I am building a translation model between Arabizi (an Arabic dialect written in Latin script) and Arabic. The problem is that I am getting a BLEU score of 0.0. I don't know whether OpenNMT supports the Arabic language or not. Could someone please help?


OpenNMT works with any Unicode language. If you describe your data preparation and training process, people may be able to assist you.


Hello, I have a dataset of 3,000 sentences and I am using the quickstart. Here is the configuration I am using.

For preprocessing:
th preprocess.lua -train_src data/src-train.tok -train_tgt data/tgt-train.tok -valid_src data/src-val.tok -valid_tgt data/tgt-val.tok -save_data data/demo

For training:
th train.lua -encoder_type brnn -global_attention dot -max_batch_size 64 -data data/demo-train.t7 -save_model model_trans/demo-model

Sorry, 3,000 sentences is usually not enough to train an NMT system.

If possible, you should try with a couple hundred thousand sentences to start getting good results.

Hello everyone. I started using OpenNMT to translate English-to-Arabic datasets, but the results were very bad and the prediction scores were always negative. Could you please help me find the best way to prepare the corpus before preprocessing and training, or help me understand how OpenNMT preprocesses and trains on an Arabic corpus?

How much data did you use for the training? How did you prepare your data so far?

It's about 14,219 sentences.
First I remove invalid characters,
then remove redundancy,
then I start to clean the files as follows:
1- Put spaces before and after words containing the special character '&', whether or not the character is followed by a space.
2- Replace full stops/periods at the end of each line with space + full stop/period, remove unused words (i.e. identical lines), and remove the space created previously (trimming the two lines) before words containing the '&' character.
3- Replace numbers with ##NUM##.
4- Put spaces around numbers and remove redundant spaces.
5- Put spaces around foreign sequences / "not-target" characters found in the target file.
6- Remove unwanted spaces before/after a sentence.
7- Insert spaces before and after some predefined special characters.
8- Remove unwanted spaces before/after a sentence again.
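The cleaning steps above could be sketched roughly as follows in Python. The regexes here are illustrative stand-ins, not the poster's actual scripts, and steps 2 and 5 (line deduplication and foreign-character handling) are omitted for brevity:

```python
import re

def clean_line(line):
    """Illustrative cleanup for one line, following the steps above."""
    # Put spaces around the '&' character (step 1).
    line = re.sub(r"\s*&\s*", " & ", line)
    # Separate a final full stop from the last word (step 2, in part).
    line = re.sub(r"\.$", " .", line)
    # Replace numbers with a placeholder token (step 3).
    line = re.sub(r"\d+", "##NUM##", line)
    # Put spaces around predefined special characters (step 7).
    line = re.sub(r"([!?,;:])", r" \1 ", line)
    # Collapse redundant spaces and trim (steps 4, 6, 8).
    line = re.sub(r"\s+", " ", line).strip()
    return line
```

Applying the same function to both the source and target files keeps the tokenization consistent on the two sides.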
Then, after cleaning, I generate suggested abbreviation lists for both the source and target files in the corpus, according to one criterion. The two output files are then revised by linguists.

Then I perform segmentation (converting lines into segments) by converting paragraphs into segments:
1- Remove the dot at the end of a paragraph/line.
2- This generates many empty lines.
Then I filter the segments (not sentences):
1- Remove empty lines.
2- Put spaces around special characters if they are in the middle of the sentence, or put a space after the character if it starts the sentence.
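The segment-filtering step could be sketched like this; the set of special characters is an assumption for illustration:

```python
import re

def filter_segments(lines):
    """Illustrative segment filtering: drop empty lines and space out
    leading and medial special characters."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:  # 1- remove empty lines
            continue
        # 2- put a space after a special character that starts the line...
        line = re.sub(r"^([!?&])", r"\1 ", line)
        # ...and spaces around one in the middle of the line.
        line = re.sub(r"(?<=\S)([!?&])(?=\S)", r" \1 ", line)
        out.append(re.sub(r"\s+", " ", line).strip())
    return out
```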

Then I clean the test set (put a white space before a full stop, remove unwanted spaces in each sentence and in the whole file, and put white spaces before and after special characters like '!') and replace numbers with the tag ##NUM##, preserving the index of each number in the file.
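The number-masking step, where each number's position is recorded so it can be restored in the translation output, could be sketched as follows. This is a minimal assumption-laden sketch, not the poster's actual code:

```python
import re

def mask_numbers(line):
    """Replace each number with ##NUM## and record the originals in order,
    so they can be restored in the translated output."""
    numbers = re.findall(r"\d+", line)
    masked = re.sub(r"\d+", "##NUM##", line)
    return masked, numbers

def unmask_numbers(masked, numbers):
    """Put the recorded numbers back, one per ##NUM## token, in order."""
    it = iter(numbers)
    return re.sub(r"##NUM##", lambda m: next(it), masked)
```

This assumes the ##NUM## tokens survive translation in the same order, which generally holds for short sentences but is worth checking in the model output.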

Here are two screenshots of the source and target files after my preparation, before I run the files through the OpenNMT tool.

This is the same issue as the original poster's: 14,219 sentences is usually not enough to train an NMT system.

Hey @guillaumekln, is it possible to get good results with the help of pretrained vectors? For example, 10,000 sentence pairs with fastText pretrained embeddings.

Pretrained embeddings should help, but I don't know if you can get good results in the end. You should probably try it out.

Did you try it out, and what was the result?

Is there any public dataset for Arabic > Arabizi? I would be grateful if you could let me know where I can find such a dataset.