OpenNMT Forum

Running Sentence Piece Model on English --> Arabic gives bad results

Hi There,

I’m using the OpenNMT-tf repo to build an English-Arabic model. Previously, I used the same setup to build an English-Spanish model, which gave pretty decent results (BLEU scores in the mid 60s). I had used the following steps for English-Spanish:
Tokenize (OpenNMT Tokenizer) - BPE - Learn Alignments - Learn Model with Alignments

This has worked well for Domain Adaptation (DA) as well. Now, when I run the same steps for English to Arabic with the default OpenNMT tokenization, I get okay-ish scores of ~25 BLEU (not bad compared to the research paper results I have seen). A lot of the topics here suggest trying SentencePiece tokenization for Arabic, which seems to have worked well in the past. In short, when I run SentencePiece on my dataset, for some weird reason the predictions come out in English, even though I want them to be in Arabic.

Below are the steps I have run:

  • Downloaded the spm library from Google’s SentencePiece

  • Combined the train datasets into one file to learn a joint SentencePiece model

    cat train_large_en.txt train_large_ar.txt > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/sp_input_combine_file.txt
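
    As a quick sanity check, the combined file should contain the sum of the two line counts, e.g.:

    wc -l train_large_en.txt train_large_ar.txt /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/sp_input_combine_file.txt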

  • Trained the model on the combined English-Arabic dataset (it’s a combination of subsets from the QCIR Domain Corpus, UN parallel data, Arab Acquis data, and OpenSubtitles; about 2M sentences in total)

    spm_train --input=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/sp_input_combine_file.txt --model_prefix=spm --vocab_size=32000 --character_coverage=0.9995
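
    Since --character_coverage controls how much of the Arabic script the model keeps, a quick round-trip check on a sample sentence can confirm the joint model preserves Arabic text (the sentence below is just an illustrative example; spm.model is the file written by --model_prefix=spm):

    echo "مرحبا بالعالم" | spm_encode --model=spm.model | spm_decode --model=spm.model

    If the decoded output matches the input, the Arabic characters in it are covered.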

  • Generated the English and Arabic vocabulary files
    spm_encode --model=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/spm.model --generate_vocabulary < train_large_en.txt > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp_vocab/vocab.en

    spm_encode --model=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm.model --generate_vocabulary < train_large_ar.txt > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp_vocab/vocab.ar
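
    The --generate_vocabulary output has one piece and its frequency per line (tab-separated), while the vocabulary files referenced in the config below are one token per line, so the frequency column has to be stripped. A sketch of that cleanup, assuming this is how spm_clean_src.txt and spm_clean_trg.txt were produced:

    cut -f1 /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp_vocab/vocab.en > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm_clean_src.txt
    cut -f1 /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp_vocab/vocab.ar > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm_clean_trg.txt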

  • Encoded the train, dev, and test files (English and Arabic) with the learned spm model (the English test file is shown below; the other splits are encoded the same way)
    spm_encode --model=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm.model --vocabulary=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp_vocab/vocab.en < /home/ubuntu/mayub/datasets/raw/arabic/large_v2/test_large_en.txt > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/test_sp_applied.en
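
    For example, the Arabic side goes through the same command with the Arabic vocabulary (the raw-file path here is assumed by analogy with the English one):

    spm_encode --model=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm.model --vocabulary=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp_vocab/vocab.ar < /home/ubuntu/mayub/datasets/raw/arabic/large_v2/test_large_ar.txt > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/test_sp_applied.ar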

  • (Optional: generated the alignments file)

  • Passed the above spm-encoded files to OpenNMT-tf for training the NMT model. The config file is as below:

    model_dir: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/en_ar_transformer_b/

    data:
      train_features_file: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/train_sp_applied.en
      train_labels_file: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/train_sp_applied.ar
      eval_features_file: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/dev_sp_applied.en
      eval_labels_file: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/dev_sp_applied.ar
      source_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm_clean_src.txt
      target_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm_clean_trg.txt

    params:
      guided_alignment_type: ce
      guided_alignment_weight: 1
      replace_unknown_target: true

    train:
      save_checkpoints_steps: 1000
      keep_checkpoint_max: 3
      save_summary_steps: 1000
      train_steps: 25000
      batch_size: 1024
      effective_batch_size: null
      maximum_features_length: 100
      maximum_labels_length: 100

    eval:
      eval_delay: 1800
      external_evaluators: [BLEU, BLEU-detok]

  • Train command:
    onmt-main train_and_eval --model_type Transformer --config /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/actual_run_3_sp.yml --auto_config --num_gpus 1 2>&1 | tee /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/en_ar_transformer_large_sp.log
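
    The predictions from this model are still SentencePiece pieces, so to read them as plain Arabic text they can be decoded back with the same spm model, for example (the prediction file names below are just placeholders):

    onmt-main infer --model_type Transformer --config /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/actual_run_3_sp.yml --auto_config --features_file /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/test_sp_applied.en --predictions_file predictions_sp.ar
    spm_decode --model=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm.model < predictions_sp.ar > predictions.ar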

Sample Predictions File -

All the files I’m using are available here for reference. I’m not sure if I’m missing something I should be doing. Please advise.

Thanks !

Mohammed Ayub

Hi,

You should probably try training separate SentencePiece models for this case.
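
For reference, a minimal sketch of what that could look like (the model prefixes, vocab sizes, and --character_coverage=1.0 here are only illustrative and may need tuning):

    spm_train --input=train_large_en.txt --model_prefix=spm_en --vocab_size=32000 --character_coverage=1.0
    spm_train --input=train_large_ar.txt --model_prefix=spm_ar --vocab_size=32000 --character_coverage=1.0

Each side is then encoded with its own model (spm_en.model for the English files, spm_ar.model for the Arabic files), and the two resulting vocabularies are used as source_words_vocabulary and target_words_vocabulary respectively.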