Running Sentence Piece Model on English --> Arabic gives bad results

Hi There,

I’m using the OpenNMT-tf repo to build an English-Arabic model. Previously, I used the same setup to build an English-Spanish model, which gave pretty decent results (BLEU scores in the mid 60s). I had used the following steps for English-Spanish:
Tokenize (OpenNMT Tokenizer) -> BPE -> Learn Alignments -> Train Model with Alignments
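For reference, here is a rough sketch of that kind of pipeline, assuming already-tokenized text and using subword-nmt for the BPE step and fast_align for the alignments as illustrative stand-ins (not necessarily the exact tools or flags from my runs):

    # 1) Tokenize source and target with the OpenNMT Tokenizer (not shown here).

    # 2) Learn a joint BPE model on the tokenized training data and apply it.
    cat train.tok.en train.tok.es | subword-nmt learn-bpe -s 32000 > bpe.codes
    subword-nmt apply-bpe -c bpe.codes < train.tok.en > train.bpe.en
    subword-nmt apply-bpe -c bpe.codes < train.tok.es > train.bpe.es

    # 3) Learn word alignments with fast_align (expects "source ||| target" lines).
    paste train.bpe.en train.bpe.es | awk -F'\t' '{print $1" ||| "$2}' > corpus.en-es
    fast_align -i corpus.en-es -d -o -v > forward.align
    fast_align -i corpus.en-es -d -o -v -r > reverse.align
    atools -i forward.align -j reverse.align -c grow-diag-final-and > train.align

    # 4) Train the Transformer with OpenNMT-tf, passing train.align for guided alignment.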

This has worked well for Domain Adaptation (DA) as well. Now, when I run the same steps for English to Arabic with the default OpenNMT tokenization, I get okay-ish scores of ~25 BLEU (not bad compared to the research paper results I have seen). A lot of the topics here suggest trying SentencePiece tokenization for Arabic, which seems to have worked in the past. In short, when I run SentencePiece on my dataset, for some weird reason the predictions turn out to be in English, whereas I want them to be in Arabic.

Below are the steps I have run-

  • Downloaded the spm library from Google SentencePiece

  • Combined the training data into one file to learn a joint SentencePiece model

    cat train_large_en.txt train_large_ar.txt > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/sp_input_combine_file.txt

  • Trained the SentencePiece model on the combined English-Arabic dataset (it’s a combination of subsets from the QCIR Domain Corpus, UN parallel data, Arab Acquis data, and OpenSubtitles; about 2M sentences in total)

    spm_train --input=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/sp_input_combine_file.txt --model_prefix=spm --vocab_size=32000 --character_coverage=0.9995

  • Generated the English and Arabic vocabulary files
    spm_encode --model=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm.model --generate_vocabulary < train_large_en.txt > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp_vocab/vocab.en

    spm_encode --model=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm.model --generate_vocabulary < train_large_ar.txt > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp_vocab/vocab.ar

  • Encoded the train, dev, and test files (English and Arabic) with the learnt spm model
    spm_encode --model=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm.model --vocabulary=/home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp_vocab/vocab.en < /home/ubuntu/mayub/datasets/raw/arabic/large_v2/test_large_en.txt > /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/test_sp_applied.en

  • (Optional: generated the alignments file)

  • Passed the above spm-encoded files to OpenNMT-tf for training the NMT model. The config file is as below:

model_dir: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/en_ar_transformer_b/

data:
  train_features_file: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/train_sp_applied.en
  train_labels_file: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/train_sp_applied.ar
  eval_features_file: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/dev_sp_applied.en
  eval_labels_file: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/dev_sp_applied.ar
  source_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm_clean_src.txt
  target_words_vocabulary: /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/sp/spm_clean_trg.txt

params:
  guided_alignment_type: ce
  guided_alignment_weight: 1
  replace_unknown_target: true

train:
  save_checkpoints_steps: 1000
  keep_checkpoint_max: 3
  save_summary_steps: 1000
  train_steps: 25000
  batch_size: 1024
  effective_batch_size: null
  maximum_features_length: 100
  maximum_labels_length: 100

eval:
  eval_delay: 1800
  external_evaluators: [BLEU, BLEU-detok]

  • train command:
    onmt-main train_and_eval --model_type Transformer --config /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/actual_run_3_sp.yml --auto_config --num_gpus 1 2>&1 | tee /home/ubuntu/mayub/datasets/in_use/arabic/actual_run3_sp/en_ar_transformer_large_sp.log
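As a quick sanity check on the encoding step above, the SPM-encoded files can be round-tripped through the same model to confirm the pieces decode back to the original text (paths are shortened here; adjust to the full paths used above):

    # Decode the first few SentencePiece-encoded test lines back to plain text...
    head -n 3 test_sp_applied.en | spm_decode --model=sp/spm.model > roundtrip_sample.en
    # ...and compare them against the corresponding raw source lines.
    head -n 3 test_large_en.txt
    cat roundtrip_sample.en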

Sample Predictions File -

All the files that I’m using are available here for reference. I’m not sure if I’m missing something that I should be doing. Please advise.

Thanks !

Mohammed Ayub


Hi,

You should probably try training separate SentencePiece models for this case.

I tried training separate SentencePiece models for English and Arabic, and also generated separate vocabularies, but did not have any luck getting good predictions.

Did you train it further than 25000 steps?


Yes. I ran it for another 25,000 steps (50,000 in total). The BLEU was somehow 0.00 from the start; my guess is it’s because I’m setting something incorrectly with the vocabularies. (I’m using version 1.24.1)

My model config is as below:

vocab_32k.ar and vocab_32k.en: I have built them separately as suggested (shown below).

Below are the spm commands I run to generate the features and labels files:

– Train model
spm_train --input=train_large_ar_lower.txt --model_prefix=spm_ar_bpe --vocab_size=32000 --character_coverage=0.9995 --model_type=bpe

– Generate vocabulary (I use this file in the config)
spm_encode --model=spm_ar_bpe.model --generate_vocabulary < train_large_ar_lower.txt > vocab_32k.ar

– Encode file with vocabulary restriction
spm_encode --model=spm_ar_bpe.model --vocabulary=vocab_32k.ar --vocabulary_threshold=50 < train_large_ar_lower.txt > train_sp_bpe_applied.ar

(The same steps are done for English, and for the train and dev files.)
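For reference, the generated vocabulary files can be inspected directly; each line produced by spm_encode --generate_vocabulary should be a piece and its frequency, separated by a tab (commands below are just illustrative):

    # Peek at the first entries and count how many pieces each vocabulary has.
    head -n 5 vocab_32k.en
    wc -l vocab_32k.en vocab_32k.ar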

Also, vocab_32k.en looks something like this (I don’t see <unk>, <s>, </s> in this file; is that a problem?):

[screenshot of the vocab_32k.en file]

Thanks !

If this is the vocabulary format you are passing to OpenNMT-tf, it is indeed incorrect.

Check the documentation to convert a SentencePiece vocabulary to the format expected by OpenNMT-tf: https://github.com/OpenNMT/OpenNMT-tf/blob/v1.24.1/docs/data.md#via-sentencepiece-vocabulary
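For reference, the gist of the conversion is that OpenNMT-tf expects a plain list of tokens, one per line, whereas SentencePiece writes a piece and a score (or frequency) separated by a tab. A rough sketch of stripping the second column (the linked page is authoritative, including for how special tokens are handled in your version):

    # Keep only the piece column of the SentencePiece vocabularies; the file
    # names below match the ones used earlier in this thread.
    cut -f 1 vocab_32k.en > vocab_32k.en.converted
    cut -f 1 vocab_32k.ar > vocab_32k.ar.converted

The converted files are what source_words_vocabulary and target_words_vocabulary should then point to.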


Great, that should do the trick. I will let you know how it goes in some time.

Thanks !

It ran successfully. However, SentencePiece for English -> Arabic did not do as well as the usual OpenNMT-tf process: with SentencePiece the BLEU was ~20, while with the default OpenNMT-tf tokenization I get ~28.
