OpenNMT Forum

Lower BLEU score compared to Fairseq

Hello, I am trying to train an English-Spanish translation model with 35 million sentences.

I created a 32K vocabulary using SentencePiece.

Fairseq Training:
I preprocessed the data (converted it to binary index files) following the steps here, using the vocab from SentencePiece, and trained with the command:

fairseq-train --fp16 fairseq/data-bin/en-es/ --source-lang en --target-lang es --arch transformer_wmt_en_de --share-all-embeddings --criterion label_smoothed_cross_entropy --optimizer adam --adam-betas '(0.9, 0.98)' --warmup-updates 4000 --save-dir fairseq/checkpoints/en-es/ --no-progress-bar --log-interval 1000 --ddp-backend=no_c10d --clip-norm 0.0 --lr-scheduler inverse_sqrt --lr 0.0007 --label-smoothing 0.1 --max-tokens 4096 --update-freq 8
This uses the transformer_base architecture. After 15 epochs I got a BLEU score of 33.9 on the WMT13 test set.

OpenNMT-py:
I tried to replicate the same setup here.

  1. Preprocessing command:
    onmt_preprocess -train_src ../data/parallel_data/training_data/processed_data/bpe/train.en -train_tgt ../data/parallel_data/training_data/processed_data/bpe/ -valid_src ../data/parallel_data/valid_data/bpe/valid.en -valid_tgt ../data/parallel_data/valid_data/bpe/ -save_data data/en-es/ --num_threads 16 --src_vocab ../vocabulary/opennmt_vocab/sentencepiece_en-es.vocab --tgt_vocab ../vocabulary/opennmt_vocab/sentencepiece_en-es.vocab --src_vocab_size 32000 --tgt_vocab_size 32000 --share_vocab

  2. Training (I tried multiple configurations):
    onmt_train -data data/en-es/ -save_model checkpoints/en-es/ -layers 6 -rnn_size 512 -word_vec_size 512 -share_decoder_embeddings -share_embeddings -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 75000 --max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 5000 -save_checkpoint_steps 5000 -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7 -log_file log/en-es/en_en.log -exp transformer_base_en_es -report_every 1000

This gave a BLEU of 30.9. I then changed the warmup steps to 4000 and accum_count to 8, which gave a BLEU of 31.2.
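For reference, the effective batch sizes of the two setups differ in the first run, which may explain part of the gap. This is a rough comparison, assuming token-based batching on 8 GPUs with the flag values from the commands above:

```python
# Rough effective tokens-per-update, assuming token batching on 8 GPUs.
# fairseq: --max-tokens 4096 x --update-freq 8 x 8 GPUs
fairseq_tokens = 4096 * 8 * 8
# OpenNMT first run: -batch_size 4096 x -accum_count 2 x 8 GPUs
onmt_run1_tokens = 4096 * 2 * 8
# OpenNMT second run: -accum_count raised to 8
onmt_run2_tokens = 4096 * 8 * 8

print(fairseq_tokens, onmt_run1_tokens, onmt_run2_tokens)
# → 262144 65536 262144
```

So only the second OpenNMT run matches fairseq's effective batch size, if I am reading the flags right.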

These BLEU scores are lower than fairseq's, and they were all measured with the best OpenNMT checkpoint (the one with the lowest validation perplexity). All trainings were done on 8 GPUs.

I observed a similar pattern on a different task as well (OpenNMT-py scoring a bit lower than fairseq). Is there a reason for this gap, or am I missing something?

Thanks in advance.


Did you shuffle your dataset before preprocessing?

Just FYI: with a medium-size Transformer (6 layers, hidden size 768, 12 heads) I am getting 35.8 on newstest2013.
So 31.2 looks low. It depends on your data, however.

Sorry for the late reply. I didn't shuffle the data, but would that make such a big difference?
I used OPUS data from multiple sources. I didn't include all of it, and selected quality corpora like UNPC.
Could this have anything to do with my using the SentencePiece vocab? From what I see here, OpenNMT expects a vocab with one token per line, so I converted the SentencePiece vocab to OpenNMT format by removing the scores from the SentencePiece file, which looks like this:

<unk>    0
<s>    0
</s>    0
de    -0
▁a    -1

Do I need to do anything different when using a SentencePiece vocab?
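For reference, this is roughly the conversion I did. A SentencePiece .vocab file has one `token<TAB>score` pair per line, and I kept only the first column. The function name here is mine, just for illustration:

```python
# Strip the score column from SentencePiece .vocab lines so each line
# holds just the token, as OpenNMT-py's --src_vocab/--tgt_vocab expect.
def spm_vocab_to_onmt(lines):
    return [line.rstrip("\n").split("\t")[0] for line in lines]

sample = ["<unk>\t0", "<s>\t0", "</s>\t0", "de\t-0", "\u2581a\t-1"]
print(spm_vocab_to_onmt(sample))
# → ['<unk>', '<s>', '</s>', 'de', '▁a']
```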

Shuffle your data before preprocessing and train again; it should make a difference.
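One thing to be careful about: the source and target files must be shuffled with the same permutation so the line pairs stay aligned. A minimal sketch of one way to do it (file I/O omitted; the helper is just an illustration):

```python
import random

# Shuffle a parallel corpus while keeping src/tgt line pairs aligned.
# A fixed seed makes the shuffle reproducible across both sides.
def shuffle_parallel(src_lines, tgt_lines, seed=1234):
    assert len(src_lines) == len(tgt_lines)
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    src_shuf, tgt_shuf = zip(*pairs)
    return list(src_shuf), list(tgt_shuf)
```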

Thanks for the feedback.