Lower BLEU score compared to Fairseq

Hello, I am trying to do English-to-Spanish translation with 35 million sentences.

I created a 32K vocabulary using sentencepiece.

Fairseq training:
I preprocessed the data (converted it to binary index) following the steps here, using the vocab from sentencepiece, and trained with the command

fairseq-train --fp16 fairseq/data-bin/en-es/ --source-lang en --target-lang es --arch transformer_wmt_en_de --share-all-embeddings --criterion label_smoothed_cross_entropy --optimizer adam --adam-betas '(0.9, 0.98)' --warmup-updates 4000 --save-dir fairseq/checkpoints/en-es/ --no-progress-bar --log-interval 1000 --ddp-backend=no_c10d --clip-norm 0.0 --lr-scheduler inverse_sqrt --lr 0.0007 --label-smoothing 0.1 --max-tokens 4096 --update-freq 8
This uses the transformer_base architecture. After 15 epochs I got a BLEU score of 33.9 on the WMT13 test set.

OpenNMT:
I also tried to do the same here.

  1. Preprocessing with command -
    onmt_preprocess -train_src ../data/parallel_data/training_data/processed_data/bpe/train.en -train_tgt ../data/parallel_data/training_data/processed_data/bpe/ -valid_src ../data/parallel_data/valid_data/bpe/valid.en -valid_tgt ../data/parallel_data/valid_data/bpe/ -save_data data/en-es/ --num_threads 16 --src_vocab ../vocabulary/opennmt_vocab/sentencepiece_en-es.vocab --tgt_vocab ../vocabulary/opennmt_vocab/sentencepiece_en-es.vocab --src_vocab_size 32000 --tgt_vocab_size 32000 --share_vocab

  2. I tried training with multiple commands
    onmt_train -data data/en-es/ -save_model checkpoints/en-es/ -layers 6 -rnn_size 512 -word_vec_size 512 -share_decoder_embeddings -share_embeddings -transformer_ff 2048 -heads 8 -encoder_type transformer -decoder_type transformer -position_encoding -train_steps 75000 --max_generator_batches 2 -dropout 0.1 -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 -valid_steps 5000 -save_checkpoint_steps 5000 -world_size 8 -gpu_ranks 0 1 2 3 4 5 6 7 -log_file log/en-es/en_en.log -exp transformer_base_en_es -report_every 1000

which gave a BLEU of 30.9 ( followed

and changed warmup steps to 4000 and accum_count to 8 and got a BLEU of 31.2
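
To compare the learning-rate settings in the two commands above, a minimal sketch of both schedules may help. It assumes the usual Noam formula for `-decay_method noam` and a linear warmup followed by inverse square root decay for fairseq's `inverse_sqrt`. The function names and the `warmup_init_lr` default of 0 are illustrative assumptions, not code taken from either toolkit.

```python
import math


def noam_lr(step, learning_rate=2.0, model_size=512, warmup_steps=8000):
    """Noam schedule (assumed form): linear warmup, then 1/sqrt(step) decay.

    `step` must be >= 1. Defaults mirror the first onmt_train command.
    """
    scale = min(step ** -0.5, step * warmup_steps ** -1.5)
    return learning_rate * model_size ** -0.5 * scale


def inverse_sqrt_lr(step, peak_lr=0.0007, warmup_updates=4000, warmup_init_lr=0.0):
    """Inverse square root schedule (assumed form).

    Linear warmup from `warmup_init_lr` to `peak_lr`, then decay
    proportional to 1/sqrt(step). Defaults mirror the fairseq-train command.
    """
    if step < warmup_updates:
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    return peak_lr * math.sqrt(warmup_updates / step)


if __name__ == "__main__":
    # Print both schedules side by side at a few training steps.
    for step in (1000, 4000, 8000, 20000, 75000):
        print(
            step,
            f"noam(warmup=8000)={noam_lr(step):.6f}",
            f"noam(warmup=4000)={noam_lr(step, warmup_steps=4000):.6f}",
            f"inverse_sqrt={inverse_sqrt_lr(step):.6f}",
        )
```

Under these assumptions the Noam schedule peaks at roughly 0.00099 with 8000 warmup steps and roughly 0.0014 with 4000, against a peak of 0.0007 in the fairseq run, so the two setups are not using the same learning rate.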

These BLEU scores are lower than fairseq's, and all of them come from the best OpenNMT checkpoint (the one with the lowest validation perplexity). All of these training runs were done on 8 GPUs.
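
Since all runs used 8 GPUs, the batch settings in the commands above can be compared as tokens per optimizer update. The following is a rough sketch that assumes the token limit (`--max-tokens` / `-batch_size` with `-batch_type tokens`) applies per GPU and that every batch is filled to the limit, so the results are upper bounds; the helper name is hypothetical.

```python
def tokens_per_update(max_tokens_per_gpu, accumulation, num_gpus):
    """Upper bound on tokens contributing to one optimizer update.

    Assumes the token limit is per GPU and that gradients are accumulated
    over `accumulation` batches on each of `num_gpus` devices.
    """
    return max_tokens_per_gpu * accumulation * num_gpus


# fairseq: --max-tokens 4096 --update-freq 8, 8 GPUs
fairseq_tokens = tokens_per_update(4096, 8, 8)

# OpenNMT first run: -batch_size 4096 -accum_count 2, 8 GPUs
onmt_first_tokens = tokens_per_update(4096, 2, 8)

# OpenNMT second run: accum_count changed to 8
onmt_second_tokens = tokens_per_update(4096, 8, 8)

if __name__ == "__main__":
    print("fairseq:", fairseq_tokens)
    print("OpenNMT (accum_count 2):", onmt_first_tokens)
    print("OpenNMT (accum_count 8):", onmt_second_tokens)
```

By this estimate the first OpenNMT run saw at most 65,536 tokens per update versus 262,144 for fairseq, and only the second run matches fairseq's effective batch size.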

I observed a similar pattern on a different task as well (OpenNMT-py scoring a bit lower than fairseq). Is there any reason why it is lower than fairseq, or am I missing something?

Thanks in advance.

Did you shuffle your dataset before preprocessing?

Just FYI, with a medium-size Transformer (6 layers, 768, 12 heads) I am getting 35.8 on NT13.
So the 31.2 looks low. It depends on your data, however.

Sorry for the late reply. I didn't shuffle the data, but is that going to make that much of a difference?
I used OPUS data from multiple sources. I didn't add all of it, however, and selected the quality ones like UNPC etc.
Does this have anything to do with my using a sentencepiece vocab? From what I see here, OpenNMT expects a vocab with one token on each line, so I converted the sentencepiece vocab to OpenNMT format by removing the indexes in the sentencepiece file below

<unk>    0
<s>    0
</s>    0
de    -0
▁a    -1
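
The conversion described above can be sketched as follows. This assumes each line of the sentencepiece `.vocab` file is a piece followed by a tab (or spaces) and a number, as in the sample. Dropping `<unk>`, `<s>` and `</s>` is an assumption made on the expectation that the toolkit adds its own special tokens; pass `specials=()` to keep them. The function name is hypothetical.

```python
import re


def sentencepiece_vocab_to_tokens(lines, specials=("<unk>", "<s>", "</s>")):
    """Return one token per entry from sentencepiece .vocab lines.

    Each input line is expected to look like "<piece><TAB><number>".
    Entries listed in `specials` are skipped (an assumption, see above).
    """
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Keep only the piece; discard the trailing number column.
        piece = re.split(r"[\t ]+", line, maxsplit=1)[0]
        if piece in specials:
            continue
        tokens.append(piece)
    return tokens


def convert_vocab_file(spm_vocab_path, out_path):
    """Write a one-token-per-line vocab file from a sentencepiece .vocab file."""
    with open(spm_vocab_path, encoding="utf-8") as f:
        tokens = sentencepiece_vocab_to_tokens(f)
    with open(out_path, "w", encoding="utf-8") as f:
        for token in tokens:
            f.write(token + "\n")
```

Calling `convert_vocab_file` with the sentencepiece `.vocab` path and an output path would produce the one-token-per-line file; both paths are up to the user.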

Should I do something different when using a sentencepiece vocab?

Shuffle your data before preprocessing and train again; it should make a difference.
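
One way to do the shuffling is sketched below. The key point is that source and target lines must be permuted together so the sentence pairs stay aligned. This is a minimal in-memory sketch with hypothetical function names and an arbitrary seed; for tens of millions of lines it needs several gigabytes of RAM, and an on-disk approach may be preferable.

```python
import random


def shuffle_parallel(src_lines, tgt_lines, seed=1234):
    """Shuffle two aligned lists of lines with the same permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    pairs = list(zip(src_lines, tgt_lines))
    random.Random(seed).shuffle(pairs)
    if not pairs:
        return [], []
    src, tgt = zip(*pairs)
    return list(src), list(tgt)


def shuffle_parallel_files(src_in, tgt_in, src_out, tgt_out, seed=1234):
    """Read two aligned text files, shuffle them jointly, and write the result."""
    with open(src_in, encoding="utf-8") as f:
        src_lines = f.readlines()
    with open(tgt_in, encoding="utf-8") as f:
        tgt_lines = f.readlines()
    src_lines, tgt_lines = shuffle_parallel(src_lines, tgt_lines, seed)
    with open(src_out, "w", encoding="utf-8") as f:
        f.writelines(src_lines)
    with open(tgt_out, "w", encoding="utf-8") as f:
        f.writelines(tgt_lines)
```

The shuffled files would then be passed to preprocessing in place of the original training files.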

Thanks for the feedback.

Hi, I tried multiple experiments with the dataset shuffled as you suggested.
But even with transformer_big, the best BLEU score I could get is 32. I tried the hyperparameters from here: Big Transformer model parameters.
Can you share the command you used to train, and how many steps you trained for?
Did you use the same data given in WMT 13, or any additional data?