Getting poor BLEU Scores!

Hello there,

I’m working with the WMT18 corpora and test sets.

With a Turkish-English corpus of about 200K sentences I got a BLEU score of about 12, which I think is normal because the corpus is very small.
With a German-English corpus of about 2 million sentences I got a BLEU score of about 23, and again I think that is normal because the corpus is not very big for NMT. With the Moses (SMT) toolkit I got a BLEU score of 22.6.

Now I want to get higher BLEU scores (greater than 30), but I couldn’t manage it, whatever I tried.

First I started by selecting a bigger corpus (5 million, 7 million sentences, etc., in different languages), but with the classical word-level approach I ran into many “unk” tokens. This is normal, because there are now far more distinct words. But when I increased the vocabulary size (from 50K to 100K, for example), I got a “CUDA out of memory” error.

Then I decided to use BPE techniques. I tried SentencePiece, subword-nmt, etc., but I still cannot get good BLEU scores. Every time the scores were around 20.

I suppose I’m making a mistake somewhere. Can you give me any advice on what I can do to get higher BLEU scores?

Thanks in advance.

Give more info.
What corpus did you use for de-en?
Post your command lines for SentencePiece and for preprocessing + training.


I used several different corpora, for example:

de->en: Europarl + CommonCrawl + EU Press Release (all together): about 3.7 million sentences
de->en: Europarl + News Commentary + EU Press Release (all together): about 3.5 million sentences
de->fr (from WMT19): ParaCrawl: about 7 million sentences

But all the results are poor.

First of all, I tokenize and truecase all my files with the Moses tools.
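For reference, that Moses preprocessing step looks roughly like this. This is a sketch, assuming a `mosesdecoder/scripts` checkout; the corpus file names follow the ones used later in this thread:

```shell
# Tokenize the English side (repeat with -l de for the German side)
mosesdecoder/scripts/tokenizer/tokenizer.perl -l en \
  < train-en-de.en > train-en-de.tok.en

# Train a truecasing model on the tokenized corpus
mosesdecoder/scripts/recaser/train-truecaser.perl \
  --model truecase-model.en --corpus train-en-de.tok.en

# Apply the truecaser
mosesdecoder/scripts/recaser/truecase.perl --model truecase-model.en \
  < train-en-de.tok.en > train-en-de.tc.en
```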

My command lines for SentencePiece look like this:

spm_train --input=train-en-de.en --model_prefix=model-en --vocab_size=32000 --model_type=bpe
spm_encode --model=model-en.model --output_format=piece < train-en-de.en > train-en-de-piece.en
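Note that the commands above build a separate model per language. A common alternative (a sketch, with hypothetical file names) is to train one shared model on both sides, since `spm_train` accepts a comma-separated list of input files:

```shell
# One shared BPE model and vocabulary over both languages
spm_train --input=train-en-de.en,train-en-de.de \
  --model_prefix=model-ende --vocab_size=32000 --model_type=bpe

# Encode each side with the same shared model
spm_encode --model=model-ende.model --output_format=piece < train-en-de.en > train-en-de-piece.en
spm_encode --model=model-ende.model --output_format=piece < train-en-de.de > train-en-de-piece.de
```

A shared vocabulary is often helpful for related language pairs because identical subwords on both sides map to the same pieces.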

Then I used these encoded training files (I encoded the validation files too) to train my model.

For preprocessing with OpenNMT:

python3 preprocess.py -train_src mydata/train-en-de-piece.en -train_tgt mydata/ -valid_src mydata/dev-en-de-piece.en -valid_tgt mydata/ -save_data mydata/preprocessed-en-de-piece -src_seq_length 100 -tgt_seq_length 100 -src_vocab_size 32000 -tgt_vocab_size 32000

For training:

python3 train.py -data mydata/preprocessed-en-de-piece -save_model mydata/model-en-de-piece -batch_size 32 -gpu_ranks 0

After training, I also encoded my test file (newstest2018.en) with SentencePiece and translated it with the model:

python3 translate.py -model mytranslate/ -src mytranslate/test-en-de-en-piece.en -output mytranslate/ -replace_unk -verbose -gpu 0

After that I decode the pred file (with spm_decode) and, after detokenizing it, evaluate the result with sacreBLEU:

cat myevaluate/ | sacrebleu myevaluate/
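For completeness, that last stage could look like this (a sketch with hypothetical file names; note that sacreBLEU can also fetch the official WMT reference itself via `-t`/`-l`, which avoids reference-tokenization mistakes):

```shell
# Undo the SentencePiece segmentation on the model output
spm_decode --model=model-en.model --input_format=piece < pred-piece.txt > pred.txt

# Detokenize with the Moses detokenizer (target language assumed German here)
mosesdecoder/scripts/tokenizer/detokenizer.perl -l de < pred.txt > pred.detok.txt

# Score against the official newstest2018 reference
cat pred.detok.txt | sacrebleu -t wmt18 -l en-de
```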

I’ve done lots of tests, so maybe I copied some commands wrongly, but this is the summary.


OK, by default you are training a very small LSTM model.
Read this:
“How do I train a Transformer?”
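The base-Transformer recipe from the OpenNMT-py FAQ looks roughly like the following (shown here as a single-GPU sketch with the data/model paths from this thread substituted in; exact flags may differ between OpenNMT-py versions):

```shell
python3 train.py -data mydata/preprocessed-en-de-piece -save_model mydata/model-transformer \
  -layers 6 -rnn_size 512 -word_vec_size 512 -transformer_ff 2048 -heads 8 \
  -encoder_type transformer -decoder_type transformer -position_encoding \
  -train_steps 200000 -max_generator_batches 2 -dropout 0.1 \
  -batch_size 4096 -batch_type tokens -normalization tokens -accum_count 2 \
  -optim adam -adam_beta2 0.998 -decay_method noam -warmup_steps 8000 -learning_rate 2 \
  -max_grad_norm 0 -param_init 0 -param_init_glorot -label_smoothing 0.1 \
  -valid_steps 10000 -save_checkpoint_steps 10000 -world_size 1 -gpu_ranks 0
```

The important differences from the default LSTM run above are token-based batching, the Noam learning-rate schedule with warmup, and label smoothing.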

So you are saying that the poor results are because of the small LSTM model.

But I would expect that when the corpus is bigger than the previous ones, the results should be better.

Anyway, I’m going to try the Transformer and post the results here.

Hi @emresatir, for Turkish-English I would definitely suggest SentencePiece -> Transformer. I can’t locate my precise BLEU scores in my notebook at present, but they were certainly in the mid-40s for both directions. You can find this language pair on my demo site. Good luck :slight_smile:

Hi @tel34

Thanks for your reply and suggestion.

Now I’m trying the Transformer on a de->en system. After that I’ll try Turkish-English.

Just found my notes. Using the Transformer (100K steps) I got 59.18 for English-Turkish.

Terence, when you give a score like this, it’s always better to be very specific: give the test set you used, say whether it’s an extract of the training data (and if so, what your training data are) or a public test set, and if the latter, how it compares to previous publications.
Otherwise, it’s quite meaningless.

Sorry, I should have sent this as a private message as it was just intended as a side comment. I can see that it is not of any real use for general information purposes.

Don’t get me wrong, I like scores! I’d just like to know what they relate to…

OK - I’ll give a proper account when I get to my desk tomorrow :slight_smile:

Maybe you can score newstest2018 and compare to this:

Hi all,

I got a BLEU score of 28.3 with the Europarl corpus (about 2 million sentences) for de->en using the Transformer. It was 22.9 before with the default small LSTM model. That’s quite a bit better, I suppose. Maybe I can get a higher score with a larger corpus.

I first encoded the files with SentencePiece (BPE, 16K vocabulary for both sides). After preprocessing with OpenNMT I trained my model with the parameters here (but I had to halve the batch size, i.e. 2048, because of a CUDA out of memory error, and I trained for only 100K steps instead of 200K).
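If memory forces a smaller `-batch_size`, one option (assuming a recent OpenNMT-py) is to compensate with gradient accumulation so the effective number of tokens per optimizer step stays the same. A sketch:

```shell
# 2048 tokens per forward pass, gradients accumulated over 4 batches:
# roughly the same tokens per update as batch_size 4096 with accum_count 2
python3 train.py -data mydata/preprocessed-en-de-piece -save_model mydata/model-transformer \
  -batch_size 2048 -batch_type tokens -accum_count 4
# (other Transformer flags as in the FAQ recipe)
```

This keeps the learning-rate schedule behaving as intended, since the Noam schedule was tuned for large effective batches.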

Thanks for your help.