Different inference results between CTranslate2 and checkpoint

Hello,
I get different translation results when using a checkpoint and its converted CTranslate2 model.
The inference from the checkpoint:

onmt-main --config data.yml --auto_config --checkpoint_path run/ckpt-5000 infer --features_file src-val.txt --predictions_file pred.txt

Converting the checkpoint to a ctranslate2 model:

ct2-opennmt-tf-converter --model_path model/ --model_spec TransformerBaseRelative --output_dir model_ctranslate2 --src_vocab src-vocab.txt --tgt_vocab tgt-vocab.txt --quantization int16

The inference from the CTranslate2 model:

translator = ctranslate2.Translator(app.root_path + "/ctranslate_model")
text_list = translator.translate_batch(source_list)
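(Side note for anyone reproducing this: `translate_batch` works on pre-tokenized pieces, so the hypothesis tokens still need detokenizing before scoring. A minimal sketch, assuming SentencePiece-style `▁` word-boundary markers as in the outputs below; the helper name is illustrative:)

```python
# Sketch: join SentencePiece pieces back into plain text.
# Assumes "▁" (U+2581) marks word boundaries.
def detokenize(pieces):
    return "".join(pieces).replace("\u2581", " ").strip()

print(detokenize(["▁в", "▁конце", "▁апреля", "▁пройдет", "▁финал", "▁."]))
# в конце апреля пройдет финал .
```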

Translation result:
source sentence:
▁афинал ▁мшаԥымза ▁анҵәамҭазы ▁имҩаԥысраны ▁иҟоуп ▁.
Target result with checkpoint inference:
▁в ▁конце ▁апреля ▁пройдет ▁финал ▁.
Target result with CTranslate2 inference:
▁несов мести ла ▁друг ▁мел ью ▁мел кий ▁и ▁пропове емо ▁, ▁заня ла ▁благой ▁вестью ▁, ▁заня ла ▁также ▁самим ▁, ▁заня ющим ▁благой ▁вестью ▁и ▁деньги ▁, ▁заня ла ▁выпуски ▁деньги ▁и ▁деньги ▁, ▁заня ла ▁для ▁деньги
What am I missing?!

Can you try exporting from OpenNMT-tf?

onmt-main --config data.yml --auto_config --checkpoint_path run/ckpt-5000 export --export_dir model_ctranslate2 --export_format ctranslate2

I got the same result.

First, make sure to use the latest version of CTranslate2.

Are you sure you are comparing the same model? Are all translation results different?

I will double check and let you know.

Please, disregard the previous result.

This time I used a 15k-step checkpoint and the matching 15k-step CTranslate2 model, and translated 50 sentences. Even though the tokenized BLEU scores are close:
BLEU+tok+checkpoint = 22.92, 53.4/31.8/22.9/17.4 (BP=0.800, ratio=0.817, hyp_len=1426, ref_len=1745)
BLEU+tok+ctranslate2 = 22.76, 48.8/27.4/20.1/16.2 (BP=0.886, ratio=0.892, hyp_len=1556, ref_len=1745)
There are differences in the translations that concern me.
Should I expect such differences?
Here is a link to reproduce the results: 15k model and test data
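(As a sanity check on the numbers above: the two scores can be recombined from the printed n-gram precisions and brevity penalty using the standard BLEU formula, score = BP · exp(mean of log n-gram precisions). This is just the textbook formula, not anything model-specific; small drift is expected from rounding in the printed precisions:)

```python
import math

# Recombine a BLEU score from its reported components:
# BLEU = BP * exp(mean of the log n-gram precisions).
def bleu_from_parts(bp, precisions_pct):
    log_mean = sum(math.log(p / 100) for p in precisions_pct) / len(precisions_pct)
    return 100 * bp * math.exp(log_mean)

print(bleu_from_parts(0.800, [53.4, 31.8, 22.9, 17.4]))  # checkpoint; reported 22.92
print(bleu_from_parts(0.886, [48.8, 27.4, 20.1, 16.2]))  # ctranslate2; reported 22.76
```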

Is this with int16 quantization as used in the first post? If yes, differences are to be expected.

Also make sure to use the same beam size: translator.translate_batch(source_list, beam_size=4).

No, I used this:

onmt-main --config data.yml --auto_config --checkpoint_path run/ckpt-15000 export --export_dir model_ctranslate2 --export_format ctranslate2

I’ll try the beam_size.

I added beam_size=4, and it looks like the score got worse.

BLEU+tok+checkpoint = 22.92, 53.4/31.8/22.9/17.4 (BP=0.800, ratio=0.817, hyp_len=1426, ref_len=1745)
BLEU+tok+ctranslate2 = 20.37, 49.4/27.6/19.7/15.2 (BP=0.806, ratio=0.823, hyp_len=1436, ref_len=1745)
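(One observation on these reports: the brevity penalties follow directly from the hypothesis/reference lengths via the standard BLEU definition, BP = exp(1 − ref_len/hyp_len) when the hypothesis is shorter, so the gap comes from the n-gram precisions rather than from length behavior. A quick check:)

```python
import math

# Standard BLEU brevity penalty: 1.0 if hyp >= ref, else exp(1 - ref/hyp).
def brevity_penalty(hyp_len, ref_len):
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)

print(f"{brevity_penalty(1426, 1745):.3f}")  # 0.800 (checkpoint)
print(f"{brevity_penalty(1436, 1745):.3f}")  # 0.806 (ctranslate2)
```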

That’s strange. What parameters did you use in OpenNMT-tf besides --auto_config?

These are the parameters I am using during the conversion to CTranslate2:

INFO:tensorflow:Using parameters:
data:
  eval_features_file: src-val.txt
  eval_labels_file: tgt-val.txt
  source_vocabulary: src-vocab.txt
  target_vocabulary: tgt-vocab.txt
  train_features_file: src-train.txt
  train_labels_file: tgt-train.txt
eval:
  batch_size: 32
  batch_type: examples
  length_bucket_width: 5
infer:
  batch_size: 32
  batch_type: examples
  length_bucket_width: 5
model_dir: run/
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 500000
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100