Different inference result between ctranslate2 and checkpoint

Nart · July 8, 2020, 10:56am

Hello,
I get a different translation result when using a checkpoint and its converted ctranslate2 model.
The inference from the checkpoint:

onmt-main --config data.yml --auto_config --checkpoint_path run/ckpt-5000 infer --features_file src-val.txt --predictions_file pred.txt

Converting the checkpoint to a ctranslate2 model:

ct2-opennmt-tf-converter --model_path model/ --model_spec TransformerBaseRelative --output_dir model_ctranslate2 --src_vocab src-vocab.txt --tgt_vocab tgt-vocab.txt --quantization int16

The inference from the ctranslate2 model:

translator = ctranslate2.Translator(app.root_path + "/ctranslate_model")
text_list = translator.translate_batch(source_list)

Translation result:
source sentence:
▁афинал ▁мшаԥымза ▁анҵәамҭазы ▁имҩаԥысраны ▁иҟоуп ▁.
Target result with checkpoint inference:
▁в ▁конце ▁апреля ▁пройдет ▁финал ▁.
Target result with ctranslate2 inference:
▁несов мести ла ▁друг ▁мел ью ▁мел кий ▁и ▁пропове емо ▁, ▁заня ла ▁благой ▁вестью ▁, ▁заня ла ▁также ▁самим ▁, ▁заня ющим ▁благой ▁вестью ▁и ▁деньги ▁, ▁заня ла ▁выпуски ▁деньги ▁и ▁деньги ▁, ▁заня ла ▁для ▁деньги
What am I missing?!

guillaumekln · July 8, 2020, 11:01am

Can you try exporting from OpenNMT-tf?

onmt-main --config data.yml --auto_config --checkpoint_path run/ckpt-5000 export --export_dir model_ctranslate2 --export_format ctranslate2

Nart · July 8, 2020, 11:08am

I got the same result.

guillaumekln · July 8, 2020, 12:25pm

First, make sure to use the latest version of ctranslate2.

Are you sure you are comparing the same model? Are all translation results different?
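
One way to answer the second question is to compare the two prediction files line by line. The following is a minimal sketch; the function name and the file name `pred_ct2.txt` are hypothetical, and it assumes both files contain one translation per line in the same order.

```python
def compare_predictions(lines_a, lines_b):
    """Return (line_number, a, b) for every line where the two outputs differ."""
    if len(lines_a) != len(lines_b):
        raise ValueError("prediction files have different numbers of lines")
    return [
        (number, a.strip(), b.strip())
        for number, (a, b) in enumerate(zip(lines_a, lines_b), start=1)
        if a.strip() != b.strip()
    ]


def read_lines(path):
    """Read a prediction file as a list of lines."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

For example, `compare_predictions(read_lines("pred.txt"), read_lines("pred_ct2.txt"))` lists only the sentences whose translations differ, which shows whether the mismatch affects every line or just a few.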

Nart · July 8, 2020, 12:30pm

I will double check and let you know.

Nart · July 9, 2020, 9:57am

Please, disregard the previous result.

This time I used the 15k-step checkpoint and the 15k-step ctranslate2 model, and I translated 50 sentences. The tokenized BLEU scores are close:
BLEU+tok+checkpoint = 22.92, 53.4/31.8/22.9/17.4 (BP=0.800, ratio=0.817, hyp_len=1426, ref_len=1745)
BLEU+tok+ctranslate2 = 22.76, 48.8/27.4/20.1/16.2 (BP=0.886, ratio=0.892, hyp_len=1556, ref_len=1745)
However, there are differences in the translations that concern me.
Should I be expecting such differences?
Here is a link to reproduce the results: 15k model and test data
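
The two score lines above can be sanity-checked from the numbers they report. This is a minimal sketch assuming the standard BLEU definition (brevity penalty times the geometric mean of the four n-gram precisions); the function names are illustrative and not taken from any scoring tool.

```python
import math


def brevity_penalty(hyp_len, ref_len):
    """Standard BLEU brevity penalty: 1 if the hypothesis is long enough."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)


def bleu_from_stats(precisions, hyp_len, ref_len):
    """Recombine n-gram precisions (in percent) and lengths into a BLEU score."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(hyp_len, ref_len) * math.exp(log_mean)


checkpoint = bleu_from_stats([53.4, 31.8, 22.9, 17.4], hyp_len=1426, ref_len=1745)
ct2 = bleu_from_stats([48.8, 27.4, 20.1, 16.2], hyp_len=1556, ref_len=1745)
print(f"checkpoint:  {checkpoint:.2f}")
print(f"ctranslate2: {ct2:.2f}")
```

Both values should match the reported scores up to the rounding of the displayed precisions. The breakdown also shows why the totals are close: the ctranslate2 output is 130 tokens longer, so its higher brevity penalty factor offsets its lower n-gram precisions. A length gap like this may point to different decoding settings rather than different weights.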

guillaumekln · July 9, 2020, 10:02am

Is this with int16 quantization as used in the first post? If yes, differences are to be expected.
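
To illustrate why quantization alone can change outputs, here is a minimal sketch of a simple symmetric int16 round trip. This is an assumption for illustration only and is not necessarily the scheme CTranslate2 uses; the weights are made-up values.

```python
def quantize_int16(values):
    """Map floats to int16 using a single scale derived from the largest magnitude."""
    amax = max(abs(v) for v in values)
    scale = 32767 / amax if amax else 1.0
    return [round(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Map quantized integers back to floats."""
    return [q / scale for q in quantized]


weights = [0.12345678, -0.5, 0.00001234, 0.99999]  # illustrative values
quantized, scale = quantize_int16(weights)
restored = dequantize(quantized, scale)

# Rounding loses up to half a quantization step per weight.
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
print(f"largest round-trip error: {max_error:.2e}")
```

Each weight is off by at most half a quantization step, but these small errors accumulate through the layers and can flip close decisions during beam search.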

Also make sure to use the same beam size, translator.translate_batch(source_list, beam_size=4).
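
A minimal sketch of a call with a matching beam size follows. The helper name `translate_lines` is hypothetical. It assumes the input lines are already tokenized with spaces, and that the installed CTranslate2 release returns result objects exposing `.hypotheses`; older releases returned lists of dictionaries with a `"tokens"` key instead, so the last line may need adapting.

```python
def translate_lines(translator, lines, beam_size=4):
    """Translate pre-tokenized lines and return the best hypothesis for each."""
    # translate_batch expects a list of token lists, not raw strings.
    batch = [line.split() for line in lines]
    results = translator.translate_batch(batch, beam_size=beam_size)
    # Assumption: each result exposes .hypotheses, best hypothesis first.
    return [" ".join(result.hypotheses[0]) for result in results]
```

With a real model this would be called as `translate_lines(ctranslate2.Translator("model_ctranslate2"), lines)`, with `beam_size` set to the same value as `beam_width` in the OpenNMT-tf configuration.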

Nart · July 9, 2020, 10:08am

No, I used this:

onmt-main --config data.yml --auto_config --checkpoint_path run/ckpt-15000 export --export_dir model_ctranslate2 --export_format ctranslate2

I’ll try the beam_size.

Nart · July 9, 2020, 10:33am

I added beam_size=4, and it looks like it got worse.

BLEU+tok+checkpoint = 22.92, 53.4/31.8/22.9/17.4 (BP=0.800, ratio=0.817, hyp_len=1426, ref_len=1745)
BLEU+tok+ctranslate2 = 20.37, 49.4/27.6/19.7/15.2 (BP=0.806, ratio=0.823, hyp_len=1436, ref_len=1745)

guillaumekln · July 9, 2020, 10:40am

That’s strange. What parameters did you use in OpenNMT-tf besides --auto_config?

Nart · July 9, 2020, 10:46am

These are the parameters I am using during the conversion to ctranslate2:

INFO:tensorflow:Using parameters:
data:
  eval_features_file: src-val.txt
  eval_labels_file: tgt-val.txt
  source_vocabulary: src-vocab.txt
  target_vocabulary: tgt-vocab.txt
  train_features_file: src-train.txt
  train_labels_file: tgt-train.txt
eval:
  batch_size: 32
  batch_type: examples
  length_bucket_width: 5
infer:
  batch_size: 32
  batch_type: examples
  length_bucket_width: 5
model_dir: run/
params:
  average_loss_in_time: true
  beam_width: 4
  decay_params:
    model_dim: 512
    warmup_steps: 8000
  decay_type: NoamDecay
  label_smoothing: 0.1
  learning_rate: 2.0
  num_hypotheses: 1
  optimizer: LazyAdam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
score:
  batch_size: 64
train:
  average_last_checkpoints: 8
  batch_size: 3072
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 500000
  maximum_features_length: 100
  maximum_labels_length: 100
  sample_buffer_size: -1
  save_summary_steps: 100