Some time ago I changed my SentencePiece vocabulary size to 31999, because I was convinced it would speed up training with AMP (spoiler: it didn't). So now I have two different models, one with a SentencePiece vocabulary size of 31999 and another with 32000.
If I understand correctly, OpenNMT-tf adds 1 more token for unk, so the sizes are 32000 and 32001 respectively.
When I try to export the 32000 model to CTranslate2, it works fine, but for the 31999 model I get an error:
ValueError: Source vocabulary 0 has size 31999 but the model expected a vocabulary of size 32000
I also tried exporting to the saved_model format, and that works fine.

OpenNMT-tf 2.31.0
TensorFlow 2.10.0
CTranslate2 3.9.1

Maybe the model config would be useful as well

train:
  save_checkpoints_steps: 250
  keep_checkpoint_max: 100
  save_summary_steps: 50
  effective_batch_size: null
  batch_size: 16384
  average_last_checkpoints: 8
  sample_buffer_size: 100000000

eval:
  steps: 250
  scorers:
    - BLEU
    - chrf
  export_on_best: BLEU
  early_stopping:
    metric: bleu
    min_improvement: 0.1
    steps: 10

params:
  learning_rate: 1.0

model_dir: *model_path*

data:
  train_features_file: *training_features_path*
  train_labels_file: *training_labels_path*
  eval_features_file: *eval_feature_path*
  eval_labels_file: *eval_labels_path*
  source_vocabulary: *vocab_path*
  target_vocabulary: *vocab_path*
  source_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: *model_path*
  target_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: *model_path*

Most likely the <unk> token is included in the vocabulary file. Can you check that?

As you mentioned, OpenNMT-tf automatically adds an entry for out of vocabulary tokens (i.e. all the tokens that do not appear in the vocabulary file).

If the vocabulary file contains <unk>, the CTranslate2 converter will ignore the additional entry, which causes the size mismatch.
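
For anyone wanting to run that check, here is a minimal sketch. It assumes the vocabulary file is plain text with one token per line; the function name `inspect_vocab` and the command-line usage are made up for illustration and are not part of OpenNMT-tf or CTranslate2.

```python
import sys


def inspect_vocab(vocab_path, token="<unk>"):
    """Count the entries in a one-token-per-line vocabulary file
    and how many of them are exactly `token`."""
    entries = 0
    token_count = 0
    with open(vocab_path, encoding="utf-8") as vocab_file:
        for line in vocab_file:
            entries += 1
            # Compare the whole line, so that a token merely
            # containing "<unk>" as a substring is not counted.
            if line.rstrip("\r\n") == token:
                token_count += 1
    return {"entries": entries, "unk_count": token_count}


if __name__ == "__main__":
    # Hypothetical usage: python inspect_vocab.py path/to/vocab.txt
    report = inspect_vocab(sys.argv[1])
    print(f"{report['entries']} entries, {report['unk_count']} of them <unk>")
```

If `unk_count` is 1 or more, the file already contains <unk> in addition to the entry that OpenNMT-tf adds automatically.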

A quick workaround is to rename the <unk> token in the vocabulary to something else like <unknown>.
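
The rename can be done by hand, but as a rough sketch it could also be scripted like this. It again assumes a one-token-per-line vocabulary file; `rename_token` and the file paths are hypothetical names, and the script writes a new file rather than modifying the original.

```python
import sys


def rename_token(src_path, dst_path, old="<unk>", new="<unknown>"):
    """Copy a one-token-per-line vocabulary file, replacing every line
    that is exactly `old` with `new`. Returns the number of renames."""
    renamed = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            token = line.rstrip("\r\n")
            if token == old:
                token = new
                renamed += 1
            dst.write(token + "\n")
    return renamed


if __name__ == "__main__":
    # Hypothetical usage: python rename_token.py vocab.txt vocab.renamed.txt
    count = rename_token(sys.argv[1], sys.argv[2])
    print(f"renamed {count} entries")
```

The line count of the file is unchanged, so only the name of the entry differs; the configuration would then need to point at the renamed file before running the conversion again.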

Yes, I checked, and there are indeed 2 <unk> entries: one from SentencePiece and one added by the OpenNMT-tf vocabulary building.
However, I still don't understand why different vocabulary sizes give different results. Is there a hardcoded size of 32000 in the CTranslate2 conversion?
I mean, if the token gets removed, both cases should result in a mismatch.

Can you share the 2 vocabularies?

Unfortunately, sharing the vocabularies is not possible, since they are proprietary.
But the imprint is the following


and one of them has length 32000 and the other 32001 (after OpenNMT-tf; before that, they are 31999 and 32000 respectively).
Is there any way I can give you more context without sharing the whole thing?

@guillaumekln renaming <unk> to <blank> totally worked