Error converting model to CTranslate2

Some time ago I changed my vocabulary size to 31999 in SentencePiece, because I was convinced it would speed up training with AMP (spoiler: it didn’t). But now I have two different models, one with a 31999 and one with a 32000 SentencePiece vocabulary size.
If I understand correctly, OpenNMT-tf adds one more token for <unk>, so the sizes are 32000 and 32001 respectively.
When I try to export the 32000 model to CTranslate2 it works fine, but for the 31999 model I get an error:
ValueError: Source vocabulary 0 has size 31999 but the model expected a vocabulary of size 32000
I also tried converting to the saved_model format, and that works fine.

OpenNMT-tf 2.31.0
TensorFlow 2.10.0
CTranslate2 3.9.1

Maybe the model config would be useful as well:


train:
  save_checkpoints_steps: 250
  keep_checkpoint_max: 100
  save_summary_steps: 50
  effective_batch_size: null
  batch_size: 16384
  average_last_checkpoints: 8
  sample_buffer_size: 100000000


eval:
  steps: 250
  external_evaluators:
    - BLEU
    - chrf
  export_on_best: BLEU
  early_stopping:
    metric: bleu
    min_improvement: 0.1
    steps: 10

params:
  learning_rate: 1.0
model_dir:  *model_path*
data:
  train_features_file:  *training_features_path*
  train_labels_file: *training_labels_path*
  eval_features_file:  *eval_feature_path*
  eval_labels_file:  *eval_labels_path*
  source_vocabulary: *vocab_path*
  target_vocabulary:  *vocab_path*
  source_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: *model_path*

  target_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: *model_path*
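
For reference, this is roughly how I compare the sizes on my side. It is just a quick sketch, and spm.model and vocab.txt are placeholders for my actual paths:

import sentencepiece as spm

# Size reported by the SentencePiece model itself.
sp = spm.SentencePieceProcessor(model_file="spm.model")  # placeholder path
print("SentencePiece vocab size:", sp.vocab_size())

# Number of entries in the vocabulary file given to OpenNMT-tf.
with open("vocab.txt", encoding="utf-8") as f:  # placeholder path
    entries = [line.rstrip("\n") for line in f if line.strip()]
print("Vocabulary file entries:", len(entries))

# OpenNMT-tf adds one more entry for out-of-vocabulary tokens,
# so the model itself ends up with len(entries) + 1 tokens.
print("Size expected by the model:", len(entries) + 1)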

Most likely the <unk> token is included in the vocabulary file. Can you check that?

As you mentioned, OpenNMT-tf automatically adds an entry for out-of-vocabulary tokens (i.e. all the tokens that do not appear in the vocabulary file).

If the vocabulary file contains <unk>, the CTranslate2 converter will ignore the additional entry, causing the size mismatch.

A quick workaround is to rename the <unk> token in the vocabulary to something else like <unknown>.
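
For example, something along these lines would do it; vocab.txt and vocab_fixed.txt are just placeholder names:

# Rename the <unk> entry in the vocabulary file so it no longer collides
# with the entry that OpenNMT-tf adds automatically.
with open("vocab.txt", encoding="utf-8") as f:  # placeholder path
    lines = f.read().splitlines()

lines = ["<unknown>" if line == "<unk>" else line for line in lines]

with open("vocab_fixed.txt", "w", encoding="utf-8") as f:  # placeholder path
    f.write("\n".join(lines) + "\n")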

Yes, I did check, and there are indeed two <unk> entries, one from SentencePiece and one added by the OpenNMT-tf vocab building.
However, I still don’t understand how different vocab sizes lead to different results. Is there a hardcoded size of 32000 in the CTranslate2 conversion?
I mean, if the token gets removed, both cases should result in a mismatch.
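
This is roughly how I checked for the duplicate <unk>, where vocab.txt stands for my actual vocabulary file:

# Count how many lines of the vocabulary file are exactly "<unk>".
with open("vocab.txt", encoding="utf-8") as f:  # placeholder path
    count = sum(1 for line in f if line.strip() == "<unk>")
print("<unk> entries:", count)  # prints 2 in my case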

Can you share the 2 vocabularies?

Unfortunately sharing the vocabs is not possible, since they are proprietary.
But the general layout is the following:

<unk>
...
...
<unk>

One of them has 32000 entries and the other 32001 (after OpenNMT-tf adds its entry; before that they have 31999 and 32000 respectively).
Is there any way I can give you more context without sharing the whole thing?

@guillaumekln Renaming <unk> to <blank> totally worked.