A while ago I changed my SentencePiece vocabulary size to 31999, because I was convinced it would speed up training with AMP (spoiler: it didn't). So now I have two different models, one trained with a 31999 and another with a 32000 SentencePiece vocabulary.
If I understand correctly, OpenNMT-tf adds one more token for <unk>, so the actual vocabulary sizes are 32000 and 32001 respectively.
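To double-check the numbers, here is a minimal sketch of how the sizes can be compared (paths are placeholders, and the +1 for <unk> follows my understanding above):

import sentencepiece as spm

# Placeholder paths: the SentencePiece model and the vocabulary file
# referenced by source_vocabulary/target_vocabulary in the config.
sp = spm.SentencePieceProcessor(model_file="sp.model")

with open("vocab.txt", encoding="utf-8") as f:
    vocab_entries = sum(1 for _ in f)

print("SentencePiece pieces:", sp.get_piece_size())     # 31999 or 32000
print("Vocabulary file entries:", vocab_entries)
print("Expected model vocabulary:", vocab_entries + 1)  # +1 for <unk> added by OpenNMT-tf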
When I try to export the 32000 model to CTranslate2 it works fine, but for the 31999 model I get this error:
ValueError: Source vocabulary 0 has size 31999 but the model expected a vocabulary of size 32000
I also tried converting that model to the saved_model format, and that works fine.
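For context, a standalone export invocation looks roughly like this (a sketch only; config.yml is a placeholder and the exact flag names may differ between OpenNMT-tf versions):

onmt-main --config config.yml --auto_config export --output_dir export/ct2 --format ctranslate2
onmt-main --config config.yml --auto_config export --output_dir export/saved --format saved_model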
OpenNMT-tf 2.31.0
TensorFlow 2.10.0
CTranslate2 3.9.1
Maybe the model config is useful as well:
train:
  save_checkpoints_steps: 250
  keep_checkpoint_max: 100
  save_summary_steps: 50
  effective_batch_size: null
  batch_size: 16384
  average_last_checkpoints: 8
  sample_buffer_size: 100000000

eval:
  steps: 250
  external_evaluators:
    - BLEU
    - chrf
  export_on_best: BLEU
  early_stopping:
    metric: bleu
    min_improvement: 0.1
    steps: 10

params:
  learning_rate: 1.0

model_dir: *model_path*

data:
  train_features_file: *training_features_path*
  train_labels_file: *training_labels_path*
  eval_features_file: *eval_feature_path*
  eval_labels_file: *eval_labels_path*
  source_vocabulary: *vocab_path*
  target_vocabulary: *vocab_path*
  source_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: *model_path*
  target_tokenization:
    type: OpenNMTTokenizer
    params:
      mode: none
      sp_model_path: *model_path*