CTranslate2 model crash with a huge set of segments to predict

Hi all!
I have a CTranslate2 model and I'm trying to translate a big file (more than 50k sentences).
Each time I get the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA failed with error an illegal memory access was encountered
Aborted (core dumped)

I create a Translator object and call the translate_batch function; a rough sketch of the call is shown after the config below. My input tokens look like the following:

Sources encoded [['⦅GEN⦆', '■', 'grb'], ['⦅GEN⦆', '■', 'dvostruka'], ['⦅GEN⦆', '■', 'te'], ['⦅GEN⦆', '■', 'mala'], ['⦅GEN⦆', '■', 'svijeća'], ['⦅GEN⦆', '■', 'postoji'], ['⦅GEN⦆', '■', 'aktivnosti'], ['⦅GEN⦆', '■', 'prigovor'], ['⦅GEN⦆', '■', 'više'], ['⦅GEN⦆', '■', 'manje']]

I use the following config:

{'batch_type': 'tokens',
 'beam_size': 1,
 'length_penalty': 0.9,
 'max_batch_size': 512,
 'max_decoding_length': 256,
 'replace_unknowns': True,
 'sampling_topk': 10},
'translator': {'inter_threads': 2, 'intra_threads': 4}}
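
Roughly, the full call looks like this (the model directory name here is a placeholder, and only the first few source lines are shown):

import ctranslate2

# Placeholder path to my converted CTranslate2 model directory.
translator = ctranslate2.Translator(
    "ct2_model", device="cuda", inter_threads=2, intra_threads=4
)

# First lines of the tokenized source file; the real list has >50k entries.
source_tokens = [
    ['⦅GEN⦆', '■', 'grb'],
    ['⦅GEN⦆', '■', 'dvostruka'],
    ['⦅GEN⦆', '■', 'te'],
]

results = translator.translate_batch(
    source_tokens,
    max_batch_size=512,
    batch_type="tokens",
    beam_size=1,
    length_penalty=0.9,
    max_decoding_length=256,
    replace_unknowns=True,
    sampling_topk=10,
)

# Each translation result holds the hypotheses for one source line.
print(results[0].hypotheses[0])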

Do you have any idea what could be wrong? Thank you in advance!

Hi,

What is the CTranslate2 version you are using?

The version is 2.23.0

I’m not reproducing this error.

Can you give more information about the model? Training framework, training options, vocabulary size, etc.

Hi again.
I trained the model using the following configuration:

Runner config:

  architecture: TransformerBigRelative
  mixed_precision: true
  num_gpus: 2
  training_dir: gen_hren

OpenNMT-tf config:

data:
  eval_features_file: data_src.valid.tok
  eval_labels_file: data_tgt.valid.tok
  source_vocabulary: bpe_src.vocab
  target_vocabulary: bpe_tgt.vocab
  train_features_file:
  - src_1.train.tok
  - src_2.train.tok
  - src_3.train.tok
  - src_4.train.tok
  - src_5.train.tok
  train_files_weights:
  - 0.34
  - 0.22
  - 0.11
  - 0.22
  - 0.11
  train_labels_file:
  - 1.train.tok
  - 2.train.tok
  - 3.train.tok
  - 4.train.tok
  - 5.train.tok
eval:
  batch_size: 32
  batch_type: examples
  early_stopping:
    metric: bleu
    min_improvement: 0.01
    steps: 3
  external_evaluators: BLEU
  length_bucket_width: 5
  save_eval_predictions: false
  steps: 5000
infer:
  batch_size: 32
  batch_type: examples
  bucket_width: 5
  length_bucket_width: 5
model_dir: model_base_opennmt
params:
  beam_width: 2
  contrastive_learning: false
  coverage_penalty: 0
  decay_params:
    model_dim: 1024
    warmup_steps: 8000
  decay_type: NoamDecay
  decoding_subword_token: "\uFFED"
  decoding_subword_token_is_spacer: false
  label_smoothing: 0.1
  learning_rate: 1.0
  length_penalty: 0.6
  max_margin_eta: 0.1
  maximum_decoding_length: 256
  num_hypotheses: 1
  optimizer: Adam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
train:
  average_last_checkpoints: 8
  batch_size: 4096
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 200000
  maximum_features_length: 256
  maximum_labels_length: 256
  mixed_precision: true
  moving_average_decay: 0.9999
  replace_unknown_target: true
  sample_buffer_size: 500000
  save_checkpoints_steps: 1000
  save_summary_steps: 200
  single_pass: false
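
For reference, a model trained like this is typically converted for CTranslate2 with the bundled OpenNMT-tf converter, roughly as below (quoting from memory, so the exact flags may differ between CTranslate2 versions; file and directory names are placeholders):

ct2-opennmt-tf-converter --config opennmt_config.yml --auto_config --output_dir ct2_model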

What are the vocabulary sizes?

wc -l bpe_*
  48001 bpe_src.model
  47775 bpe_src.vocab
  39099 bpe_tgt.model
  34607 bpe_tgt.vocab

Trained with a limit of 48k tokens.

I can now reproduce the error using this vocabulary size.

Can you try setting inter_threads to 1 and see if it works?
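
For example, keeping everything else the same and only changing the translator construction (model path as a placeholder):

import ctranslate2

# Same settings as before, but with a single translation worker.
translator = ctranslate2.Translator(
    "ct2_model", device="cuda", inter_threads=1, intra_threads=4
)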

I confirm that the same error still appears with inter_threads=1.

There is a memory error in the random sampling module. The error will be fixed with this change:

Thank you for reporting the issue!
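
In the meantime, since the bug is in the random sampling path, a possible workaround (an untested guess on my side) is to drop sampling_topk so decoding falls back to plain greedy/beam search:

# translator and source_tokens as in the earlier snippets; same options
# as before, minus sampling_topk, so the random sampling module is not used.
results = translator.translate_batch(
    source_tokens,
    max_batch_size=512,
    batch_type="tokens",
    beam_size=1,
    length_penalty=0.9,
    max_decoding_length=256,
    replace_unknowns=True,
)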


FYI, the new version 2.24.0 includes this fix.
