CTranslate2 model crash with a huge set of segments to predict

Hi all!
I have a CTranslate2 model and I'm trying to translate a big file (more than 50k sentences).
Each time I get the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA failed with error an illegal memory access was encountered
Aborted (core dumped)

I create a Translator object and call the translate_batch function; a rough sketch of the call is shown after the config below. My input tokens look like the following:

Sources encoded [['⦅GEN⦆', '■', 'grb'], ['⦅GEN⦆', '■', 'dvostruka'], ['⦅GEN⦆', '■', 'te'], ['⦅GEN⦆', '■', 'mala'], ['⦅GEN⦆', '■', 'svijeća'], ['⦅GEN⦆', '■', 'postoji'], ['⦅GEN⦆', '■', 'aktivnosti'], ['⦅GEN⦆', '■', 'prigovor'], ['⦅GEN⦆', '■', 'više'], ['⦅GEN⦆', '■', 'manje']]

I use the following config:

{'batch_type': 'tokens',
 'beam_size': 1,
 'length_penalty': 0.9,
 'max_batch_size': 512,
 'max_decoding_length': 256,
 'replace_unknowns': True,
 'sampling_topk': 10},
'translator': {'inter_threads': 2, 'intra_threads': 4}}
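
Roughly, the full call looks like this (the model directory name here is a placeholder, and only the first few source lines are shown):

import ctranslate2

# Placeholder path to my converted CTranslate2 model directory.
translator = ctranslate2.Translator(
    "ct2_model", device="cuda", inter_threads=2, intra_threads=4
)

# First lines of the tokenized source file; the real list has >50k entries.
source_tokens = [
    ['⦅GEN⦆', '■', 'grb'],
    ['⦅GEN⦆', '■', 'dvostruka'],
    ['⦅GEN⦆', '■', 'te'],
]

results = translator.translate_batch(
    source_tokens,
    max_batch_size=512,
    batch_type="tokens",
    beam_size=1,
    length_penalty=0.9,
    max_decoding_length=256,
    replace_unknowns=True,
    sampling_topk=10,
)

# Each translation result holds the hypotheses for one source line.
print(results[0].hypotheses[0])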

Do you have any idea what could be wrong? Thank you in advance!

Hi,

What is the CTranslate2 version you are using?

The version is 2.23.0

I’m not reproducing this error.

Can you give more information about the model? Training framework, training options, vocabulary size, etc.

Hi again.
I trained the model using the following configuration:

Runner config:

  architecture: TransformerBigRelative
  mixed_precision: true
  num_gpus: 2
  training_dir: gen_hren

OpenNMT-tf config:

data:
  eval_features_file: data_src.valid.tok
  eval_labels_file: data_tgt.valid.tok
  source_vocabulary: bpe_src.vocab
  target_vocabulary: bpe_tgt.vocab
  train_features_file:
  - src_1.train.tok
  - src_2.train.tok
  - src_3.train.tok
  - src_4.train.tok
  - src_5.train.tok
  train_files_weights:
  - 0.34
  - 0.22
  - 0.11
  - 0.22
  - 0.11
  train_labels_file:
  - 1.train.tok
  - 2.train.tok
  - 3.train.tok
  - 4.train.tok
  - 5.train.tok
eval:
  batch_size: 32
  batch_type: examples
  early_stopping:
    metric: bleu
    min_improvement: 0.01
    steps: 3
  external_evaluators: BLEU
  length_bucket_width: 5
  save_eval_predictions: false
  steps: 5000
infer:
  batch_size: 32
  batch_type: examples
  bucket_width: 5
  length_bucket_width: 5
model_dir: model_base_opennmt
params:
  beam_width: 2
  contrastive_learning: false
  coverage_penalty: 0
  decay_params:
    model_dim: 1024
    warmup_steps: 8000
  decay_type: NoamDecay
  decoding_subword_token: "\uFFED"
  decoding_subword_token_is_spacer: false
  label_smoothing: 0.1
  learning_rate: 1.0
  length_penalty: 0.6
  max_margin_eta: 0.1
  maximum_decoding_length: 256
  num_hypotheses: 1
  optimizer: Adam
  optimizer_params:
    beta_1: 0.9
    beta_2: 0.998
train:
  average_last_checkpoints: 8
  batch_size: 4096
  batch_type: tokens
  effective_batch_size: 25000
  keep_checkpoint_max: 8
  length_bucket_width: 1
  max_step: 200000
  maximum_features_length: 256
  maximum_labels_length: 256
  mixed_precision: true
  moving_average_decay: 0.9999
  replace_unknown_target: true
  sample_buffer_size: 500000
  save_checkpoints_steps: 1000
  save_summary_steps: 200
  single_pass: false
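
For reference, a model trained like this is typically converted for CTranslate2 with the bundled OpenNMT-tf converter, roughly as below (quoting from memory, so the exact flags may differ between CTranslate2 versions; file and directory names are placeholders):

ct2-opennmt-tf-converter --config opennmt_config.yml --auto_config --output_dir ct2_model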

What are the vocabulary sizes?

wc -l bpe_*
  48001 bpe_src.model
  47775 bpe_src.vocab
  39099 bpe_tgt.model
  34607 bpe_tgt.vocab

Trained with a limit of 48k tokens.

I can now reproduce the error using this vocabulary size.

Can you try setting inter_threads to 1 and see if it works?
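
For example, keeping everything else the same and only changing the translator construction (model path as a placeholder):

import ctranslate2

# Same settings as before, but with a single translation worker.
translator = ctranslate2.Translator(
    "ct2_model", device="cuda", inter_threads=1, intra_threads=4
)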

I confirm that the same error still appears with inter_threads=1.

There is a memory error in the random sampling module. The error will be fixed with this change:

Thank you for reporting the issue!
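
In the meantime, since the bug is in the random sampling path, a possible workaround (an untested guess on my side) is to drop sampling_topk so decoding falls back to plain greedy/beam search:

# translator and source_tokens as in the earlier snippets; same options
# as before, minus sampling_topk, so the random sampling module is not used.
results = translator.translate_batch(
    source_tokens,
    max_batch_size=512,
    batch_type="tokens",
    beam_size=1,
    length_penalty=0.9,
    max_decoding_length=256,
    replace_unknowns=True,
)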


FYI, the new version 2.24.0 includes this fix.
