Memory Issue with CTranslate2 and <unk> token

Hi,

I’ve trained an OpenNMT model that works perfectly. However, after exporting it to CTranslate2, I’m hitting a memory issue on prediction (GPU) when the sentence contains the <unk> token (it also happens if the sentence contains a non-existent token).
The <unk> token exists in the vocabulary, and the tokenization sent to the model is correct (non-existent text is replaced with <unk>).
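For reference, I checked that <unk> is present in the converted vocabulary with something like this (a minimal sketch; the vocabulary file names are an assumption based on a typical converted model directory, which may instead use a single shared vocabulary file):

from pathlib import Path

model_dir = Path("model_dir")  # placeholder: directory of the converted CTranslate2 model

for name in ("source_vocabulary.txt", "target_vocabulary.txt"):
    vocab_file = model_dir / name
    if vocab_file.exists():
        tokens = vocab_file.read_text(encoding="utf-8").splitlines()
        print(name, "<unk>" in tokens)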

The specific message received is: “Segmentation fault (core dumped)”

CTranslate2 parameters used:

translator:
  inter_threads: 2
  intra_threads: 4

translate_batch:
  max_batch_size: 1024
  batch_type: tokens
  beam_size: 5
  length_penalty: 0.6
  max_decoding_length: 256
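
In Python terms, these options map to the API roughly as follows (a minimal sketch; the model path and input tokens are placeholders):

import ctranslate2

# Options from the "translator" section above.
translator = ctranslate2.Translator(
    "model_dir",            # placeholder path to the converted model
    device="cuda",
    inter_threads=2,
    intra_threads=4,
)

# Options from the "translate_batch" section above.
results = translator.translate_batch(
    [["▁", "H", "ello", "<unk>", "▁World", "."]],  # placeholder tokenized input
    max_batch_size=1024,
    batch_type="tokens",
    beam_size=5,
    length_penalty=0.6,
    max_decoding_length=256,
)

# With CTranslate2 1.x, each result is a list of hypothesis dicts.
print(results[0][0]["tokens"])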

Any idea what could be happening?

Thanks for your help,
Daniel

Hi,

Does it happen only on GPU and only when the sentence has an unknown token? Or do you get the same error in other cases?

Hi Guillaume,

It also crashes on CPU, and whenever the input contains an out-of-vocabulary token or the <unk> token.
The exact same error occurs in all these scenarios.

Thanks

I quickly tried with the pretrained English-German model, and using <unk> or OOV tokens did not produce any error. So I would need more information to understand what is going on:

  • Which CTranslate2 version are you using?
  • Is the model coming from OpenNMT-py or OpenNMT-tf?
  • Can you post the tokenized input that makes the error?

We have trained other similar models that do not have this issue.
In fact, we tried the same problematic input on another CTranslate2 model with the same source language but a different target language, and it worked.

CTranslate2 version: 1.16.2
Model: OpenNMT-tf TransformerBig

Problematic input example:

  • Original: <bpt i="1" type="439" x="49"/>Hello➔World.
  • Normalized + Tokenized: ▁ ⦅BPT⦆ H ello <unk> World .

This works (removing the XML tag):

  • Original: Hello➔World.
  • Normalized + Tokenized: ▁ H ello <unk> World .

This also works (removing the arrow character):

  • Original: <bpt i="1" type="439" x="49"/>Hello World.
  • Normalized + Tokenized: ▁ ⦅BPT⦆ H ello ▁World .
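
In case it helps to reproduce, the failing call looks roughly like this (a minimal sketch; the model path is a placeholder):

import ctranslate2

translator = ctranslate2.Translator("model_dir", device="cuda")

ok_no_tag   = ["▁", "H", "ello", "<unk>", "World", "."]
ok_no_arrow = ["▁", "⦅BPT⦆", "H", "ello", "▁World", "."]
crashes     = ["▁", "⦅BPT⦆", "H", "ello", "<unk>", "World", "."]

for tokens in (ok_no_tag, ok_no_arrow, crashes):
    # The first two inputs translate fine; the last one aborts with
    # "Segmentation fault (core dumped)".
    print(translator.translate_batch([tokens], beam_size=5)[0][0]["tokens"])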

Is it possible for you to share this model privately?

I don’t know if I’m allowed to yet, but what would be the way to share it privately?

You could temporarily host it somewhere (e.g. Google Drive) and send me a link in a private message.

(Note that without the SentencePiece model, the translation model is basically unusable. So it’s quite safe to share only the translation model.)