OpenNMT Forum

Getting <UNK> tokens in alternative at position

Hi folks,

I am trying to run the ‘Alternative at position’ decoding feature of CTranslate2, but I am getting <unk> tokens in the translated alternative sentences. Could you please tell me where I am making a mistake and how I can resolve this issue?


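import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2/")
Input = "This project is geared towards efficient serving of standard translation models but is also a place for experimentation around model compression and inference acceleration."

def tokenize(data):
    return data.split(" ")

def detokenize(data):
    return " ".join(data)

# The source argument, the options after target_prefix, and the loop body
# were cut off in the post. They are filled in here as an assumption,
# following the "alternatives at a position" example in the CTranslate2 docs.
results = translator.translate_batch(
    [tokenize(Input)],
    target_prefix=[tokenize("Dieses Prokekt ist auf die")],
    num_hypotheses=5,
    return_alternatives=True,
)

for hypothesis in results[0]:
    print(detokenize(hypothesis["tokens"]))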

Output after running the above script:

I tried different English input sentences and got the same result.


The pretrained model ende_ctranslate2 was trained on data tokenized with SentencePiece, so the definitions of tokenize and detokenize in your script are not valid.
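To illustrate why whitespace tokens tend to come back as <unk>, here is a minimal sketch. The vocabulary below is a made-up toy (an assumption, not the real wmtende vocabulary), and `lookup` is a hypothetical helper that mimics how a fixed-vocabulary model maps unseen tokens to <unk>. SentencePiece pieces carry a "▁" marker at the start of each word, so plain words produced by split(" ") usually do not match any entry:

```python
# Toy vocabulary (hypothetical, for illustration only). Real SentencePiece
# vocabularies store word-initial pieces with a leading "▁" marker.
VOCAB = {"<unk>": 0, "▁Dieses": 1, "▁Projekt": 2, "▁ist": 3, "▁auf": 4, "▁die": 5}


def lookup(tokens, vocab=VOCAB):
    """Map tokens to ids; anything outside the vocabulary becomes <unk>."""
    unk_id = vocab["<unk>"]
    return [vocab.get(token, unk_id) for token in tokens]


# Whitespace tokenization yields plain words without the "▁" marker.
whitespace_tokens = "Dieses Projekt ist auf die".split(" ")

# Tokens shaped the way a SentencePiece model would emit them.
piece_tokens = ["▁Dieses", "▁Projekt", "▁ist", "▁auf", "▁die"]

print(lookup(whitespace_tokens))  # all ids are the <unk> id here
print(lookup(piece_tokens))       # every piece is found in the toy vocabulary
```

With a real model, some whitespace tokens may happen to match a vocabulary entry, but most will not, which is what shows up as <unk> in the output.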

You can get the SentencePiece model here: and follow the instructions in this page to apply it:


Hi @guillaumekln,

Thanks for your reply.

I am sorry, but I did not fully understand your suggestion.

I downloaded the wmt-ende-sp data and got two files, ‘wmtende.model’ and ‘wmtende.vocab’. Should I include these two files in the script above?

I also had a look at the page you shared, but unfortunately I did not get much out of it. It seems to me that the instructions on that page are for creating the .model and .vocab files, but I already have these two files and only need to run them.

Could you please explain in more detail what exactly I have to do for ‘alternative at position’? I am really sorry to bother you; I am a very new user and do not have a good understanding of these tools.


First install SentencePiece with:

pip install sentencepiece

and then update your tokenization functions:

import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file='wmtende.model')

def tokenize(data):
    return sp.encode(data, out_type=str)

def detokenize(data):
    return sp.decode(data)
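
For reference, here is a rough sketch of how these functions could plug into the translation call. The names `translate_with_prefix` and `main` are hypothetical helpers, the paths are the ones used earlier in this thread, and num_hypotheses=5 is just an example value. The sketch assumes recent CTranslate2 releases expose the tokens through a `hypotheses` attribute, and falls back to the older list-of-dicts result otherwise:

```python
def translate_with_prefix(translator, sp, source, prefix, num_hypotheses=5):
    """Return detokenized alternatives that continue the given target prefix."""
    results = translator.translate_batch(
        [sp.encode(source, out_type=str)],
        target_prefix=[sp.encode(prefix, out_type=str)],
        num_hypotheses=num_hypotheses,
        return_alternatives=True,
    )
    result = results[0]
    if hasattr(result, "hypotheses"):
        # Newer result objects: a list of token lists.
        token_lists = result.hypotheses
    else:
        # Older releases: a list of dicts with a "tokens" entry.
        token_lists = [hypothesis["tokens"] for hypothesis in result]
    return [sp.decode(tokens) for tokens in token_lists]


def main():
    # Imported here so the helper above can be used without these packages.
    import ctranslate2
    import sentencepiece as spm

    translator = ctranslate2.Translator("ende_ctranslate2/")
    sp = spm.SentencePieceProcessor(model_file="wmtende.model")
    source = "This project is geared towards efficient serving of standard translation models."
    for text in translate_with_prefix(translator, sp, source, "Dieses Projekt ist auf die"):
        print(text)
```

Calling main() requires the converted model directory and wmtende.model to be present in the working directory.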

Hi @guillaumekln

Thank you so much. The problem is solved and it is working fine.