I am trying to run the ‘Alternatives at position’ decoding feature of CTranslate2, but I am getting incorrect output in the translated alternative sentences. Could you please tell me where I am making a mistake and how I can resolve this issue?
import ctranslate2
translator = ctranslate2.Translator("ende_ctranslate2/")
Input = "This project is geared towards efficient serving of standard translation models but is also a place for experimentation around model compression and inference acceleration."
# Naive whitespace tokenization and detokenization
def tokenize(data):
    return data.split(" ")
def detokenize(data):
    return " ".join(data)
results = translator.translate_batch(
    [tokenize(Input)],
    target_prefix=[tokenize("Dieses Projekt ist auf die")],
    return_alternatives=True)
for hypothesis in results:
    print(hypothesis)
Output after running the above script:
I tried different English input sentences and got the same result.
The pretrained model ende_ctranslate2 was trained on data tokenized with SentencePiece, so your definitions of tokenize and detokenize are not valid.
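As a rough illustration of why this matters, here is a toy sketch that uses a made-up subword vocabulary rather than the real wmtende.vocab. Tokens produced by splitting on spaces rarely match the pieces the model knows, so they typically fall back to the unknown token. The vocabulary entries, the `<unk>` label, and the `lookup` helper are assumptions for the sake of the example.

```python
# Toy subword vocabulary in SentencePiece style ("▁" marks the start of a word).
# These entries are invented for illustration; the real pieces are in wmtende.vocab.
TOY_VOCAB = {"▁This", "▁pro", "ject", "▁is", "▁effic", "ient"}


def lookup(tokens, vocab=TOY_VOCAB, unk="<unk>"):
    """Replace every token that is not in the vocabulary with the unknown token."""
    return [tok if tok in vocab else unk for tok in tokens]


# Whitespace tokens lack the "▁" marker and the subword splits, so none match.
whitespace_tokens = "This project is efficient".split(" ")
print(lookup(whitespace_tokens))

# Pieces in the format the model was trained on are all found.
subword_tokens = ["▁This", "▁pro", "ject", "▁is", "▁effic", "ient"]
print(lookup(subword_tokens))
```

The first call maps every token to the unknown label, while the second returns the pieces unchanged.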
You can get the SentencePiece model here: https://s3.amazonaws.com/opennmt-trainingdata/wmt_ende_sp_model.tar.gz and follow the instructions on this page to apply it:
Thanks for your reply.
I am sorry but I did not clearly understand what you are trying to say.
I downloaded the wmt-ende-sp data and got two files, ‘wmtende.model’ and ‘wmtende.vocab’. Should I include these two files in the script above?
I also looked at the page you shared, but unfortunately I did not understand much of it. It seems to me that the instructions on that page are for creating the .model and .vocab files, but I already have these two files and only need to use them.
Could you please explain in more detail what exactly I have to do for ‘alternatives at position’? I am sorry to bother you; I am a very new user and do not have a good understanding of these tools.
First install SentencePiece with:
pip install sentencepiece
and then update your tokenization functions:
import sentencepiece as spm
sp = spm.SentencePieceProcessor(model_file='wmtende.model')
def tokenize(data):
    return sp.encode(data, out_type=str)
def detokenize(data):
    return sp.decode(data)
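Putting the pieces together, the full call with SentencePiece-based tokenization might look like the sketch below. It assumes a recent CTranslate2 release in which each result exposes a `hypotheses` attribute (older releases returned lists of dictionaries instead). The names `translate_alternatives` and `demo` are made up for this example and are not part of either library.

```python
def translate_alternatives(translator, sp, text, prefix, num_hypotheses=5):
    """Return detokenized alternatives that continue the given target prefix.

    `translator` is expected to behave like ctranslate2.Translator and `sp`
    like sentencepiece.SentencePieceProcessor. Hypothetical helper.
    """
    results = translator.translate_batch(
        [sp.encode(text, out_type=str)],
        target_prefix=[sp.encode(prefix, out_type=str)],
        num_hypotheses=num_hypotheses,
        return_alternatives=True,
    )
    # Assumption: results[0].hypotheses is a list of token lists (recent versions).
    return [sp.decode(tokens) for tokens in results[0].hypotheses]


def demo():
    """Run the helper against the real model files (paths taken from this thread)."""
    import ctranslate2
    import sentencepiece as spm

    translator = ctranslate2.Translator("ende_ctranslate2/")
    sp = spm.SentencePieceProcessor(model_file="wmtende.model")
    source = (
        "This project is geared towards efficient serving of standard "
        "translation models but is also a place for experimentation around "
        "model compression and inference acceleration."
    )
    for sentence in translate_alternatives(
        translator, sp, source, "Dieses Projekt ist auf die"
    ):
        print(sentence)
```

Calling `demo()` requires the converted model directory and wmtende.model to be present in the working directory.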
Thank you so much. The problem is solved and it is working fine.