I have trained a Transformer model and used CTranslate2 to convert the checkpoints into the expected format, but when I try to run translation I am not getting any output.
Could you please try to test CTranslate2 independently? Please also make sure you update to the latest version before converting the model.
Here is some sample code you can use to test your model. Replace the tokenize and detokenize functions with your own, and change the CTranslate2 model path. Test the model with complete sentences rather than single words.
Also, please double-check all of your paths; I agree with Panos that there might be something to correct.
import ctranslate2

# Replace with your tokenize function and source tokenization model
def tokenize(input_sentences):
    tokens = [input_sentence.split(" ") for input_sentence in input_sentences]
    return tokens

# Replace with your detokenize function and target tokenization model
def detokenize(outputs):
    translations = [" ".join(output) for output in outputs]
    return translations

# Modify the path to the CTranslate2 model directory
model_path = "ctranslate2_model"

source_sentences = ["how are you?", "fine, thanks!", "everything is great.", "I am happy to know that."]

translator = ctranslate2.Translator(model_path, "cpu")  # "cpu" or "cuda"
outputs = translator.translate_batch(tokenize(source_sentences), beam_size=5)

# Each result holds a list of hypotheses; take the tokens of the best one
# (older CTranslate2 versions returned dicts, i.e. output[0]["tokens"])
translations = detokenize(output.hypotheses[0] for output in outputs)
print(translations)
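For reference, if your model was trained on SentencePiece subwords, the tokenize and detokenize functions might look roughly like the following sketch. The source.model and target.model paths are hypothetical; point them at your own SentencePiece models.

import sentencepiece as spm

# Hypothetical paths to the source and target SentencePiece models
source_sp = spm.SentencePieceProcessor()
source_sp.load("source.model")
target_sp = spm.SentencePieceProcessor()
target_sp.load("target.model")

def tokenize(input_sentences):
    # Split each sentence into SentencePiece subword tokens
    return [source_sp.encode_as_pieces(sentence) for sentence in input_sentences]

def detokenize(outputs):
    # Merge the predicted subword tokens back into plain text
    return [target_sp.decode_pieces(list(output)) for output in outputs]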
I know this is an old post, but did you find out the root cause of this issue? I’m having the exact same problem, and I can’t figure out the reason. My checkpoints are producing good translations, but after converting them, the exported models generate a repetition of the same random token.
Interestingly enough, the predictions from the exported model always have a length equal to the maximum token length parameter (in my case, that is 256 mostly repeated tokens), regardless of the length of the input sentence.
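For what it's worth, that cap corresponds to the max_decoding_length argument of translate_batch (256 appears to be the default in recent CTranslate2 versions, though check this for your version). Lowering it is a quick way to confirm the output is simply hitting the cap because no end-of-sentence token is ever produced; a minimal sketch, reusing the translator and helpers from the sample code above:

# If the translations always have exactly max_decoding_length tokens,
# the decoder is most likely never producing the end-of-sentence token.
outputs = translator.translate_batch(
    tokenize(source_sentences),
    beam_size=5,
    max_decoding_length=64,  # hypothetical lower cap for a quick check
)
print([len(output.hypotheses[0]) for output in outputs])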
I’m having this exact same issue. I just checked, and my CTranslate2 model also outputs 256 tokens for every input. The output tokens seem random, however, rather than repetitive.
I eventually found the root cause. At least in my case, the reason was that the vocab I was using for training (converted from SentencePiece) did not start with the proper special tokens specified in the documentation, that is, <blank>, <s>, and </s>. So CT2 was using another token as the EOS marker and therefore never stopped predicting. You can check whether this is your case by looking at your vocab and replacing the third token with </s> if necessary.
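Here is a minimal sketch of that check, assuming an OpenNMT-tf-style vocabulary file with one token per line, saved as vocab.txt (adjust the path and expected tokens for your setup):

expected = ["<blank>", "<s>", "</s>"]

# Read the first three entries of the training vocabulary
with open("vocab.txt", encoding="utf-8") as vocab_file:
    first_tokens = [next(vocab_file).strip() for _ in range(3)]

if first_tokens != expected:
    print("Unexpected special tokens:", first_tokens, "expected:", expected)
else:
    print("Special tokens look fine.")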
By the way, the OpenNMT CLI script to build/convert the vocab handles this correctly; in my case, though, I was doing the conversion within my own code, hence the mismatch.
On the other hand, what was really confusing in my case was that the checkpoints and saved models were producing correct predictions; I suppose that is because those tokens are handled internally by the library, whereas for CT2 models the correct initial tokens are simply assumed.
I’ve just looked through my vocab files, and they both have the <blank>, <s>, and </s> entries, though.
I’m not sure what is responsible for the issue in my case. A mis-assigned EOS token makes perfect sense as one possible reason, but there doesn’t seem to be evidence of that with my model.
Have you tried to run inference from a checkpoint or a saved model? If so, maybe the problem was already present in training.
You probably checked it already, but have a look at the CT2 vocabs (saved as .json in the exported folder); they should be consistent with the vocabs used for training.
In any case, this really looks like a vocab issue, wherever it comes from…
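For example, something along these lines could surface a mismatch. It is a rough sketch: it assumes the training vocab is a plain text file (vocab.txt) and that the CT2 export contains a shared_vocabulary.json that is a plain JSON list of tokens; the actual file names depend on your CTranslate2 version and on whether the vocabulary is shared.

import json

# Hypothetical paths; adjust to your training vocab and CT2 export folder
with open("vocab.txt", encoding="utf-8") as f:
    training_vocab = [line.rstrip("\n") for line in f]

with open("ctranslate2_model/shared_vocabulary.json", encoding="utf-8") as f:
    ct2_vocab = json.load(f)

print("Training vocab size:", len(training_vocab), "| CT2 vocab size:", len(ct2_vocab))
print("First training tokens:", training_vocab[:5])
print("First CT2 tokens:", ct2_vocab[:5])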
Running inference from a checkpoint works fine. Running from a SavedModel also seems to work fine. The issue seems to involve CT2 specifically.
It’s surprising because the difference between my export processes for the CT2 and SavedModel versions is literally just --format ctranslate2 in the onmt-main export command. So I wonder where the difference in behaviour is coming from. In the meantime, I’ll work with the SavedModel version instead.
PS: I wonder if this has something to do with my use of pretrained embeddings. Previous ONMT models I’ve successfully converted to CT2 format did not use pretrained embeddings. This model, however, uses pretrained fastText embeddings for both the source and target languages (specified in data.yaml).
I didn’t use pretrained embeddings. But if the SavedModel versions work well, I would have a look at the exported vocabularies in both the SavedModel and CT2 locations and see if there are any differences.
Another thing you may want to try is to export on the best metric during training and compare the models exported that way with the exports you made with the script.
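If it helps, exporting on the best metric is configured in the training YAML. A sketch of what the eval section might look like; the exact keys depend on your OpenNMT-tf version, so treat this as an assumption to check against the documentation:

eval:
  scorers: bleu
  export_on_best: bleu        # export a model whenever the BLEU score improves
  export_format: ctranslate2  # export directly in CTranslate2 format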