Translation problem using CTranslate2

I have trained a Transformer model and used CTranslate2 to convert the checkpoints into a suitable format, but when I try to run the translation, I do not get any output.

Command line:

ct2-opennmt-tf-converter --model_path /content/drive/MyDrive/Data/English-Tamil/en-kn/run/avg \
    --output_dir ende_ctranslate2 \
    --src_vocab /content/drive/MyDrive/Data/English-Tamil/en-kn/vocab.SRC \
    --tgt_vocab /content/drive/MyDrive/Data/English-Tamil/en-kn/vocab.TGT \
    --model_type Transformer

Error: [screenshot of the error output]

I don't understand what is causing the error.

Please provide more information:

  • the training command line and configuration
  • the tokenization method
  • how you are running the translation behind the web interface.

Also, did you try other inputs?

  1. onmt-main --model_type Transformer --config data.yaml --auto_config train --with_eval
  2. I am using the OpenNMT tokenizer.
  3. I am using a Streamlit interface.
  4. Yes, and it is not translating into the target language.

Try running the inference with OpenNMT-tf and see if it is different.

No, I am not running into any problems with OpenNMT-tf.

These kinds of problems usually arise from mismatches in tokenization, vocabulary, and/or input format between training and inference.
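For example, here is a minimal sketch of what consistent inference looks like when the model was trained on text tokenized with pyonmttok (the OpenNMT Tokenizer); the tokenizer options and model path below are placeholders and must match your training setup exactly:

import ctranslate2
import pyonmttok

# Configure the tokenizer exactly as during training: same mode, same
# options, and the same SentencePiece/BPE model if one was used.
tokenizer = pyonmttok.Tokenizer("aggressive", joiner_annotate=True)

translator = ctranslate2.Translator("ctranslate2_model", device="cpu")

source = "how are you?"
tokens, _ = tokenizer.tokenize(source)  # apply the training-time tokenization
results = translator.translate_batch([tokens])
best_tokens = results[0].hypotheses[0]  # best hypothesis for the first sentence
print(tokenizer.detokenize(best_tokens))  # apply the training-time detokenization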


Hi Amartya,

Could you please try to test CTranslate2 independently? Please make sure you update to the latest version before converting the model.

Here is a sample code you can use to test your model. Please change the tokenize and detokenize functions as well as your CTranslate2 model path. Test the model with complete sentences rather than single words.

Kindly double-check all your paths; I agree with Panos that there might be something to correct.

import ctranslate2


# Replace with your tokenize function and source tokenization model
def tokenize(input_sentences):
    tokens = [input_sentence.split(" ") for input_sentence in input_sentences]
    return tokens


# Replace with your detokenize function and target tokenization model
def detokenize(outputs):
    translations = [" ".join(output) for output in outputs]
    return translations


# Modify the path to the CTranslate2 model directory
model_path = "ctranslate2_model"

source_sentences = ["how are you?", "fine, thanks!", "everything is great.", "I am happy to know that."]

translator = ctranslate2.Translator(model_path, device="cpu")  # "cpu" or "cuda"

# translate_batch returns one result per input sentence; hypotheses[0] is
# the best hypothesis, i.e. a list of target tokens.
outputs = translator.translate_batch(tokenize(source_sentences), beam_size=5)
translations = detokenize([output.hypotheses[0] for output in outputs])
print(translations)
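Note that this snippet assumes a recent CTranslate2 version; in older releases, translate_batch returned a list of dictionaries per sentence, and the best hypothesis was accessed as outputs[0][0]["tokens"] instead of outputs[0].hypotheses[0].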

Kind regards,
Yasmin


Hi @aquorio15,

I know this is an old post, but did you find out the root cause of this issue? I’m having the exact same problem, and I can’t figure out the reason. My checkpoints are producing good translations, but after converting them, the exported models generate a repetition of the same random token.

Interestingly enough, the predictions from the exported model always have a length equal to the maximum token length parameter (in my case, 256 mostly repeated tokens), regardless of the length of the input sentence.

Thanks in advance.


I’m having this exact same issue. And I just checked: my CTranslate2 model also keeps outputting 256 tokens for every input. However, the output tokens seem random rather than repetitive.

@panosk @ymoslem @guillaumekln Curious what your thoughts are on this.

Hi @mayowaosibodu,

I eventually found out the root cause. At least in my case, the reason was that the vocab I was using for training (converted from SentencePiece) did not have the proper tokens at the beginning, as specified in the documentation, that is, <blank>, <s> and </s>. So, CT2 was using another token to mark the EOS and therefore never stopped predicting. You can simply check whether this is your case by looking at your vocab and replacing the third token with </s> if necessary.
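For instance, a quick check could look like this (assuming a plain-text vocabulary file with one token per line; the path is a placeholder):

vocab_path = "vocab.TGT"  # placeholder: path to your training vocabulary

# The first three tokens should be the special tokens expected by
# OpenNMT-tf / CTranslate2: <blank>, <s> and </s>.
with open(vocab_path, encoding="utf-8") as vocab_file:
    first_tokens = [next(vocab_file).strip() for _ in range(3)]

print(first_tokens)  # expected: ['<blank>', '<s>', '</s>']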

By the way, the OpenNMT CLI command/script to build/convert the vocab works well, but in my case, I was doing the conversion within my code, hence the mismatch.

On the other hand, what was really confusing in my case was the fact that checkpoints and saved models were producing correct predictions; I suppose that is because those tokens are handled internally by the library, whereas for CT2 models the correct initial tokens are simply assumed.

Hope this helps.


Thanks for the quick response.

I’ve just looked through my vocab files, and they both have the <blank>, <s> and </s> entries, though.

I’m not sure what is responsible for the issue in my case. A mis-assigned EOS token makes perfect sense as one possible reason, but there doesn’t seem to be evidence of that with my model.

Have you tried to run inference from a checkpoint or a saved model? If yes, maybe the problem was already in the training.

You probably checked it already, but have a look at the CT2 vocabs (saved as .json in the exported folder); they should be consistent with the vocabs used for training.
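For example, a minimal comparison sketch (the file names are assumptions based on a typical CTranslate2 export folder; adjust them to your layout):

import json

# Vocabulary exported by CTranslate2 (a JSON list of tokens).
with open("ctranslate2_model/source_vocabulary.json", encoding="utf-8") as f:
    ct2_vocab = json.load(f)

# Vocabulary used for training (one token per line).
with open("vocab.SRC", encoding="utf-8") as f:
    train_vocab = [line.rstrip("\n") for line in f]

print(len(ct2_vocab), len(train_vocab))  # sizes should match
print(ct2_vocab[:5], train_vocab[:5])    # first tokens should line up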

In any case, this really looks like a vocab issue, wherever it comes from…

Running inference from a checkpoint works fine. Running from a SavedModel also seems to work fine. The issue seems to involve CT2 specifically.

It’s surprising because the difference between my export process for the CT2 and SavedModel versions is literally just --format ctranslate2 in the onmt-main export command. So I wonder where the difference in behaviours is coming from. In the meantime I’ll work with the SavedModel version instead.

PS: I wonder if this has something to do with my use of pretrained embeddings. Previous ONMT models I’ve successfully converted to CT2 format did not use pretrained embeddings. This model, however, uses pretrained fastText embeddings for both the source and target languages (specified in data.yaml).

Did your model use pretrained embeddings?

I didn’t use pretrained embeddings. But if the SavedModel versions work well, I would have a look at the exported vocabularies in both the SavedModel and CT2 locations and see if there are any differences.

Another thing you may want to try is to export on the best metric during training and compare the models exported this way with the exports you did with the script.
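If I recall the OpenNMT-tf options correctly, that would be something like this in the eval section of data.yaml (the parameter names are worth double-checking against your OpenNMT-tf version; the values are examples):

eval:
  # Export a new model whenever the chosen metric improves on the eval set.
  scorers: bleu
  export_on_best: bleu
  export_format: ctranslate2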