Markup with translation

double · December 28, 2023, 9:28pm

Is there a way to preserve markup in translation with ctranslate2?

Example en->es:

Tomorrow <b>you</b> will go to school.
->
Mañana <b>tú</b> irás a la escuela.

Best wishes!
Marcus

miguelknals · December 29, 2023, 9:54am

Hi
I don't think this is a matter of ctranslate2, as the only thing it does is query your MT model. So the question is broader: how can the MT model handle these tags?

I have seen MT models that were fed tags as they are, but honestly, with very bad results. You need to handle them. I think if you search Google for “translate with markup” you can find useful papers (such as How Should Markup Tags Be Translated? or Document Translation with Markup Reinsertion). If you ask ChatGPT “How do you train a machine translation model with sentences with markup”, it will provide more clues.

In general, I would say these sentences do not come out of the blue; they are part of a document. So first you need a parser to extract the translatable text from the document itself, second you need to handle these segments, and third you call the MT. The second paper I mention here is interesting because it uses the widely used Okapi Framework.
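The three steps above (extract, handle the tags, call the MT) could be sketched roughly like this in Python. This is only a minimal illustration, not how Okapi or either paper does it: the names `strip_tags`, `translate_segment` and `fake_mt` are made up for the example, the regex only covers simple inline tags, and `fake_mt` stands in for whatever MT call you actually use. It removes the tags, remembers where they were in the plain source text, and translates the tag-free sentence. Putting the tags back into the target sentence is the hard part and is not attempted here.

```python
import re
from typing import Callable, List, Tuple

# Matches simple inline tags such as <b>, </b> or <i class="x">.
# Assumption: no comments, CDATA or ">" inside attribute values.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def strip_tags(segment: str) -> Tuple[str, List[Tuple[int, str]]]:
    """Return the plain text and a list of (offset in plain text, tag)."""
    plain_parts: List[str] = []
    tags: List[Tuple[int, str]] = []
    pos = 0        # read position in the original segment
    plain_len = 0  # length of the plain text collected so far
    for match in TAG_RE.finditer(segment):
        chunk = segment[pos:match.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        tags.append((plain_len, match.group()))
        pos = match.end()
    plain_parts.append(segment[pos:])
    return "".join(plain_parts), tags


def translate_segment(
    segment: str, translate: Callable[[str], str]
) -> Tuple[str, List[Tuple[int, str]]]:
    """Translate the tag-free text; hand back the removed tags for later fixing."""
    plain, tags = strip_tags(segment)
    return translate(plain), tags


def fake_mt(text: str) -> str:
    # Hypothetical stand-in for the real MT call; it only knows one sentence.
    known = {"Tomorrow you will go to school.": "Mañana tú irás a la escuela."}
    return known.get(text, text)


translation, tags = translate_segment("Tomorrow <b>you</b> will go to school.", fake_mt)
print(translation)  # Mañana tú irás a la escuela.
print(tags)         # [(9, '<b>'), (12, '</b>')]
```

The offsets refer to the source sentence, so they only tell a human (or a later alignment step) which source words were wrapped; they cannot be applied directly to the translation.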

A common practice is to do your best at removing or handling these tags, and later a human can reformat the text to fix whatever you have not been able to fix.

And finally, a trend I have seen is to create documents and content with MT in mind, so more and more the tendency is to provide texts without tags or with very simple tags (translating a DITA file or an HTML file with MT is a nightmare).
Hope this helps!
M

double · December 29, 2023, 10:44am

@miguelknals
Thanks for your detailed answer!

double · December 29, 2023, 10:47am

@miguelknals
Is there a way to mark the token “” as non-translatable?
Best wishes

miguelknals · December 30, 2023, 12:04pm

@double Sorry, but I'm not sure about this. As far as I can see,

<b>you</b>

is just one token for the MT (if this is the MT input). This token will probably be left in the Spanish translation as it is in the English, because it is an unknown (UNK) token.

If you split the tokens as in:

Tomorrow <b> you </b> will go to school.

it is highly probable that what is inside the tags is translated and the tags are kept as in the English. But there is no guarantee of getting the same translation as for the sentence without the tags, as the tags will interfere with the translation.
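For what it's worth, splitting the tags off as separate tokens before calling the MT can be done with a few lines of Python. This is just a sketch under assumptions: `split_tags` is a made-up helper, it tokenizes on whitespace only (a real model would normally use its own subword tokenizer after this step), and the regex only handles simple inline tags.

```python
import re
from typing import List

# The capturing group makes re.split keep the tags in the result.
TAG_SPLIT_RE = re.compile(r"(</?[A-Za-z][^>]*>)")


def split_tags(text: str) -> List[str]:
    """Whitespace-tokenize the text, keeping each tag as its own token."""
    tokens: List[str] = []
    for piece in TAG_SPLIT_RE.split(text):
        if TAG_SPLIT_RE.fullmatch(piece):
            tokens.append(piece)          # a tag: keep it whole
        else:
            tokens.extend(piece.split())  # normal text: split on whitespace
    return tokens


tokens = split_tags("Tomorrow <b>you</b> will go to school.")
print(tokens)
# ['Tomorrow', '<b>', 'you', '</b>', 'will', 'go', 'to', 'school.']
print(" ".join(tokens))
# Tomorrow <b> you </b> will go to school.
```

Whether the model then copies `<b>` and `</b>` through unchanged depends on the model and its vocabulary, so check the output rather than rely on it.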

Bottom line: unknown tokens are usually kept as they are, but no, I don't know specifically how to mark a token as “untranslatable” or make the MT search ignore them. Sorry.
M

double · December 30, 2023, 12:21pm

@miguelknals
Thanks for your generous help!
Best wishes!