OpenNMT

Respect the format of a text

Hello to all!
I have a general problem with translation. I am asking my question here because I am not sure where else to ask it (it is a general question).

In my use case, the format of the text must be preserved during translation: the line breaks, the emojis, style markers such as bold (**) and underline (__), and finally some characters that are unknown to the model (for example: ▬).

For line breaks, I split the text (text.split("\n")) and translate each piece individually. Emojis get translated as ??, so I convert them into numbers (for example 123). But isn't there a cleaner solution, such as keeping the Unicode emojis, handling the unknown characters, and making style markers like ** stay attached to the text? They get split off from time to time.

I know this question is not necessarily related to OpenNMT, but I did not find any resource dealing with this topic.

This resource talks about the subject but does not cover unknown characters and style markers: Restoring of source formatting - #13 by oraveczcsaba

1 Like

Hello!

Great questions. Let me split them into multiple sections: 1) respecting paragraphs; 2) placing back Unicode characters (or other untranslatables) into the translation.

1. Respecting paragraph format

I used to do as you said. Eventually, I wrote these two helper functions; paragraph_tokenizer stores newline indices of the source text, and paragraph_detokenizer restores the newlines back into the translation text.

from nltk import sent_tokenize


def paragraph_tokenizer(text):
    """ Replace sentences with their indexes, and store indexes of newlines
    Args:
        text (str): Text to be indexed

    Returns:
        sentences (list): List of sentences
        breaks (list): List of indexes of sentences and newlines
    """
    text = text.strip()
    paragraphs = text.splitlines(True)

    breaks = []
    sentences = []

    for paragraph in paragraphs:
        if paragraph == "\n":
            breaks.append("\n")
        else:
            paragraph_sentences = sent_tokenize(paragraph)
            breaks.extend(range(len(sentences), len(sentences) + len(paragraph_sentences)))
            breaks.append("\n")
            sentences.extend(paragraph_sentences)

    # Remove the last newline
    breaks = breaks[:-1]

    return sentences, breaks


def paragraph_detokenizer(sentences, breaks):
    """Restore the original paragraph format from the indexes of sentences and newlines

    Args:
        sentences (list): List of sentences
        breaks (list): List of indexes of sentences and newlines

    Returns:
        text (str): Text with original format
    """
    output = []

    for br in breaks:
        if br == "\n":
            output.append("\n")
        else:
            output.append(sentences[br] + " ")

    text = "".join(output)
    return text

You can test the functions as follows.

text = """
First paragraph. This is the first sentence. This is the second sentence. This is the third sentence.
Second paragraph. This is the first sentence. This is the second sentence. This is the third sentence.



Third paragraph. This is the first sentence. This is the second sentence. This is the third sentence.

Fourth paragraph. This is the first sentence. This is the second sentence. This is the third sentence.
Fifth paragraph. This is the first sentence. This is the second sentence. This is the third sentence.


"""

sentences, breaks = paragraph_tokenizer(text)
print(len(sentences))
print(breaks)

formatted_text = paragraph_detokenizer(sentences, breaks)
print(formatted_text)


2. Copy Unicode Characters or Untranslatables

This consists of a few requirements:

  1. While training a SentencePiece model, activate the option --byte_fallback to decompose unknown pieces into UTF-8 byte pieces.
  2. Then, in whatever MT inference engine you use, activate the option that replaces an unknown target token by the source token with the highest attention. In CTranslate2, enable replace_unknowns to copy these to the target.
  3. For #1 and #2 to work, these characters/untranslatables must be in your training data (either during the original training, fine-tuning, or both).
  4. If you have sophisticated untranslatables (e.g. tags, codes, etc.), you might need to run a pre-processing step to replace them with a simpler token like <1> or DONT1 incrementing the number, and a post-processing step to replace them back. You can find more about the topic in this paper.
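As a concrete illustration of step 4, here is a minimal, hedged sketch of the pre-processing/post-processing idea. The tag pattern and the `<1>`, `<2>` placeholder format are illustrative choices, not part of any OpenNMT or CTranslate2 API:

```python
import re

# Illustrative pattern: bold (**), underline (__), and angle-bracket codes.
TAG_PATTERN = re.compile(r"(\*\*|__|<[^>]+>)")


def mask_untranslatables(text):
    """Replace each tag with a numbered placeholder <1>, <2>, ... and return the mapping."""
    mapping = {}

    def repl(match):
        token = f"<{len(mapping) + 1}>"
        mapping[token] = match.group(0)
        return token

    return TAG_PATTERN.sub(repl, text), mapping


def unmask_untranslatables(text, mapping):
    """Put the original tags back in place of the placeholders."""
    for token, original in mapping.items():
        text = text.replace(token, original, 1)
    return text


masked, mapping = mask_untranslatables("**Bonjour** <lol>")
# masked == "<1>Bonjour<2> <3>"
restored = unmask_untranslatables(masked, mapping)
# restored == "**Bonjour** <lol>"
```

In practice, the placeholder tokens would need to survive translation intact (e.g. by adding them as user-defined symbols or protecting them in the tokenizer), which is exactly why the paper linked above treats this as its own problem.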

Just find-replace “▬” with nothing. If the word is unknown and copied, this is good. Improving the translation is more about the quality of the trained model.

The aforementioned process should help. Start with #1 and #2.

You might add “**” as a special token to your SentencePiece model with the option --user_defined_symbols. Still, this might affect the quality of the translation if they are not seen in the training data, so you have to experiment with and without it.

I hope this helps.

Kind regards,
Yasmin

3 Likes

Hi, thank you for your very complete answer :heart:

I tried the following:

import sentencepiece as spm

sp = spm.SentencePieceProcessor("spm.128k.model")

print(sp.encode("**"))
print(sp.encode(":)"))
print(sp.encode("😀"))

Output:

[16245]
[3825]
[34192]

print(sp.encode("**Merci beaucoup** 🙏"))

Output:

[16245, 119195, 27856, 28720, 14005, 119132, 124407]

So, all these tokens are already recognized by the SentencePiece model. Note that if ** is stuck to the text, the opening one will be encoded as “▁**” while the closing one will be encoded as “**”.

How they are translated is a completely different issue. As the model is not specifically trained for handling markup, sometimes it will work; sometimes, it will not. So if this is really critical, you can either try: 1) some pre-processing/post-processing ideas; or 2) fine-tuning the model on markup text.

Kind regards,
Yasmin

Ah yes, indeed. Here is the code I used to train it, but as you say, it is already recognized:


spm.SentencePieceTrainer.train(input='data_dict.128k.txt', model_prefix='model', vocab_size=128000, user_defined_symbols=['__', '**'], byte_fallback=True, model_type="bpe")

sp = spm.SentencePieceProcessor(model_file='model.model')

print(sp.encode_as_pieces('Hello world'))

So I would have to fine-tune this model on these tags; however, it is a multilingual model…
In pre-processing, I replace the emojis with another character. However, this character ▬ often appears many times in a row, so the model ends up repeating the character too many times.

1 Like

Very interesting post!

If you want to go down this path, you can fine-tune the model on only the languages you care about most.

As Guillaume suggested, it might help to increase the beam size. Also, try to use repetition_penalty with a float value above 1. It helps a bit; sometimes, it prevents exact repetitions, but still does not prevent repetitions with a similar meaning.

As this mostly happens with shorter source texts, here is a trick that might help. I am not sure how the original training data was prepared, but I noticed that adding a sort of start and end token at inference time can help with short sentences (so you might limit this to a specific source length).

translator.translate_batch([["▁", "▁Merci", "▁"]], target_prefix=[["__en__"]])

Without the extra “▁” tokens, it is translated as “Thank you thanks”. With them, it is translated as “Thanks”. Obviously, you might get these “▁” tokens in the translation, and you will have to get rid of them with simple find-replace.

These tests were with the M2M-100 418M-parameter model. If you want to try the 1.2B-parameter model, I have converted both models into the CTranslate2 format and uploaded them here.

Thanks a lot to @guillaumekln for supporting the M2M-100 model, and to you @Jourdelune for sharing your interesting experiments.

Kind regards,
Yasmin

3 Likes

I think repetition_penalty doesn’t exist anymore:

File "/root/TranslationBot/Api/ApiTest.py", line 78, in <lambda>
    prediction = await run_in_threadpool(lambda: model_class.translate(item.message, item.input, item.output))
  File "/root/TranslationBot/Api/ApiTest.py", line 49, in translate
    results = self.translator.translate_batch([tokens], target_prefix=[["%s" % output]], repetition_penalty=1, return_scores=False, beam_size=2, replace_unknowns=True)
TypeError: translate_batch(): incompatible function arguments. The following argument types are supported:
    1. (self: ctranslate2.translator.Translator, source: List[List[str]], target_prefix: Optional[List[Optional[List[str]]]] = None, *, max_batch_size: int = 0, batch_type: str = 'examples', asynchronous: bool = False, beam_size: int = 2, num_hypotheses: int = 1, length_penalty: float = 0, coverage_penalty: float = 0, prefix_bias_beta: float = 0, allow_early_exit: bool = True, max_decoding_length: int = 250, min_decoding_length: int = 1, use_vmap: bool = False, normalize_scores: bool = False, return_scores: bool = False, return_attention: bool = False, return_alternatives: bool = False, sampling_topk: int = 1, sampling_temperature: float = 1, replace_unknowns: bool = False) -> Union[List[ctranslate2.translator.TranslationResult], List[ctranslate2.translator.AsyncTranslationResult]]

Invoked with: <ctranslate2.translator.Translator object at 0x7fcd1854be70>, [['pt', '▁Ce', '▁greu', '▁imi', '▁e', '▁fara', '▁black', '▁des', 'ert', '▁123']]; kwargs: target_prefix=[['en']], repetition_penalty=1, return_scores=False, beam_size=2, replace_unknowns=True

but it is in the doc:

In my case I have to translate, for example: “Hello ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬”.
The repetition still occurs despite adding an empty token at the beginning.
I cannot fine-tune for only one language because in my use case (live translation on a social network) any language can be used.

And indeed, huge thanks to guillaumekln for CTranslate2 (which saves my life lol) and for the M2M-100 support.

This should be a float number higher than 1, e.g. 1.2

Kind regards,
Yasmin

Yes, indeed; however, the result is the same (I had put 1 to test because I was getting the error with float values too).
The complete error:

[2022-02-07 20:48:34 +0000] [80492] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/uvicorn/protocols/http/httptools_impl.py", line 376, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.9/dist-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/fastapi/applications.py", line 211, in __call__
    await super().__call__(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/usr/local/lib/python3.9/dist-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.9/dist-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/usr/local/lib/python3.9/dist-packages/fastapi/routing.py", line 226, in app
    raw_response = await run_endpoint_function(
  File "/usr/local/lib/python3.9/dist-packages/fastapi/routing.py", line 159, in run_endpoint_function
    return await dependant.call(**values)
  File "/root/TranslationBot/Api/ApiTest.py", line 78, in run_prediction
    prediction = await run_in_threadpool(lambda: model_class.translate(item.message, item.input, item.output))
  File "/usr/local/lib/python3.9/dist-packages/starlette/concurrency.py", line 39, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/usr/local/lib/python3.9/dist-packages/anyio/to_thread.py", line 28, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(func, *args, cancellable=cancellable,
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 818, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.9/dist-packages/anyio/_backends/_asyncio.py", line 754, in run
    result = context.run(func, *args)
  File "/root/TranslationBot/Api/ApiTest.py", line 78, in <lambda>
    prediction = await run_in_threadpool(lambda: model_class.translate(item.message, item.input, item.output))
  File "/root/TranslationBot/Api/ApiTest.py", line 49, in translate
    results = self.translator.translate_batch([tokens], target_prefix=[["__%s__" % output]], return_scores=False, beam_size=2, repetition_penalty=1.2, replace_unknowns=True)
TypeError: translate_batch(): incompatible function arguments. The following argument types are supported:
    1. (self: ctranslate2.translator.Translator, source: List[List[str]], target_prefix: Optional[List[Optional[List[str]]]] = None, *, max_batch_size: int = 0, batch_type: str = 'examples', asynchronous: bool = False, beam_size: int = 2, num_hypotheses: int = 1, length_penalty: float = 0, coverage_penalty: float = 0, prefix_bias_beta: float = 0, allow_early_exit: bool = True, max_decoding_length: int = 250, min_decoding_length: int = 1, use_vmap: bool = False, normalize_scores: bool = False, return_scores: bool = False, return_attention: bool = False, return_alternatives: bool = False, sampling_topk: int = 1, sampling_temperature: float = 1, replace_unknowns: bool = False) -> Union[List[ctranslate2.translator.TranslationResult], List[ctranslate2.translator.AsyncTranslationResult]]

Invoked with: <ctranslate2.translator.Translator object at 0x7f7f85cb0630>, [['__en__', '▁123', '▁**', 'M', 'IG', 'HT', 'Y', '▁MIN', 'ION', '▁SOL', '▁G', 'IV', 'EA', 'W', 'AY', '▁#', '15', '!', '**', '▁123', '▁@', 'e', 'very', 'one', '▁123', '123', '**', 'PR', 'IZ', 'E', ':', '▁$', '20', '▁SOL', '**', '▁123', '▁123', '123', '▁1.', '▁RE', 'ACT', '▁123', '▁to', '▁this', '▁G', 'IV', 'EA', 'W', 'AY', '▁P', 'OST', '123', '▁2.', '▁I', '▁will', '▁do', '▁a', '▁random', '▁sc', 'roll', '▁and', '▁choose', '▁1', '▁person', '!']]; kwargs: target_prefix=[['__hi__']], return_scores=False, beam_size=2, repetition_penalty=1.2, replace_unknowns=True

You should probably upgrade CTranslate2 to a newer version.

1 Like

Hi, sorry for the late reply; these last days were very busy. I updated CTranslate2 and indeed it works better :). However, I still do not get the same number of characters in the output. I will try to modify the tokenizer to see what it gives.

I tested an approach that groups a sequence of repeated characters and transforms it into a single character; then, once translated, it replaces this character with the original sequence. That allows me to go from:

__**Welcome to the Interaction support server**__

**InteractionBot is a translator discord bot easy to use with a dashboard and a customizable behavior.**


🔗 web site and dashboard  
 Bot invite <lol>
 📥 Support channel: 

**▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬** 

to

___ Bienvenue sur le serveur de support de l'interaction**__

**InteractionBot est un bot de discorde de traducteur facile à utiliser avec un panneau de bord et un comportement personnalisable.**


🔗 sites Web et dashboard
L’invitation <lol>
📥 Channels de support :

* ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ *

but the ** tags are not kept, which is important
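For the record, the run-grouping step described above can be sketched like this. This is a self-contained illustration, not the exact code used; it assumes the translation leaves each single placeholder character untouched and in order (if the model drops or duplicates one, the restoration misaligns, which is the failure mode discussed in this thread):

```python
import re


def compress_runs(text, char="▬"):
    """Collapse each run of `char` into a single occurrence, storing run lengths."""
    pattern = re.escape(char) + "+"
    lengths = [len(m.group(0)) for m in re.finditer(pattern, text)]
    return re.sub(pattern, char, text), lengths


def expand_runs(text, lengths, char="▬"):
    """Expand each single `char` back to its original run length, in order."""
    it = iter(lengths)
    return re.sub(re.escape(char), lambda m: char * next(it), text)


compressed, lengths = compress_runs("Hello ▬▬▬▬▬ world ▬▬")
# compressed == "Hello ▬ world ▬", lengths == [5, 2]
expanded = expand_runs(compressed, lengths)
# expanded == "Hello ▬▬▬▬▬ world ▬▬"
```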

After some thought, I am starting to have ideas. I said to myself: why not break down the text to extract only the parts to be translated, and replace them with the translated text? Of course, I still need to replace the emojis with the special character.

import markdown
from bs4 import BeautifulSoup 

def md_to_text(md):
    html = markdown.markdown(md)
    soup = BeautifulSoup(html, features='html.parser')
    return soup.get_text()

text1 = """
__**Welcome to the Interaction support server**__

**InteractionBot is a translator discord bot easy to use with a dashboard and a customizable behavior.**


🔗 web site and dashboard
 Bot invite <lol>
 📥 Support channel:

**▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬**
"""

text = md_to_text(text1)
text = text.split("\n")
for i in text:
    if i.strip():  # skip empty lines: text1.replace("", ...) would insert everywhere
        text1 = text1.replace(i, translate(i))

print(text1)

1 Like

Good idea in general, just keep in mind:

  • How will you handle the markdown if it is in the middle, not at the end and start of a segment?
  • No need to translate segment by segment, as CTranslate2 can translate a list of sentences at the same time, which is more efficient.

All the best,
Yasmin

I know that @Jourdelune has already adjusted this, but just for the record for anyone coming across this post: the M2M-100 model requires both a target prefix and a source prefix. This improves translation quality and minimizes unnecessary repetitions.

translator.translate_batch(
    [["__fr__",  "▁Bonjour", "!"], ["__fr__", "▁Merci", "▁beaucoup"]],
    target_prefix=[["__en__"], ["__en__"]],
)
import requests
import markdown
import json
import re

from bs4 import BeautifulSoup


text1 = """
__**Welcome to the Interaction support server**__

**InteractionBot is a __translator__ discord bot easy to use with a dashboard and a customizable behavior.**


🔗 web site and dashboard
 Bot invite <lol>
 📥 Support channel:

**▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬**
"""

emoji = re.compile(r"|".join((re.escape(c) for c in sorted(json.load(open("data/emoji.json")), key=len, reverse=True))))
BASE_URL = "http://ip"
pattern = "<(.*?)>"

delimit_char = "123"
item = dict()


def translate(message):
    rep = requests.post(f'{BASE_URL}/translation', json={"input": "en", "output": "fr", "message": message})

    return rep.text[1:-1]


def find_emoji(message):
    """Find Unicode emojis in the message."""
    return emoji.findall(message)


def tokenize_msg(message):
    item = dict()

    # Record the position of each delimiter string already present in the
    # text, so it can be restored as itself after translation.
    start = 0
    for _ in range(message.count(delimit_char)):
        pos = message.find(delimit_char, start)
        item[pos] = delimit_char
        start = pos + len(delimit_char)

    for i in re.findall(r"((.)\2*)", message):
        for element in i:
            if element == "▬" and len(i[0]) != 1:
                item[message.find(i[0])] = i[0]
                message = message.replace(i[0], delimit_char, 1)

    for i in range(message.count("▬")):
        item[message.find("▬")] = "▬"
        message = message.replace("▬", delimit_char, 1)

    for i in re.findall(pattern, message):
        item[message.find(f"<{i}>")] = f"<{i}>"
        message = message.replace(f"<{i}>", delimit_char, 1)

    for i in find_emoji(message):
        item[message.find(i)] = i
        message = message.replace(i, delimit_char, 1)

    return [message, item]


def detokenize_msg(message, lst):
    index = 0
    items = sorted(lst.items())

    # Replace each placeholder, in positional order, with its stored token.
    for _, token in items:
        message = message[:index] + message[index:].replace(delimit_char, token, 1)
        index = message.find(token) + len(token)

    return message


def clean_md(message):
    html = markdown.markdown(message)
    soup = BeautifulSoup(html, features='html.parser')

    a = soup.get_text("|")
    a = a.split("\n")

    lst_message = message.split("\n")

    # Filter blank lines (removing items while iterating would skip elements).
    lst_message = [i for i in lst_message if i.replace(" ", "") != ""]

    for i, item in enumerate(a):
        if item and item[0] == "|" and item[-1] == "|":
            message = message.replace(lst_message[i][2:-2], lst_message[i][2:-2].replace("**", "").replace("__", ""))

    return message, BeautifulSoup(markdown.markdown(message), features='html.parser').get_text()


text1, text = clean_md(text1)
text = text.split("\n")

for i in text:
    if i.replace("\n", "").replace(" ", "") != "":
        item = tokenize_msg(i)
        text1 = text1.replace(i, detokenize_msg(translate(item[0].replace("\n", "")), item[1]))
    else:
        text1 += "\n"

print(text1)

output:

__**Bienvenue sur le serveur d’assistance**__

**InteractionBot est un bot de discorde de traducteur facile à utiliser avec un panneau de bord et un comportement personnalisable.**


🔗 sites Web et dashboard
Bot invite<lol>
📥 Channels de support :

**▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬**

Here is my code. The best would be to fine-tune the model, but I do not have the skills or the compute for that, so here is what I did to try to solve the problem.

1 Like