OpenNMT

Handling single-word 1-to-many translation

Hello,

I have been adding some glossary entries as segments to train my model, so that it learns to translate single words and not just sentences. So far, I have been removing duplicate occurrences of the same source word: if one source word had more than one translation, only one was kept.

Would leaving the “1 to many” entries in there cause any issue in training? I guess all of them would get a higher weight?

Let me know your thoughts!

Best regards,
Samuel

Hi Samuel!

I would leave them! The model will find the right weights from the whole corpus anyhow. The main purpose of adding one-word segments is, as you said, to help the model translate single words and shorter sentences.
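As a small sketch of what "leaving them in" means in practice (the glossary data here is made up for illustration): instead of deduplicating, each source word / target translation pair becomes its own parallel line.

```python
# Sketch: keep every translation of a source word as its own parallel
# segment instead of deduplicating. The glossary dict is illustrative.
glossary = {"parapluie": ["umbrella", "parasol"]}

src_lines, tgt_lines = [], []
for src, targets in glossary.items():
    for tgt in targets:
        # One line per (source, target) pair in the parallel corpus
        src_lines.append(src)
        tgt_lines.append(tgt)

print(src_lines)  # ['parapluie', 'parapluie']
print(tgt_lines)  # ['umbrella', 'parasol']
```

The model then learns a distribution over the possible translations from how often each pair occurs in the corpus.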

You can also add start and end tokens to the source, as Guillaume suggested here:
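A minimal sketch of that idea (the token strings below are placeholders I chose for illustration; use whatever special tokens your tokenizer/vocabulary actually reserves):

```python
# Hypothetical sketch: wrap single-word glossary entries in start/end
# tokens so the model can distinguish them from full sentences.
START, END = "<s>", "</s>"  # assumed reserved tokens; adjust for your setup

def wrap_entry(source_word: str) -> str:
    """Return the glossary word framed by the reserved tokens."""
    return f"{START} {source_word} {END}"

print(wrap_entry("parapluie"))  # <s> parapluie </s>
```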

You can also use a dictionary. For example, if your target language is supported by WordNet, you can have something like this:

from nltk.corpus import wordnet

# Look up the French word in WordNet via the Open Multilingual Wordnet
# (requires the NLTK "wordnet" and "omw-1.4" data packages).
for synset in wordnet.synsets('parapluie', lang="fra"):
    # synset.name() looks like "umbrella.n.01"; keep only the lemma part
    word_name = "\n• " + synset.name().split(".")[0].replace("_", " ")
    word_pos = synset.pos()
    print(word_name + " (" + word_pos + ")")
    print(synset.definition())
    examples = synset.examples()
    if len(examples) > 0:
        print("» Examples:", *examples, sep="\n* ")

Output:

• umbrella (n)
having the function of uniting a group of similar things
» Examples:

  • the Democratic Party is an umbrella for many liberal groups
  • under the umbrella of capitalism

• umbrella (s)
covering or applying simultaneously to a number of similar items or elements or groups
» Examples:

  • an umbrella organization
  • umbrella insurance coverage

• umbrella (n)
a formation of military planes maintained over ground operations or targets
» Examples:

  • an air umbrella over England

• umbrella (n)
a lightweight handheld collapsible canopy

Kind regards,
Yasmin


Hello Yasmin,

Thank you for the reply, it was very helpful!
