OpenNMT

Handling single-word 1-to-many translation

Hello,

I have been adding some glossary entries as segments to train my model, so that it learns to translate single words and not just sentences. So far, I have been removing duplicate occurrences of the same source word: if one source word had more than one translation, only one was kept.

Would leaving the “1 to many” entries in there cause any issue in training? I guess all of them would get a higher weight?

Let me know your thoughts!

Best regards,
Samuel

Hi Samuel!

I would leave them! The model will find the right weights from the whole corpus anyhow. The main purpose of adding one-word segments is, as you said, to help the model translate single words and shorter sentences.
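As a small sketch of what "leaving them in" means in practice (the glossary data here is made up for illustration): instead of deduplicating, each source word / target translation pair becomes its own parallel line.

```python
# Sketch: keep every translation of a source word as its own parallel
# segment instead of deduplicating. The glossary dict is illustrative.
glossary = {"parapluie": ["umbrella", "parasol"]}

src_lines, tgt_lines = [], []
for src, targets in glossary.items():
    for tgt in targets:
        # One line per (source, target) pair in the parallel corpus
        src_lines.append(src)
        tgt_lines.append(tgt)

print(src_lines)  # ['parapluie', 'parapluie']
print(tgt_lines)  # ['umbrella', 'parasol']
```

The model then learns a distribution over the possible translations from how often each pair occurs in the corpus.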

You can also add start and end tokens to the source, as Guillaume suggested here:
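A minimal sketch of that idea (the token strings below are placeholders I chose for illustration; use whatever special tokens your tokenizer/vocabulary actually reserves):

```python
# Hypothetical sketch: wrap single-word glossary entries in start/end
# tokens so the model can distinguish them from full sentences.
START, END = "<s>", "</s>"  # assumed reserved tokens; adjust for your setup

def wrap_entry(source_word: str) -> str:
    """Return the glossary word framed by the reserved tokens."""
    return f"{START} {source_word} {END}"

print(wrap_entry("parapluie"))  # <s> parapluie </s>
```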

You can also use a dictionary. For example, if your target language is supported by WordNet, you can have something like this:

from nltk.corpus import wordnet

# Look up the French word in WordNet via the Open Multilingual Wordnet
# (requires the NLTK "wordnet" and "omw-1.4" data packages).
for synset in wordnet.synsets('parapluie', lang="fra"):
    # synset.name() looks like "umbrella.n.01"; keep only the lemma part
    word_name = "\n• " + synset.name().split(".")[0].replace("_", " ")
    word_pos = synset.pos()
    print(word_name + " (" + word_pos + ")")
    print(synset.definition())
    examples = synset.examples()
    if len(examples) > 0:
        print("» Examples:", *examples, sep="\n* ")

Output:

• umbrella (n)
having the function of uniting a group of similar things
» Examples:

  • the Democratic Party is an umbrella for many liberal groups
  • under the umbrella of capitalism

• umbrella (s)
covering or applying simultaneously to a number of similar items or elements or groups
» Examples:

  • an umbrella organization
  • umbrella insurance coverage

• umbrella (n)
a formation of military planes maintained over ground operations or targets
» Examples:

  • an air umbrella over England

• umbrella (n)
a lightweight handheld collapsible canopy

Kind regards,
Yasmin


Hello Yasmin,

Thank you for the reply, it was very helpful!
