OpenNMT Forum

Suggestions for translating XML

I’m looking into adding XML support to Argos Translate (#23).

The difficult part is that tags in the source sentence need to be placed correctly in the target sentence, for example:

Joey is a <b>good</b> dog!
¡Joey es un <b>buen</b> perro!

This clearly needs to be done by the seq2seq model because words within a tag need to be translated in the context of the surrounding words.

I’ve tried writing some code to normalize tags in the input dataset into a standard format:

Joey likes chasing <i>Sylvìlagus floridanus</i> and <i>Marmota monax</i>.
Joey likes chasing <x>Sylvìlagus floridanus</x> and <x>Marmota monax</x>.

Joey likes <a href="https://www.wegmans.com">store brand dog food</a>.
Joey likes <x>store brand dog food</x>.
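For illustration, here is a minimal sketch of that kind of normalization. It is not the actual Argos Translate code; the regex and the function name normalize_tags are my own assumptions, and it only handles simple paired tags (self-closing tags, comments, and CDATA are left untouched).

```python
import re

# Assumption: tags look like <name ...> or </name>; self-closing tags such as
# <br/> are not matched and are left as they are.
TAG_RE = re.compile(r"<(/?)[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?>")


def normalize_tags(text: str) -> str:
    """Replace every opening tag with <x> and every closing tag with </x>."""
    # group(1) is "/" for a closing tag and "" for an opening tag.
    return TAG_RE.sub(lambda m: f"<{m.group(1)}x>", text)


if __name__ == "__main__":
    print(normalize_tags(
        'Joey likes <a href="https://www.wegmans.com">store brand dog food</a>.'
    ))
```

Attributes are simply dropped here, so the original tags would have to be stored separately if they need to be restored later.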

Then at inference time I could use the standardized tags to place the original tags in the output. The problem is that most of the data I’m currently using for Argos Translate contains only a handful of tags in this format, which is likely not sufficient.

My current plan is to try to find/generate more data in this format but any suggestions for better strategies are greatly appreciated!

Reference:

Hello!

My understanding is that this should be handled at training time, with something like the following:

1. Training time:

  • Pre-process your training data to replace any form of tag with a special token, say >>tag<<
  • Train your model on this data

2. Translation time:

  • Build a list of the tags in the source text to know the order
  • Replace all the tags in the source text with the same special token you used during training
  • Use your ordered list to add the tags back to the target MT output by replacing the special token.
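The translation-time steps could be sketched roughly as below. This is only an illustration under some assumptions: the names mask_tags and restore_tags and the regex are hypothetical, the MT output is assumed to contain exactly as many >>tag<< tokens as the source, in the same order, and the whitespace clean-up around restored tags is a simple heuristic.

```python
import re

# Assumption: anything that looks like <name ...> or </name> is a tag.
TAG_RE = re.compile(r"</?[A-Za-z][^<>]*>")
PLACEHOLDER = ">>tag<<"


def mask_tags(source: str):
    """Return the source with tags replaced by the placeholder, plus the ordered tag list."""
    tags = TAG_RE.findall(source)
    # Pad with spaces so the placeholder is separated from neighbouring words.
    masked = TAG_RE.sub(f" {PLACEHOLDER} ", source)
    return " ".join(masked.split()), tags


def restore_tags(translation: str, tags):
    """Put the original tags back in order, one per placeholder."""
    parts = translation.split(PLACEHOLDER)
    if len(parts) - 1 != len(tags):
        raise ValueError("placeholder count does not match tag count")
    out = parts[0]
    for tag, part in zip(tags, parts[1:]):
        if tag.startswith("</"):
            # Closing tag: attach it to the preceding word.
            out = out.rstrip() + tag + part
        else:
            # Opening tag: attach it to the following word.
            out = out + tag + part.lstrip()
    return out


if __name__ == "__main__":
    masked, tags = mask_tags("Joey is a <b>good</b> dog!")
    print(masked)  # this is what would be sent to the MT model
    # Hand-written stand-in for the model output:
    print(restore_tags("¡Joey es un >>tag<< buen >>tag<< perro!", tags))
```

If the model drops or duplicates a placeholder, restore_tags raises an error instead of guessing, so the caller can fall back to translating without tags.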

As you said, this might require introducing more tags in your training data; however, as you now use a special token for all tags, you do not have to worry about the exact tag format, i.e. augmenting your data with this special token should be enough.

Hope this helps.

Kind regards,
Yasmin


Thanks for the response @ymoslem! Any suggestions for generating training data?

There currently isn’t much valid tag data in the Opus data I’ve looked at, so my best idea so far is to combine Opus data with Wiktionary definitions to generate tag data with the tag around a single word.

For example:

Opus:
Joey is scared of swimming.
Joey tiene miedo de nadar.

Wiktionary:
swim -> nadar

Generated data:
Joey is scared of <x>swimming</x>.
Joey tiene miedo de <x>nadar</x>.
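A rough sketch of how such pairs might be generated is below. The function name tag_word_pair is hypothetical, and it assumes the caller already knows the surface forms to wrap; mapping a dictionary entry like "swim" to the inflected "swimming" would need lemmatization, which is not shown here.

```python
import re


def tag_word_pair(src, tgt, src_word, tgt_word, tag="x"):
    """Wrap src_word in src and tgt_word in tgt with the same tag.

    Returns None unless each word occurs exactly once as a whole word,
    so ambiguous sentence pairs are skipped rather than tagged wrongly.
    """
    def wrap(sentence, word):
        pattern = re.compile(rf"\b{re.escape(word)}\b")
        if len(pattern.findall(sentence)) != 1:
            return None
        return pattern.sub(lambda m: f"<{tag}>{m.group(0)}</{tag}>", sentence)

    new_src = wrap(src, src_word)
    new_tgt = wrap(tgt, tgt_word)
    if new_src is None or new_tgt is None:
        return None
    return new_src, new_tgt


if __name__ == "__main__":
    print(tag_word_pair(
        "Joey is scared of swimming.",
        "Joey tiene miedo de nadar.",
        "swimming",
        "nadar",
    ))
```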

I think I also want some data with more than a single word inside the tag, though. There’s probably a way to do this with something like LASER, but that would be pretty involved.

Thanks,

P.J.

Hi P.J.,

I agree that the approach introduced by Hanneman and Dinu (2020) is worth trying. More explanation of it can be found here.

I do not really think having one word vs multiple words is a big deal. You just make it random to teach the model how to use the tags. You can still add the closing tag one or two words after the defined word, regardless of whether they are matching or not.
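As a small sketch of the random placement idea (the function name wrap_random_span and its parameters are made up for illustration): open the tag before a chosen word and close it zero to two words later. The tag tokens are inserted as separate whitespace-delimited tokens, and punctuation stays attached to its word.

```python
import random


def wrap_random_span(sentence, start, rng, max_extra=2, tag="x"):
    """Open a tag before the word at index `start` and close it after a
    span that extends 0..max_extra words further (clamped to the sentence end)."""
    words = sentence.split()
    end = min(len(words) - 1, start + rng.randint(0, max_extra))
    # Insert the closing token first so that `start` is still a valid index.
    words.insert(end + 1, f"</{tag}>")
    words.insert(start, f"<{tag}>")
    return " ".join(words)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print(wrap_random_span("Joey is scared of swimming.", 1, rng))
    print(wrap_random_span("Joey tiene miedo de nadar.", 1, rng))
```

Source and target spans are drawn independently here, so they will not always cover matching words, which is the situation described above.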

One thing to note, your special token must be unique, separated from the word(s), AND cannot be sub-tokenized.
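The first two conditions can be checked with a few lines of code; the helper names below are hypothetical. The third condition depends on the tokenizer and is not shown. With SentencePiece, for instance, registering the token as a user-defined symbol at training time is one way to keep it whole.

```python
def is_unused(placeholder, corpus_lines):
    """True if the placeholder never occurs in the raw corpus, i.e. it is unique."""
    return not any(placeholder in line for line in corpus_lines)


def separate_placeholder(text, placeholder):
    """Make sure every placeholder is surrounded by single spaces."""
    padded = text.replace(placeholder, f" {placeholder} ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(padded.split())


if __name__ == "__main__":
    corpus = ["Joey is a good dog!", "¡Joey es un buen perro!"]
    print(is_unused(">>tag<<", corpus))
    print(separate_placeholder("un>>tag<<buen>>tag<< perro", ">>tag<<"))
```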

Kind regards,
Yasmin


I had missed section 3.1 in the linked paper. Pierro pointed out that the paper describes using matching sub-translations to generate tag data.

Hi

My experience as a user is that heavily tagged files, such as DITA or HTML, are a pain for MT models. You cannot process them as they are. You need a parser to extract the translatable content, and then probably another inline parser for that translatable content (which is what you feed to the MT system). You can probably find a parser for DITA files, but I am not so sure about complex HTML files. And then there is the proposed reduction of tags into tag families for training (as a variable substitution, I guess).

I have seen models that just treat tags as words, models with basic approaches (probably like yours), and models that can perform very clever tag manipulation (for example, changing tag order or even translating translatable content inside the tags, such as alt="XXXX"). But again, all of these were in very special contexts (translation tools with custom parsers and internal corpora).

You can probably emulate tag behavior using quotes or parentheses in an untagged corpus; there are usually many instances.
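A minimal sketch of that idea, assuming straight ASCII double quotes on both sides (real corpora also use other quote styles such as «», which would need their own patterns). The function name quotes_to_tags and the example sentences are made up for illustration.

```python
import re

# Assumption: quoted spans use plain ASCII double quotes.
QUOTED_RE = re.compile(r'"([^"]+)"')


def quotes_to_tags(src, tgt, tag="x"):
    """Turn "..." spans into <x>...</x> in both sentences.

    Returns None if there are no quoted spans or if the counts differ,
    since the spans could not be paired up reliably in that case.
    """
    n_src = len(QUOTED_RE.findall(src))
    n_tgt = len(QUOTED_RE.findall(tgt))
    if n_src == 0 or n_src != n_tgt:
        return None
    repl = rf"<{tag}>\1</{tag}>"
    return QUOTED_RE.sub(repl, src), QUOTED_RE.sub(repl, tgt)


if __name__ == "__main__":
    print(quotes_to_tags('Joey says "woof" loudly.', 'Joey dice "guau" en voz alta.'))
```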

But I guess the usual source of tagged data is translation memories (TMX) built from HTML or DITA segments. Some companies probably have hundreds of millions of words (and money) to play with in their documentation (DITA, HTML, or MD files).

What I also see is that the translation industry is trying to move as much as it can to untagged translation for several reasons (not only because of MT, but also because of the tools used, i.e. parsers or translation editors).

Well, I probably did not help you. :)
Have a nice day!
Miguel
