Suggestions for translating XML

I’m looking into adding XML support to Argos Translate (#23).

The difficult part is that tags in the source sentence need to be placed correctly in the target sentence, e.g.:

Joey is a <b>good</b> dog!
¡Joey es un <b>buen</b> perro!

This clearly needs to be done by the seq2seq model because words within a tag need to be translated in the context of the surrounding words.

I’ve tried writing some code to normalize tags in the input dataset into a standard format:

Joey likes chasing <i>Sylvilagus floridanus</i> and <i>Marmota monax</i>.
Joey likes chasing <x>Sylvilagus floridanus</x> and <x>Marmota monax</x>.

Joey likes <a href="">store brand dog food</a>.
Joey likes <x>store brand dog food</x>.

Then at inference I could use the standardized tags to place tags in the output. The issue with this is that most of the data I’m currently using for Argos Translate only contains a handful of tags in this format, which is likely not sufficient.
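
For illustration, this normalization step can be sketched with a regular expression that collapses any tag into the standardized `<x>`/`</x>` form while remembering the originals so they can be restored after translation. The `normalize_tags` helper is hypothetical, not actual Argos Translate code:

```python
import re

# Matches any opening or closing tag, e.g. <b>, </b>, <a href="">
TAG_RE = re.compile(r"<(/?)[a-zA-Z][^>]*>")

def normalize_tags(text):
    """Return (normalized_text, list_of_original_tags_in_order)."""
    originals = []
    def repl(match):
        originals.append(match.group(0))
        # Closing tags (group 1 == "/") become </x>, opening tags become <x>
        return "</x>" if match.group(1) else "<x>"
    return TAG_RE.sub(repl, text), originals

norm, originals = normalize_tags('Joey likes <a href="">store brand dog food</a>.')
# norm      == 'Joey likes <x>store brand dog food</x>.'
# originals == ['<a href="">', '</a>']
```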

My current plan is to try to find/generate more data in this format but any suggestions for better strategies are greatly appreciated!



My understanding is that this should be handled at training time; something like the following:

1. Training time:

  • pre-process your training data to replace any form of tags with a special token, say >>tag<<
  • train your model on this data

2. Translation time:

  • Build a list of the tags in the source text to know the order
  • Replace all the tags in the source text with the same special token you used during training
  • Use your ordered list to add the tags back to the target MT output by replacing the special token.

As you said, this might require introducing more tags into your training data; however, since you now use one special token for all tags, you do not have to worry about the exact tag format, i.e. augmenting your data with this special token should be enough.
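
The translation-time steps above can be sketched like this (the `>>tag<<` token and the helper names are assumptions, matching the training-time placeholder):

```python
import re

SPECIAL = ">>tag<<"              # training-time placeholder (assumption)
TAG_RE = re.compile(r"</?[^>]+>")

def mask_tags(source):
    """Record the tags in order, then replace each with the special token."""
    tags = TAG_RE.findall(source)
    return TAG_RE.sub(SPECIAL, source), tags

def unmask_tags(mt_output, tags):
    """Re-insert the original tags into the MT output, in source order."""
    for tag in tags:
        mt_output = mt_output.replace(SPECIAL, tag, 1)
    return mt_output

masked, tags = mask_tags("Joey is a <b>good</b> dog!")
# masked == "Joey is a >>tag<<good>>tag<< dog!", tags == ["<b>", "</b>"]
restored = unmask_tags("¡Joey es un >>tag<<buen>>tag<< perro!", tags)
# restored == "¡Joey es un <b>buen</b> perro!"
```

Note this assumes the model emits the same number of placeholders as the source contained; a real implementation would need a fallback when it does not.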

Hope this helps.

Kind regards,


Thanks for the response @ymoslem! Any suggestions for generating training data?

There currently isn’t much valid tag data in the OPUS data I’ve looked at. My best idea so far is to try to combine OPUS data with Wiktionary definitions to generate tag data with a tag around a single word:


Joey is scared of swimming.
Joey tiene miedo de nadar.

swim -> nadar

Generated data:
Joey is scared of <x>swimming</x>.
Joey tiene miedo de <x>nadar</x>.
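
A minimal sketch of this dictionary-based injection; the `inject_tags` helper is hypothetical, and its naive prefix matching (so "swim" catches "swimming") stands in for the stemming and frequency filtering a real pipeline would need:

```python
import re

def inject_tags(src, tgt, dictionary):
    """Wrap the first dictionary entry found in both sentences with <x>...</x>.

    `dictionary` maps a source word to its target translation; returns None
    when no entry matches both sides.
    """
    for s_word, t_word in dictionary.items():
        # Prefix match so inflected forms like "swimming" are caught too
        s_m = re.search(rf"\b{re.escape(s_word)}\w*", src)
        t_m = re.search(rf"\b{re.escape(t_word)}\w*", tgt)
        if s_m and t_m:
            s_tagged = src[:s_m.start()] + "<x>" + s_m.group(0) + "</x>" + src[s_m.end():]
            t_tagged = tgt[:t_m.start()] + "<x>" + t_m.group(0) + "</x>" + tgt[t_m.end():]
            return s_tagged, t_tagged
    return None

pair = inject_tags("Joey is scared of swimming.",
                   "Joey tiene miedo de nadar.",
                   {"swim": "nadar"})
# pair == ("Joey is scared of <x>swimming</x>.",
#          "Joey tiene miedo de <x>nadar</x>.")
```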

I think I also want some data with more than a single word in it though. There’s probably a way to do this with something like LASER but that would be pretty involved.



Hi P.J.,

I agree that the approach introduced by Hanneman and Dinu (2020) is worth trying. More explanation of it can be found here.

I do not really think having one word vs. multiple words is a big deal. You can just make the span length random to teach the model how to use the tags. You can still add the closing tag one or two words after the dictionary-defined word, regardless of whether those extra words match the translation or not.

One thing to note: your special token must be unique, separated from the word(s), AND must not be sub-tokenized.

Kind regards,


I had missed section 3.1 of the linked paper; Pierro pointed out that it describes using matching sub-translations to generate tag data.


My experience as a user is that heavily tagged files such as DITA or HTML are a pain for MT models. You cannot process them as they are. You need a parser to extract the translatable information, and then probably another inline parser for that translatable information (which you will use to feed an MT system). You can probably find a parser for DITA files, but I’m not so sure about complex HTML files. And then there is the proposed tag reduction into tag families for training (as a variable substitution, I guess).

I have seen models that just treat tags as words, models with basic approaches (probably like yours), and models that are able to perform very clever tag manipulation (for example, changing tag order or even translating translatable text inside the tags, such as alt="XXXX"). But again, all of these were in very specific contexts (translation tools with custom parsers and internal corpora).

You could probably emulate tag behavior with quotes or parentheses in a non-tagged corpus; there are usually many instances of those.

But I guess the usual source of tagged data is translation memories (TMX) built from HTML or DITA segments. Some companies probably have hundreds of millions of words (and the money) to play with in their documentation (DITA, HTML, or Markdown).

What I also see is that the translation industry is trying to move as much as it can to non-tagged translation, for several reasons: not only the MT itself, but also the tools used (i.e. parsers and translation editors).

Well, I probably did not help you. 🙂
Have a nice day!


Just came across this dataset and paper:


Thanks! This looks like it’s probably higher quality than what I was getting with tag injection.

For Argos Translate, I’ve decided for now to go with tag injection at inference (code, video), or to just give up the context between tags. I agree with @miguelknals: until there are language models that can translate entire documents at once, it’s probably easiest to translate documents in context and not try to do complex parsing.

If you still need data augmentation suggestions…

I had to do something similar, but not for HTML tags. I just wanted to support having something preserved as-is when placed between two tags.

I used punctuation. I kept all sentence pairs where source and target had exactly the same punctuation marks in the same order, and I replaced all the punctuation with my tags on both sides.

You might be more limited for some languages that don’t have all the punctuation signs, but it’s up to you to pick which punctuation signs you want to use.

It seems to be working perfectly so far.

And in my case, I used one tag, not two (one for opening and one for closing). I just replace each of these tags, in order, with whatever I want them to stand for. This has the benefit of supporting both cases: a single word you want to keep as-is, which you replace with just one tag, or a part of a sentence between two tags, which you then replace with the original opening/closing tags.
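
A sketch of this punctuation-masking filter, assuming a single `>>tag<<` placeholder and a hand-picked set of punctuation marks (both are assumptions, not the poster’s exact choices):

```python
import re

TAG = ">>tag<<"                            # single placeholder token (assumption)
PUNCT_RE = re.compile(r'[.,;:!?()"«»¡¿]')  # hand-picked punctuation set

def mask_pair(src, tgt):
    """Keep a sentence pair only if both sides contain exactly the same
    punctuation marks in the same order, then mask each mark with TAG."""
    pattern = PUNCT_RE.findall(src)
    if not pattern or pattern != PUNCT_RE.findall(tgt):
        return None
    return PUNCT_RE.sub(TAG, src), PUNCT_RE.sub(TAG, tgt)

pair = mask_pair("Joey likes (dry) dog food.", "A Joey le gusta la comida (seca).")
# pair == ("Joey likes >>tag<<dry>>tag<< dog food>>tag<<",
#          "A Joey le gusta la comida >>tag<<seca>>tag<<>>tag<<")
rejected = mask_pair("Joey, sit!", "¡Joey, siéntate!")  # ¡ only on one side -> None
```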

EDIT: Just in case, you may want to add an additional feature: if the model fails to produce the same number of tags, translate that specific sentence in chunks. I guess this should be rare, though… I haven’t personally tried to see whether this can happen after my training, as it was a “nice to have” in my case.

Best regards,
