I’m looking into adding XML support to Argos Translate (#23).
The difficult part is that tags in the source sentence need to be placed correctly in the target sentence, for example:
Joey is a <b>good</b> dog!
¡Joey es un <b>buen</b> perro!
This clearly needs to be done by the seq2seq model because words within a tag need to be translated in the context of the surrounding words.
I’ve tried writing some code to normalize tags in the input dataset into a standard format:
Joey likes chasing <i>Sylvìlagus floridanus</i> and <i>Marmota monax</i>.
Joey likes chasing <x>Sylvìlagus floridanus</x> and <x>Marmota monax</x>.

Joey likes <a href="https://www.wegmans.com">store brand dog food</a>.
Joey likes <x>store brand dog food</x>.
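For reference, here is a rough sketch of what that normalization step could look like. It is not the actual Argos Translate code: the function name `normalize_tags`, the regex, and the choice to return the original tags alongside the text are all assumptions made for illustration. It only handles paired opening/closing tags, not self-closing ones like `<br/>`.

```python
import re

# Matches an opening or closing tag such as <i>, </i>, or <a href="...">.
# Self-closing tags (e.g. <br/>) are deliberately not handled in this sketch.
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:-]*)(?:\s[^<>]*)?>")


def normalize_tags(text, placeholder="x"):
    """Replace every tag with <x> or </x> and return the originals in order.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    original_tags = []

    def _swap(match):
        # Remember the original tag so it could be put back after translation.
        original_tags.append(match.group(0))
        # group(1) is "/" for a closing tag and "" for an opening tag.
        return f"<{match.group(1)}{placeholder}>"

    return TAG_RE.sub(_swap, text), original_tags


text, tags = normalize_tags(
    'Joey likes <a href="https://www.wegmans.com">store brand dog food</a>.'
)
print(text)
print(tags)
```

Keeping the original tags in a list means attributes such as `href` never have to pass through the model at all.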
Then at inference I could use the standardized tags in the model output to put the original tags back in the right places. The issue with this is that most of the data I’m currently using for Argos Translate contains only a handful of sentences with tags, which is likely not enough for the model to learn where the tags belong.
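The inference side could be sketched roughly like this. It assumes the model emits the same number of `<x>`/`</x>` placeholders as the source, in the same order, which will not always hold when the translation reorders words; `restore_tags` and `original_tags` (the list of source tags saved before normalization) are hypothetical names, not part of Argos Translate.

```python
import re

# Matches the standardized placeholders in the translated text.
PLACEHOLDER_RE = re.compile(r"</?x>")


def restore_tags(translated, original_tags):
    """Swap the n-th placeholder in the output for the n-th original tag.

    Assumption: the model kept the tag count and order unchanged.
    """
    found = PLACEHOLDER_RE.findall(translated)
    if len(found) != len(original_tags):
        raise ValueError("translated text has a different number of tags")
    for placeholder, original in zip(found, original_tags):
        # An opening placeholder must line up with an opening original tag.
        if placeholder.startswith("</") != original.startswith("</"):
            raise ValueError("opening/closing tags are out of order")
    tags = iter(original_tags)
    return PLACEHOLDER_RE.sub(lambda match: next(tags), translated)


print(restore_tags("¡Joey es un <x>buen</x> perro!", ["<b>", "</b>"]))
```

When the checks fail, a caller could fall back to translating the text with the tags stripped rather than returning broken markup.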
My current plan is to try to find or generate more data in this format, but any suggestions for better strategies are greatly appreciated!