Restoring source formatting

(Sergey Zhitansky) #1


I have some experience with the Moses engine, where we used the Okapi Tikal tool to restore source formatting.
It works by replacing document fragments with special tags. For example:

<bold>word</bold> has to be transformed to <g id="1">word</g>

Moses simply passes these tags through to the output. After translation, the formatting can be fully restored from these tags.
It works pretty well for OpenOffice formats.
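
The placeholder idea above can be sketched roughly as follows. This is only an illustration of the round trip, not how Okapi Tikal is actually implemented: the function names `to_placeholders` and `from_placeholders` are hypothetical, and the sketch assumes simple paired tags without attributes or nesting.

```python
import re

# Simple paired inline tag such as <bold>word</bold> (assumption: no attributes, no nesting).
PAIRED_TAG = re.compile(r"<(\w+)>(.*?)</\1>")
# Placeholder tag in the <g id="N">...</g> form shown above.
G_TAG = re.compile(r'<g id="(\d+)">(.*?)</g>')


def to_placeholders(text):
    """Replace each paired tag with <g id="N">...</g> and remember the original tag name."""
    tag_names = {}

    def repl(match):
        tag_id = len(tag_names) + 1
        tag_names[tag_id] = match.group(1)
        return '<g id="%d">%s</g>' % (tag_id, match.group(2))

    return PAIRED_TAG.sub(repl, text), tag_names


def from_placeholders(text, tag_names):
    """Turn <g id="N">...</g> back into the original tag recorded under id N."""

    def repl(match):
        name = tag_names[int(match.group(1))]
        return "<%s>%s</%s>" % (name, match.group(2), name)

    return G_TAG.sub(repl, text)


tagged, names = to_placeholders("a <bold>word</bold> here")
# 'tagged' is what would be sent to the MT engine; the translation below is made up.
restored = from_placeholders('ein <g id="1">Wort</g> hier', names)
```

The restore step only works if the engine returns every `<g>` pair intact, which is exactly what breaks with the neural model.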

I understand how a neural network works, but does anyone have an idea how this could work with one, as described above?
I’ve tried using Okapi Tikal with tags, but in many places the tags are broken or missing after translation.

Many thanks!

(Csaba Oravecz) #2

One possibility is to do this as a pre- and postprocessing step, without passing the markup to the network at all. You remove the markup from the source before passing it to the decoder and reinsert it on the target side after decoding. You can do that with tools from m4loc: one of them takes a tokenized source segment with inline markup and the tokenized target, and reinserts the markup into the target on the basis of the alignment between the source and target segments. This alignment can be computed on the fly using a tool from the fast_align package (you will also need atools and fast_align itself). For this to work, you have to train forward and reverse alignment models from your training data (tokenized and without markup, the same data you would use to train your network). You will find detailed usage info in the scripts. After reinsertion, you can use a tool from m4loc to fix some whitespace issues.
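
To make the reinsertion step more concrete, here is a minimal sketch of the idea. It is not the m4loc implementation. It assumes the alignment comes as "i-j" pairs (source index, then target index, the format fast_align prints), that tags are separate tokens, and that tags are paired and well nested. All function names are hypothetical.

```python
def parse_alignment(line):
    """Parse 'i-j' pairs (source index - target index) into {source: [targets]}."""
    mapping = {}
    for pair in line.split():
        src, tgt = pair.split("-")
        mapping.setdefault(int(src), []).append(int(tgt))
    return mapping


def strip_markup(tokens):
    """Return the plain words plus (open_tag, close_tag, covered word indices) per tag pair."""
    words, spans, stack = [], [], []
    for tok in tokens:
        if tok.startswith("</"):
            open_tag, start = stack.pop()
            spans.append((open_tag, tok, list(range(start, len(words)))))
        elif tok.startswith("<"):
            stack.append((tok, len(words)))
        else:
            words.append(tok)
    return words, spans


def reinsert_markup(src_tokens, tgt_tokens, alignment_line):
    """Wrap the target tokens aligned to each tagged source span in the same tag pair."""
    _, spans = strip_markup(src_tokens)
    mapping = parse_alignment(alignment_line)
    before, after = {}, {}  # target index -> tags emitted before / after that token
    for open_tag, close_tag, covered in spans:  # inner spans come first
        targets = [t for s in covered for t in mapping.get(s, [])]
        if not targets:
            continue  # unaligned span: this sketch simply drops the tag
        before.setdefault(min(targets), []).insert(0, open_tag)
        after.setdefault(max(targets), []).append(close_tag)
    out = []
    for i, tok in enumerate(tgt_tokens):
        out.extend(before.get(i, []))
        out.append(tok)
        out.extend(after.get(i, []))
    return out


src = ["click", '<g id="1">', "the", "button", "</g>", "now"]
tgt = ["jetzt", "den", "Knopf", "anklicken"]
result = reinsert_markup(src, tgt, "0-3 1-1 2-2 3-0")
```

The sketch wraps the smallest contiguous target range that covers all aligned words, so a tagged phrase that is split up by reordering will pull in the words in between. A real tool needs a policy for that case and for unaligned spans.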

Perhaps you can even use the soft alignment from the decoder itself; I don’t know which would work better for this task.
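
If the soft alignment is used, one simple way to turn it into a hard alignment is to pick, for each target token, the source token with the highest attention weight. The sketch below assumes the weights are already available as a list of rows, one row per target token and one weight per source token. How to get them out of a particular toolkit is not shown, and the function name is hypothetical.

```python
def attention_to_alignment(attention):
    """Convert attention[t][s] weights into 's-t' pairs by taking the argmax over s for each t."""
    pairs = []
    for t, row in enumerate(attention):
        s = max(range(len(row)), key=row.__getitem__)
        pairs.append("%d-%d" % (s, t))
    return " ".join(pairs)


# Two target tokens attending over two source tokens (made-up weights).
alignment = attention_to_alignment([[0.1, 0.9], [0.8, 0.2]])
```

The output uses the same source-target "i-j" pair notation that word aligners print, so it could feed the same kind of reinsertion step. Subword units would first have to be mapped back to word positions.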

(Sergey Zhitansky) #3

Thanks for the idea!
