Restoring source formatting

Hello.

I have experience with the Moses engine, where we used the Okapi Tikal tool to restore source formatting.
It works by replacing document fragments with special tags. For example:

<bold>word</bold> has to be transformed to <g id="1">word</g>

Moses simply leaves these tags in the output. After translation, the formatting can be fully restored from these tags.
It works pretty well for OpenOffice formats.

I understand how a neural network works, but does anyone have an idea how formatting could be restored with it in the way described above?
I’ve tried to use Okapi Tikal with tags, but in many places the tags are broken or missing after translation.

Many thanks!

One possibility is to handle this as a pre- and postprocessing step and not pass the markup to the network at all: remove the markup from the source before passing it to the decoder, then reinsert it on the target side after decoding. You can do that with tools from m4loc such as reinsert_wordalign.pm, which takes a tokenized source segment with inline markup plus the tokenized target and reinserts the markup into the target based on the word alignment between the source and target segments. The alignment can be computed on the fly using force_align.py from the fast_align package (you will also need atools and fast_align itself). For this to work, you have to train forward and reverse alignment models on your training data at training time (tokenized and without markup, the same data you would use to train your network). You will find detailed usage info in the scripts. After reinsertion you can use fix_markup_ws.pm from m4loc to fix some whitespace issues.
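
To make the reinsertion idea more concrete, here is a rough Python sketch. It is not the algorithm used by reinsert_wordalign.pm; it is a simplified illustration that assumes tags are already separate tokens in the source, that the tag inventory is g/x/bx/ex/lb, and that the alignment is given as (source index, target index) pairs over the tag-free tokens. The function name reinsert_tags is hypothetical.

```python
import re
from collections import defaultdict

# Assumed inline tag inventory (g, x, bx, ex, lb); adjust to what your files contain.
TAG = re.compile(r"</?(?:g|x|bx|ex|lb)\b[^>]*>")

def reinsert_tags(tagged_source, target, alignment):
    """Project inline tags from source tokens onto target tokens.

    tagged_source: source tokens, with each inline tag as its own token
    target:        plain target tokens
    alignment:     (src_idx, tgt_idx) pairs over the tag-free tokens
    Limitation: nothing prevents crossing tags when sibling spans map to
    overlapping target ranges, so the output may still need repair.
    """
    src_to_tgt = defaultdict(list)
    for s, t in alignment:
        src_to_tgt[s].append(t)

    before = defaultdict(list)  # tags emitted before target token j
    after = defaultdict(list)   # tags emitted after target token j
    open_stack = []             # (opening tag, plain index where its span starts)
    plain_idx = 0               # index of the next tag-free source token
    end = len(target)           # fallback position: end of the segment

    for tok in tagged_source:
        if not TAG.fullmatch(tok):
            plain_idx += 1
        elif tok.startswith("</"):
            opener, start = open_stack.pop()
            covered = [t for s in range(start, plain_idx) for t in src_to_tgt[s]]
            if covered:
                # Wrap the smallest target range covering all aligned tokens.
                before[min(covered)].insert(0, opener)
                after[max(covered)].append(tok)
            else:
                # Nothing aligned: keep the empty pair at the end of the segment.
                before[end].extend([opener, tok])
        elif tok.endswith("/>"):
            # Standalone tag: place it before the token aligned to the next source token.
            aligned = src_to_tgt[plain_idx]
            before[min(aligned) if aligned else end].append(tok)
        else:
            open_stack.append((tok, plain_idx))

    out = []
    for j, tok in enumerate(target):
        out.extend(before[j])
        out.append(tok)
        out.extend(after[j])
    out.extend(before[end])
    return out
```

Crossing or misplaced tags are exactly where such a projection tends to fail when the alignment is noisy, so some cleanup after reinsertion is usually still needed.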

Perhaps you can even use the soft alignment from the decoder itself; I don’t know which would work better for this task.
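
If you try the soft-alignment route, one simple way to get hard links out of attention weights is to pick, for each target token, the source token with the highest weight. The sketch below assumes you can obtain a target-by-source matrix of weights from the decoder; the function names are hypothetical and not part of any OpenNMT API. The output uses the "i-j" pair notation that fast_align also prints.

```python
def hard_alignment_from_attention(attention):
    """attention[t][s] is the weight of source token s when target token t
    was generated. Returns (src_idx, tgt_idx) pairs, one per target token."""
    pairs = []
    for t, row in enumerate(attention):
        # Argmax over source positions; ties resolve to the leftmost one.
        s = max(range(len(row)), key=row.__getitem__)
        pairs.append((s, t))
    return pairs

def to_pharaoh(pairs):
    """Format pairs as 'src-tgt' links separated by spaces."""
    return " ".join(f"{s}-{t}" for s, t in pairs)

# Toy example with 3 target tokens (rows) and 3 source tokens (columns).
weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.6, 0.2],
]
print(to_pharaoh(hard_alignment_from_attention(weights)))  # prints "0-0 2-1 1-2"
```

With subword units you would also have to map the links back to word level before reinserting tags, which this sketch does not cover.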

Thanks for the idea!

Hi there,
Any clues on how the source and target sentences should be formatted? I can’t find any useful information in m4loc on what these sentences should look like or how to handle inline tags. I would appreciate a concrete example.
Thanks

For example, you have a source file in docx format. You can use tikal to convert it into raw text with formatting stored in inline tags:
tikal.sh -fc okf_openxml -xm -seg segmentation_rules.srx source.docx -to source.mos

This way you get segments like this:

<x id="1"/><x id="2"/>In general and, more specifically,... enteritis due to <g id="3">Yersinia enterocolitica</g> or <g id="4">Yersinia pseudotuberculosis </g>and healthcare<x id="5"/>associated infections.

You send the content of the source.mos file to the decoder after removing the markup and reinsert the markup on the target side (target.mos file):

<x id="1"/><x id="2"/>En général et plus spécifiquement, l’entérite due à <g id="3">Yersinia enterocolitica</g> ou <g id="4">Yersinia pseudotuberculosis </g>et les infections<x id="5"/> associées aux soins.

Then you can use tikal again to convert the target into docx with the formatting:
tikal.sh -lm source.docx -fc okf_openxml -sl en -ie utf8 -oe utf8 -overtrg -from target.mos -seg segmentation_rules.srx

Thanks! How do I send source text to the decoder? Just removing all tags and sending all text together, or sending chunks of text (for instance, sentences)?

One segment/sentence per line, either as an input file or stdin, depending on the decoder. OpenNMT translate.lua takes it as a file argument of -src (I think).
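
As a rough sketch of that preparation step: strip the inline tags from each segment of source.mos and write one plain segment per line. The tag inventory (g, x, bx, ex, lb) and the choice to replace a tag with a space are assumptions; the function names and the output file name are made up for the example. Tokenization would still come after this.

```python
import re

# Assumed inline tag inventory of the .mos file; adjust if yours differs.
INLINE_TAG = re.compile(r"</?(?:g|x|bx|ex|lb)\b[^>]*>")

def strip_inline_tags(segment):
    """Replace every inline tag with a space, then collapse whitespace.
    Using a space avoids gluing together words separated only by a tag."""
    return " ".join(INLINE_TAG.sub(" ", segment).split())

def prepare_decoder_input(mos_path, out_path):
    """Write one tag-free segment per line, in the same order as the input."""
    with open(mos_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as out:
        for line in src:
            out.write(strip_inline_tags(line) + "\n")

# Example call (file names are placeholders):
# prepare_decoder_input("source.mos", "source.txt")
```

Keep the original tagged lines around as well, since the reinsertion step needs them together with the translation.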

It’s clear now. Thanks for your time!

Hi,
One last thing! You said:

This latter can be computed on the fly using force_align.py from the fast_align package (you will also need atools and fast_align itself). For this to work, during training you have to train forward and reverse alignment from your training data (tokenized and without markup, the same as you would use to train your network).

How do I get this to work? I guess I need the parallel corpus to train the model. How do I get the alignment of a given new sentence after fast_align has been trained? force_align.py asks for fwd_params and rev_params but I don’t know what they mean.

Thanks

You’ll get these files after you train with fast_align, one run for fwd_params and one for rev_params, e.g.:
fast_align -i <input> -d -v -o -p fwd_params > fwd_align
fast_align -i <input> -r -d -v -o -p rev_params > rev_align
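
The <input> file is the parallel corpus in fast_align's format, one sentence pair per line as "source ||| target", both sides tokenized. A small sketch for building it from two aligned lists of lines; the function name is hypothetical, and skipping pairs with an empty side is a precaution I am assuming is wanted, not something taken from the fast_align docs.

```python
def fast_align_lines(src_lines, tgt_lines):
    """Yield 'source ||| target' lines for fast_align from two parallel
    sequences of already tokenized sentences."""
    for src, tgt in zip(src_lines, tgt_lines):
        # Normalize whitespace so tokens are separated by single spaces.
        src = " ".join(src.split())
        tgt = " ".join(tgt.split())
        if not src or not tgt:
            continue  # assumed precaution: drop pairs with an empty side
        yield f"{src} ||| {tgt}"

# Example use (file names are placeholders):
# with open("train.en") as s, open("train.fr") as t, open("corpus.en-fr", "w") as out:
#     for line in fast_align_lines(s, t):
#         out.write(line + "\n")
```

At translation time you would feed force_align.py lines in the same format, with the tag-free tokenized source on the left and the tokenized decoder output on the right.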

tikal.sh -fc okf_openxml -xm -seg segmentation_rules.srx source.docx -to source.mos

Which segmentation rules are used here?

There are sample rulesets in the config directory of the tikal bundle, or you can write your own.

Thanks for your time!

Hi,
Are there any other tools or ideas for reinserting inline tags apart from m4loc? m4loc seems to misplace some tags, which breaks tikal when it rebuilds the document.