Creating phrase tables out of alignment output for translating UNKs

"Alternatively, advanced users may prefer to provide a preconstructed phrase table from an external aligner (such as fast_align) using the -phrase_table […]

The phrase table is a file with one translation per line in the format:

source ||| target "

How can I actually convert alignment output such as 0-0 1-1 2-4 3-2 4-3 5-5 6-6 into the source ||| target format?


Hi mehmedes,

There are many ways one can create phrase tables.
Given the source-to-target word alignment, the simplest approach is to use the most frequent translation (i.e. given a source word, replace it with the target word it is most frequently aligned to).
For this, first extract all translation pairs from your alignment.
Each pair has the form source index-target index, counted from zero: ‘0-0’ means the first token in the source sentence is aligned to the first token in the target sentence, ‘2-4’ means the third source token is aligned to the fifth target token, and so on.
Afterwards, for each source word, find the target word that was most frequently aligned to it and write the two as one line of the phrase table: source ||| target.
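The steps above could be sketched roughly as follows. This is only an illustration, not an official OpenNMT script. It assumes the parallel text is in the "source sentence ||| target sentence" layout that fast_align reads, that the alignment was produced in the forward direction (first index = source token, second = target token), and that both inputs have one sentence per line. The function names and the placeholder tokens s0…s6 / t0…t6 are made up for the example.

```python
from collections import Counter, defaultdict


def count_aligned_pairs(parallel_lines, alignment_lines):
    """Count how often each source word is aligned to each target word."""
    counts = defaultdict(Counter)
    for text, links in zip(parallel_lines, alignment_lines):
        # Assumed layout: "source sentence ||| target sentence".
        source, target = (side.split() for side in text.strip().split(" ||| ", 1))
        for link in links.split():
            # Each link is "i-j": zero-based source index, then target index.
            i, j = map(int, link.split("-"))
            counts[source[i]][target[j]] += 1
    return counts


def most_frequent_translations(counts):
    """Keep, for every source word, the target word aligned to it most often."""
    return {src: targets.most_common(1)[0][0] for src, targets in counts.items()}


def phrase_table_lines(table):
    """Format the table as 'source ||| target' lines, one pair per line."""
    return [f"{src} ||| {tgt}" for src, tgt in table.items()]


if __name__ == "__main__":
    # Placeholder sentence pair combined with the alignment from the question.
    parallel = ["s0 s1 s2 s3 s4 s5 s6 ||| t0 t1 t2 t3 t4 t5 t6"]
    alignments = ["0-0 1-1 2-4 3-2 4-3 5-5 6-6"]
    table = most_frequent_translations(count_aligned_pairs(parallel, alignments))
    print("\n".join(phrase_table_lines(table)))
```

On real data you would read the two inputs from your parallel corpus file and the aligner's output file, write the resulting lines to a text file, and pass that file to -phrase_table. Source words that never receive an alignment link simply do not appear in the table.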

Does this answer your question?
Hopefully, OpenNMT can include a short script that can automate this process soon.

Dear OpenNMT Team,

Could you provide us with a script to convert fast_align output into phrase tables that can be used during the decoding process to translate UNKs?

