Dear Julius,
I think what you are referring to is a word aligner.
- You use one of these tools (fast_align, eflomal, or efmaral) to generate word alignments in the Pharaoh format.
- You use a Python script like this one here to generate phrases. Note that alignment takes the Pharaoh alignment, but you have to convert it to a list of tuples first. Also, one small edit: in Python 3, print() needs parentheses, or you could write the output to a file instead.
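In case it is useful, the conversion from a Pharaoh line to a list of tuples could look roughly like this. It is only a minimal sketch: parse_pharaoh is a name I made up for illustration, and I am assuming the usual zero-based "source-target" pairs separated by spaces, one sentence per line.

```python
def parse_pharaoh(line):
    """Convert one Pharaoh-format line, e.g. '0-0 1-1 3-2', to a list of (src, trg) tuples."""
    pairs = []
    for token in line.split():
        # Each token is assumed to look like 'i-j': source index, dash, target index.
        src, trg = token.split("-")
        pairs.append((int(src), int(trg)))
    return pairs


# Hypothetical aligner output for a single sentence pair.
alignment = parse_pharaoh("0-0 1-1 3-2 6-3 5-4")
print(alignment)  # [(0, 0), (1, 1), (3, 2), (6, 3), (5, 4)]
```

For a whole alignment file, you would call this once per line and keep the results in the same order as your sentence pairs.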
Here is an example of the input and output you can expect:
srctext = "etroit dans la plupart des pays africains"
trgtext = "narrow in most african countries"
alignment = [(0, 0), (1, 1), (3, 2), (6, 3), (5, 4)]  # from eflomal
( 1) (0, 1) etroit — narrow
( 2) (0, 2) etroit dans — narrow in
( 3) (0, 3) etroit dans la — narrow in
( 4) (0, 4) etroit dans la plupart — narrow in most
( 5) (0, 5) etroit dans la plupart des — narrow in most
( 6) (0, 7) etroit dans la plupart des pays africains — narrow in most african countries
( 7) (1, 2) dans — in
( 8) (1, 3) dans la — in
( 9) (1, 4) dans la plupart — in most
(10) (1, 5) dans la plupart des — in most
(11) (1, 7) dans la plupart des pays africains — in most african countries
(12) (2, 4) la plupart — most
(13) (2, 5) la plupart des — most
(14) (2, 7) la plupart des pays africains — most african countries
(15) (3, 4) plupart — most
(16) (3, 5) plupart des — most
(17) (3, 7) plupart des pays africains — most african countries
(18) (4, 6) des pays — countries
(19) (4, 7) des pays africains — african countries
(20) (5, 6) pays — countries
(21) (5, 7) pays africains — african countries
(22) (6, 7) africains — african
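If you want to see the idea behind this output, here is a minimal sketch of consistent phrase-pair extraction. It is not the linked script: extract_phrases is a hypothetical name, and I have simplified things by not extending the target side over unaligned target words (this example has none, so it does not matter here). A source span is kept only if every target word it covers is aligned back inside that same span.

```python
def extract_phrases(srctext, trgtext, alignment):
    """Return ((start, end), source phrase, target phrase) for every consistent source span."""
    src = srctext.split()
    trg = trgtext.split()
    phrases = []
    for s_start in range(len(src)):
        for s_end in range(s_start, len(src)):
            # Target positions linked to any source word inside the span.
            linked = [t for s, t in alignment if s_start <= s <= s_end]
            if not linked:
                continue  # a span without any aligned word yields no phrase
            t_start, t_end = min(linked), max(linked)
            # Consistency check: no target word in the span may link outside the source span.
            consistent = all(s_start <= s <= s_end
                             for s, t in alignment if t_start <= t <= t_end)
            if consistent:
                phrases.append(((s_start, s_end + 1),
                                " ".join(src[s_start:s_end + 1]),
                                " ".join(trg[t_start:t_end + 1])))
    return phrases


srctext = "etroit dans la plupart des pays africains"
trgtext = "narrow in most african countries"
alignment = [(0, 0), (1, 1), (3, 2), (6, 3), (5, 4)]

phrases = extract_phrases(srctext, trgtext, alignment)
for i, (span, src_phrase, trg_phrase) in enumerate(phrases, 1):
    # '|' is just the separator chosen for this sketch.
    print(f"({i:2}) {span} {src_phrase} | {trg_phrase}")
```

With the inputs above this gives 22 phrase pairs. Spans such as (0, 6) are dropped because "african" would fall inside the target span while its source word "africains" is outside the source span.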
Well, OpenNMT itself uses Neural Machine Translation; Statistical (Phrase-based) Machine Translation, however, depends on word aligners like these (the well-known GIZA++ is another example).
Note that OpenNMT-py and OpenNMT-tf can still generate these alignments at translation time, after training, with extra options (see report_align and with_alignments), but specialized word aligners should be more accurate.
I hope this helps!
Kind regards,
Yasmin