How to create a phrase table?

sunshine6789 · September 4, 2017, 8:32am

Hello, I want to use the -phrase_table option when translating, but I don’t know how to create a phrase table. Will it use PhraseTable.lua? Can you tell me the method? Thanks!

tel34 · September 4, 2017, 1:11pm

Hi,
Currently the phrase table takes the form of a text file with a single source token and a single target token on each line, separated by |||, e.g.
steekvarken|||hedgehog
Make sure you don’t have any empty lines in your file, otherwise PhraseTable.lua will fail.
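
A quick way to catch such problems before translating is to check the file yourself. The following is only a minimal sketch in Python, not part of OpenNMT: the function name `check_phrase_table` is made up for illustration, and it simply assumes the format described above (one `|||` separator, one token on each side, no empty lines).

```python
def check_phrase_table(lines):
    """Return (line_number, problem) pairs for lines that break the assumed format.

    Assumed format (from the description above): source|||target,
    one token on each side, no empty lines.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text.strip():
            problems.append((number, "empty line"))
            continue
        parts = text.split("|||")
        if len(parts) != 2:
            problems.append((number, "expected exactly one ||| separator"))
            continue
        source, target = (part.strip() for part in parts)
        # Each side must be exactly one whitespace-free token.
        if len(source.split()) != 1 or len(target.split()) != 1:
            problems.append((number, "source and target must each be a single token"))
    return problems
```

To check a file, you could call it as `check_phrase_table(open("phrase_table.txt", encoding="utf-8"))`, where the file name is just a placeholder.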

sunshine6789 · September 6, 2017, 8:24am

Thanks for your answer. I understand that a phrase table is a file with one translation per line in the format “source ||| target”, but what I got is “0-0 0-1 1-1 1-2 2-2 3-3 4-4 4-5 5-4 5-6 6-7”. How can I transform “0-0 0-1 1-1 1-2 2-2 3-3 4-4 4-5 5-4 5-6 6-7” into source ||| target? Thanks!

emartinezVic · September 6, 2017, 8:45am

It looks like you are looking at the alignments file (from GIZA++ or fast_align, the “aligned.grow-diag-final-and” file) rather than a (Moses) phrase table, whose rows usually look like this:

phrase source ||| phrase target ||| scores from different features ||| phrase alignment ||| ---- ||| |||
! ! ! Really ? ||| ! ! ! really ? ||| 1 1.36967e-06 1 0.00692579 ||| 0-0 2-0 0-1 1-1 1-2 3-3 4-4 ||| 1 1 1 ||| |||

So you should create your own text file, as @tel34 said above and as described here:
http://opennmt.net/OpenNMT/translation/unknowns/
Note that it says:

Where source and target are case sensitive and single tokens.

That is, each line of your text file should contain only one source word and its corresponding target word.

A good starting point might be to build your text file from the lex.e2f file of your Moses model, whose lines look like “source_word target_word probability”, and rewrite each one as “source_word ||| target_word”.
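
That conversion could be sketched roughly as follows in Python. This is an illustration under assumptions, not an official tool: the function name `lex_to_phrase_table` is hypothetical, the column order is taken to be exactly “source_word target_word probability” as described above (check which direction your own lex file uses), entries containing the `NULL` token are skipped, and keeping only the most probable target per source word is my own choice. The output uses `|||` without surrounding spaces, following @tel34's example.

```python
def lex_to_phrase_table(lex_lines):
    """Turn lexical-table lines into 'source|||target' lines.

    Assumes each input line is 'source_word target_word probability'.
    Keeps only the highest-probability target for each source word.
    """
    best = {}  # source word -> (target word, probability)
    for line in lex_lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip empty or malformed lines
        source, target, prob = fields[0], fields[1], float(fields[2])
        if source == "NULL" or target == "NULL":
            continue  # assumption: NULL marks "aligned to nothing"
        if source not in best or prob > best[source][1]:
            best[source] = (target, prob)
    return [f"{source}|||{target}" for source, (target, _) in best.items()]
```

Writing the result with `"\n".join(...)` and no trailing blank line avoids the empty-line problem mentioned earlier in the thread.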

Good luck!

tel34 · September 6, 2017, 9:03am

Hi again,
Although the phrase table rule requires single, case-sensitive source & target tokens, to get around the
fact that there are so many compound nouns in Dutch (& other Germanic languages) I have introduced
the underscore to join up the English components, e.g.
toezichthoudersraad|||board_of_supervisors
The underscore is removed in a post-processing script outside OpenNMT.
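
Such a post-processing step can be very small. The following Python sketch is only an illustration of the idea, not the actual script: the function name `restore_spaces` is made up, and it assumes that every underscore in the output comes from the phrase table, so any legitimate underscore in the translation would be replaced as well.

```python
def restore_spaces(translation):
    """Replace the underscores used to join multi-word targets with spaces.

    Assumption: underscores only appear in the output because of
    phrase-table entries such as 'board_of_supervisors'.
    """
    return translation.replace("_", " ")
```

For example, `restore_spaces("the board_of_supervisors met")` gives `"the board of supervisors met"`.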
Terence