OpenNMT-tf - how to use alignments and phrase tables

mayub · November 5, 2018, 9:57pm

Hi
I’m using the OpenNMT-tf repository and could not find an extensive tutorial on how to create or use guided alignments or phrase tables when training the out-of-the-box models from catalog.py. All I managed to find was a small section on ‘Alignments’ on this page: http://opennmt.net/OpenNMT-tf/data.html
I’m really new to this, so here are a couple of questions:
What’s the best way to start generating and using alignments and/or phrase tables?
I have domain-specific phrases in both the source and target languages that I want the model to use. How do I add them to the training?

Any guidance is appreciated.

Sorry if this is a duplicate question, as I have not used Lua or other repos from OpenNMT before.

Thanks !

Mohammed Ayub

guillaumekln · November 6, 2018, 8:49am

Hi,

The alignment file can be produced with fast_align for example:
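fast_align reads a single file with one sentence pair per line, the tokenized source and target separated by ` ||| `. Below is a minimal sketch for building that file from two parallel tokenized files; the function name and the file names in the comments are hypothetical. It raises an error instead of silently dropping a pair, on the assumption that the alignment file has to stay line-aligned with the training files.

```python
from itertools import zip_longest


def to_fast_align_lines(src_lines, tgt_lines):
    """Yield 'source ||| target' lines in the input format read by fast_align.

    Raises ValueError rather than skipping a pair, so that line N of the
    resulting alignment file still corresponds to line N of the training data.
    """
    pairs = zip_longest(src_lines, tgt_lines)
    for number, (src, tgt) in enumerate(pairs, start=1):
        if src is None or tgt is None:
            raise ValueError(f"files differ in length at line {number}")
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            raise ValueError(f"empty sentence at line {number}")
        yield f"{src} ||| {tgt}"


# Hypothetical usage with tokenized files train.tok.en / train.tok.es:
#   with open("train.tok.en") as s, open("train.tok.es") as t, \
#           open("corpus.en-es", "w") as out:
#       for line in to_fast_align_lines(s, t):
#           out.write(line + "\n")
# The fast_align README then suggests a command of the form:
#   fast_align -i corpus.en-es -d -o -v > train-alignment.txt
```

The output of that command is in the Pharaoh format (`0-0 1-2 ...`, one line per sentence pair).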

Then, you just need to extend the configuration to enable guided alignment:

data:
  train_alignments: train-alignment.txt

params:
  guided_alignment_type: ce
  guided_alignment_weight: 1

Regarding phrase tables, the documentation does not mention this feature so that is something you need to handle on your side using the alignments that are returned by the model.

mayub · November 6, 2018, 2:41pm

Thanks @guillaumekln .

Where can I see all the available params associated with ‘train_alignments’, so I can use different types, etc.?

For phrase tables, if I’m reading it correctly:

This functionality exists in other repositories (with the ‘-phrase_table’ option) but it’s not yet implemented in OpenNMT-tf.
To return the alignment history, what params do I need to set (assuming I’m using one of the out-of-the-box models like Transformer / NMTBig etc. from catalog.py)?

Appreciate your help !

Mohammed Ayub

guillaumekln · November 6, 2018, 2:47pm

I’m not sure which params you are referring to. The guided_alignment_* options are the only ones that are related to train_alignments.

The alignment history is already returned by the model; how you get it depends on how you run the model:

For file translation, set the infer option with_alignments: hard to display alignments.
For a model server, “alignment” is already one of the returned fields.
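For post-processing on your side, here is a minimal sketch that splits a translated line into text and alignment links. It assumes each output line looks like `translation ||| 0-0 1-1` with Pharaoh-style `source-target` token indices; check this against the output of your version. The function names are hypothetical.

```python
def parse_pharaoh(alignment):
    """Turn a Pharaoh string such as '0-0 1-2' into [(0, 0), (1, 2)]."""
    pairs = []
    for link in alignment.split():
        src_index, tgt_index = link.split("-")
        pairs.append((int(src_index), int(tgt_index)))
    return pairs


def split_prediction(line):
    """Split 'text ||| alignment' into (text, links).

    Assumption: the alignment follows the last ' ||| ' separator.
    A line without the separator is returned with an empty link list.
    """
    text, separator, alignment = line.rstrip("\n").rpartition(" ||| ")
    if not separator:
        return alignment, []
    return text, parse_pharaoh(alignment)


def aligned_tokens(src_tokens, tgt_tokens, links):
    """Map each (source index, target index) link to the tokens it connects."""
    return [(src_tokens[s], tgt_tokens[t]) for s, t in links]
```

With the links in hand, a custom replacement step (for example, swapping in a domain-specific target phrase for an aligned source token) can be applied after translation.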

mayub · November 6, 2018, 3:07pm

Yes, I was referring to the same guided_alignment_* params. I was curious where in the project code base they are set, and whether there are any others apart from guided_alignment_type and guided_alignment_weight that I should be aware of.

For phrase tables,
the infer option with_alignments: hard makes sense for inference. I’m still confused about how to add the custom domain phrases that I have.

-Mohammed Ayub

guillaumekln · November 6, 2018, 3:09pm

The full list of params is here:

http://opennmt.net/OpenNMT-tf/configuration_reference.html

Are you looking to achieve domain adaptation? If so, this has been discussed several times on the forum.

mayub · November 6, 2018, 4:17pm

Thanks for the link.
Yes, I’m looking for domain adaptation (sorry if I did not mention it earlier).

I did find some really good posts in this forum on adaptive training, but they are all geared towards the OpenNMT-Lua or OpenNMT-py repositories. I could not find many for the OpenNMT-tf repo, so I just wanted to get thoughts on whether the below is possible:

1. Train a base model without any alignment file.
2. Fine-tune the above model (domain adaptation) using:
   a. Domain-specific data with an alignment file and a new vocabulary

For the fine-tuning of the model, I gather from the posts that the vocabulary cannot be changed (unless we use tricks like BPE etc.).

Also, does the alignment file need to be present for the base model too?

Thanks !

Mohammed Ayub

guillaumekln · November 6, 2018, 4:30pm

It’s the same approach for OpenNMT-tf. There are additional pointers in this section of the documentation, including how to change the vocabulary, which is actually possible:

http://opennmt.net/OpenNMT-tf/training.html#fine-tune-an-existing-model

Alignment files are not specifically related to domain adaptation. Where did you read that? They are mostly helpful if you are training Transformer models but still want to retrieve alignment information for additional postprocessing.

mayub · November 6, 2018, 4:51pm

Thanks. It makes more sense now.

No worries. It was a separate question, not related to domain adaptation.

I guess I can then add alignment files during the second step, fine-tuning (assuming I did not add them when training the base model in the first step), hoping that it will help the Transformer model capture domain structure better.

Mohammed Ayub

mayub · November 14, 2018, 2:43pm

@guillaumekln
A follow-up question:
I managed to create the alignment file in the Pharaoh format. fast_align needed tokenized files as input, so I ran the training datasets through the onmt-tokenize-text module using the configuration file below:
mode: aggressive
joiner_annotate: true
segment_numbers: true
segment_alphabet_change: true
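One sanity check worth running on the result: every index in the alignment file should point at an existing token in the corresponding tokenized sentence pair. A minimal sketch, assuming Pharaoh `source-target` links with 0-based indices and whitespace-separated tokens (the function name is hypothetical):

```python
def out_of_range_links(src_tokens, tgt_tokens, alignment):
    """Return the links whose indices fall outside the sentence pair.

    An empty result means the alignment line is consistent with the tokens.
    """
    bad = []
    for link in alignment.split():
        src_index, tgt_index = (int(i) for i in link.split("-"))
        src_ok = 0 <= src_index < len(src_tokens)
        tgt_ok = 0 <= tgt_index < len(tgt_tokens)
        if not (src_ok and tgt_ok):
            bad.append((src_index, tgt_index))
    return bad


# Hypothetical usage over three line-aligned files:
#   for src, tgt, ali in zip(src_file, tgt_file, alignment_file):
#       assert not out_of_range_links(src.split(), tgt.split(), ali)
```

If this reports out-of-range links, the alignment was probably computed on a different tokenization than the one being checked.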

I’m planning to run tokenization and BPE on the fly. My question is about the source and target vocabulary files that I need to pass in the training configuration as source_words_vocabulary and target_words_vocabulary: do I generate them from the raw files or from the tokenized, BPE-processed files? I’m using the onmt-build-vocab module.

Presently I have this:
onmt-build-vocab --tokenizer OpenNMTTokenizer --tokenizer_config /home/ubuntu/mukund_onmt/token_config_bpe.yml --size 50000 --save_vocab onmt_bpe_vocab/src_vocab.txt un_train_tokenized.en

token_config_bpe.yml looks like this:

mode: aggressive
joiner_annotate: true
segment_numbers: true
segment_alphabet_change: true
bpe_model_path: /home/ubuntu/mayub/datasets/in_use/un/merged_bpe_model/merged_en_es.bpe

Mohammed Ayub

guillaumekln · November 14, 2018, 3:00pm

You should run the build-vocab script on the tokenized files, the ones that will be used during the training.
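To illustrate why the input matters, here is a rough sketch of frequency-based vocabulary building. It is not the onmt-build-vocab implementation, only a whitespace-splitting stand-in with a hypothetical function name; the `￭` joiner in the example assumes joiner_annotate is enabled.

```python
from collections import Counter


def build_vocab(lines, size):
    """Return the `size` most frequent whitespace-separated tokens."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return [token for token, _ in counts.most_common(size)]


# The same sentences, raw and after tokenization with joiner annotation.
raw = ["Hello, world", "Hello, Hello again"]
tokenized = ["Hello ￭, world", "Hello ￭, Hello again"]

raw_vocab = build_vocab(raw, 50000)
tokenized_vocab = build_vocab(tokenized, 50000)
# "Hello," only exists in the raw vocabulary; the tokenized vocabulary
# holds "Hello" and "￭," as separate entries, matching what training sees.
```

A vocabulary built on raw text would contain entries such as `Hello,` that never appear once the tokenizer is applied, so most training tokens would fall outside it.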

mayub · November 14, 2018, 3:23pm

Thanks for the confirmation. I think I’m using the same to build the vocab.

Mohammed Ayub