Hi
I’m using the OpenNMT-tf repository. I could not find any extensive tutorial on how to create/use guided alignments or phrase tables when training the out-of-the-box models from catalog.py. All I could find was a small section on ‘Alignments’ on this page - http://opennmt.net/OpenNMT-tf/data.html
I’m really new to this, so here are a couple of queries I have:
What’s the best way to start generating and using alignments and/or phrase tables?
I have domain-specific phrases in both the source and target languages that I want the model to use during training. How do I add them to the training?
Any guidance is appreciated.
Sorry if this is a duplicate question, as I have not used Lua or other repos from OpenNMT before.
The alignment file can be produced with fast_align for example:
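Something along these lines should work (file names are placeholders; fast_align expects one tokenized sentence pair per line in the form "source ||| target" and outputs alignments in the Pharaoh format, e.g. 0-0 1-2 2-1):

# build the fast_align input: one tokenized sentence pair per line, separated by |||
paste train.src train.tgt | sed 's/\t/ ||| /' > train.src-tgt
# -d, -o, -v are the settings recommended in the fast_align README
fast_align -i train.src-tgt -d -o -v > train-alignment.txt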
Then, you just need to extend the configuration to enable guided alignment:
data:
  train_alignments: train-alignment.txt
params:
  guided_alignment_type: ce
  guided_alignment_weight: 1
Regarding phrase tables, the documentation does not mention this feature, so it is something you need to handle on your side using the alignments that are returned by the model.
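As a rough illustration of what that post-processing could look like (only a sketch, not part of OpenNMT-tf: it assumes you requested hard alignments at inference time and that your phrase table is a simple source-to-target dictionary):

# sketch: overwrite target tokens whose aligned source token is in the phrase table
def apply_phrase_table(src_tokens, hyp_tokens, alignment, phrase_table):
    out = list(hyp_tokens)
    for pair in alignment.split():  # Pharaoh pairs such as "0-0 1-2 2-1"
        s, t = map(int, pair.split("-"))
        if src_tokens[s] in phrase_table:
            out[t] = phrase_table[src_tokens[s]]
    return out

# example usage with a tiny in-memory phrase table (normally loaded from a file)
phrase_table = {"Vorstand": "board"}
src = "der Vorstand tagt morgen".split()
hyp = "the committee meets tomorrow".split()
print(" ".join(apply_phrase_table(src, hyp, "0-0 1-1 2-2 3-3", phrase_table)))
# -> the board meets tomorrow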
Where can I see all the available params associated with ‘train_alignments’, so I can use different types etc.?
For phrase tables, if I’m reading it correctly: this functionality exists in other repositories (with the ‘-phrase_table’ option) but is not yet implemented in OpenNMT-tf.
To return the alignment history, what params do I need to set (assuming I’m using one of the out-of-the-box models like Transformer / NMTBig etc. from catalog.py)?
Yes, I was referring to the same guided_alignment_* params. I was curious where in the project code base they are set, and whether there are any other ones apart from guided_alignment_type and guided_alignment_weight that I should be aware of.
For phrase tables, inference with the infer option with_alignments: hard makes sense. I’m still confused about how to add the custom domain phrases that I have.
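If I understand correctly, that means adding something like this to the inference section of the configuration:

infer:
  with_alignments: hard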
Thanks for the link.
Yes, I’m looking for domain adaptation (sorry if I did not mention it earlier).
I did find some really good posts on this forum about adaptive training, but they are all geared towards the OpenNMT-Lua or OpenNMT-py repositories. For the OpenNMT-tf repo I could not find many, so I just wanted to get your thoughts on whether the following is possible:
1. Train a base model without any alignments file.
2. Fine-tune the above model (domain adaptation) using:
   a. domain-specific data with an alignments file and a new vocabulary
From the posts, I gather that for fine-tuning the model the vocabulary cannot be changed (unless we use tricks like BPE etc.).
Also, does the alignments file need to be present even for the base model?
It’s the same approach for OpenNMT-tf. There are additional pointers in this section of the documentation, including how to change the vocabulary which is actually possible:
Alignment files are not specifically related to domain adaptation. Where did you read that? They are mostly helpful if you are training Transformer models but still want to retrieve alignment information for additional post-processing.
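To make the second step concrete, a fine-tuning run can look roughly like this (file names are placeholders and the exact onmt-main invocation depends on your OpenNMT-tf version; if the vocabulary changes, the checkpoint also needs to be updated first as described in the linked section):

# finetune.yml: keep the base model_dir so training continues from its latest checkpoint,
# but point the data section to the in-domain files
model_dir: run/base_model
data:
  train_features_file: indomain.train.src
  train_labels_file: indomain.train.tgt
  train_alignments: indomain-alignment.txt

Then launch training as usual, e.g.:

onmt-main train_and_eval --model_type Transformer --auto_config --config finetune.yml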
No worries. It was a separate question, not related to domain adaptation.
I guess I can then add the alignment files while performing the second step (fine-tuning), assuming I did not add them while training the base model in the first step, hoping that it will help the Transformer model capture the domain structure better.
@guillaumekln
Follow-up question:
I managed to successfully create the alignment file in the Pharaoh format (fast_align needed tokenized files as input, so I ran the training datasets through the onmt-tokenize-text module using the configuration below):

mode: aggressive
joiner_annotate: true
segment_numbers: true
segment_alphabet_change: true
I’m planning to run tokenization and BPE on the fly. My question is about the source and target vocabulary files that I need to pass in the training configuration (source_words_vocabulary and target_words_vocabulary): do I generate them from the raw files or from the tokenized and BPE-processed files? I’m using the onmt-build-vocab module.
Presently I have this:

onmt-build-vocab --tokenizer OpenNMTTokenizer --tokenizer_config /home/ubuntu/mukund_onmt/token_config_bpe.yml --size 50000 --save_vocab onmt_bpe_vocab/src_vocab.txt un_train_tokenized.en
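And, if I’m reading the tokenization docs correctly, the data section of my training configuration would then look roughly like this (paths are placeholders for my actual files):

data:
  source_words_vocabulary: onmt_bpe_vocab/src_vocab.txt
  target_words_vocabulary: onmt_bpe_vocab/tgt_vocab.txt
  source_tokenization: token_config_bpe.yml
  target_tokenization: token_config_bpe.yml
  train_features_file: un_train.en
  train_labels_file: un_train.tgt
  train_alignments: train-alignment.txt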