I’m using the OpenNMT-tf repository, and I could not find any extensive tutorial on how to create or use guided alignments or phrase tables for training the out-of-the-box models from catalog.py. All I could find was a small section on ‘Alignments’ on this page: http://opennmt.net/OpenNMT-tf/data.html
I’m really new to this, so here are a couple of questions:
What’s the best way to start generating and using alignments and/or phrase tables?
I have domain-specific phrases in both the source and target languages that I want the model to use. How do I add them to the training?
Any guidance is appreciated.
Sorry if this is a duplicate question, as I have not used Lua or other repos from OpenNMT before.
The alignment file can be produced with fast_align for example:
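As a rough sketch of the data preparation (not taken from the OpenNMT-tf documentation): fast_align reads one sentence pair per line in the form `source tokens ||| target tokens` and writes alignments in the Pharaoh format, i.e. `i-j` pairs of source and target token indices. The helper names and sample sentences below are hypothetical.

```python
def to_fast_align_lines(src_sentences, tgt_sentences):
    """Pair tokenized sentences as 'source ||| target' lines for fast_align."""
    if len(src_sentences) != len(tgt_sentences):
        raise ValueError("source and target must have the same number of lines")
    # Keep every pair, in order, so the alignment file stays line-aligned
    # with the training files.
    return [
        f"{src.strip()} ||| {tgt.strip()}"
        for src, tgt in zip(src_sentences, tgt_sentences)
    ]


def parse_pharaoh(line):
    """Parse a Pharaoh line such as '0-0 1-2' into [(0, 0), (1, 2)]."""
    pairs = []
    for item in line.split():
        src_idx, tgt_idx = item.split("-")
        pairs.append((int(src_idx), int(tgt_idx)))
    return pairs


# Made-up, already tokenized sentences.
src = ["hello world", "good morning"]
tgt = ["bonjour le monde", "bonjour"]
fa_lines = to_fast_align_lines(src, tgt)
print(fa_lines[0])
```

The resulting file can then be passed to fast_align, for example with `fast_align -i corpus.src-tgt -d -o -v > train-alignment.txt` (the flags suggested in the fast_align README; the file names are placeholders).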
Then, you just need to extend the configuration to enable guided alignment:
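A minimal sketch of what that extension might look like, written as the Python dictionary equivalent of the YAML configuration. The key names `train_alignments`, `guided_alignment_type` and `guided_alignment_weight` are the ones discussed in this thread and in the OpenNMT-tf documentation; the file path and the chosen values are assumptions to adapt to your setup.

```python
# Configuration sketch: the dictionary mirrors the layout of the YAML file.
config = {
    "data": {
        # Pharaoh-format alignments, one line per training example
        # (hypothetical path).
        "train_alignments": "train-alignment.txt",
    },
    "params": {
        # "ce" selects a cross-entropy loss on the attention.
        "guided_alignment_type": "ce",
        # Weight of the alignment loss in the total loss (assumed value).
        "guided_alignment_weight": 1,
    },
}


def to_yaml(mapping, indent=0):
    """Render a nested dict of scalars as simple YAML lines."""
    lines = []
    prefix = " " * indent
    for key, value in mapping.items():
        if isinstance(value, dict):
            lines.append(f"{prefix}{key}:")
            lines.extend(to_yaml(value, indent + 2))
        else:
            lines.append(f"{prefix}{key}: {value}")
    return lines


yaml_text = "\n".join(to_yaml(config))
print(yaml_text)
```

The printed text can be merged into the existing training configuration file; the rest of the configuration stays unchanged.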
Regarding phrase tables, the documentation does not mention this feature so that is something you need to handle on your side using the alignments that are returned by the model.
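One way to handle this on your side, sketched under assumptions: take the source tokens, the predicted target tokens, and the hard alignments, then overwrite the target token aligned to any source token found in a custom dictionary. The function name and the single-token dictionary are hypothetical, and this is a post-processing idea rather than an OpenNMT-tf feature.

```python
def apply_phrase_table(src_tokens, tgt_tokens, alignments, phrase_table):
    """Replace target tokens aligned to source tokens listed in phrase_table.

    alignments: iterable of (source_index, target_index) pairs.
    phrase_table: dict mapping a source token to its required translation.
    """
    output = list(tgt_tokens)
    for src_idx, tgt_idx in alignments:
        if src_idx >= len(src_tokens) or tgt_idx >= len(output):
            continue  # ignore pairs that point outside the sentences
        replacement = phrase_table.get(src_tokens[src_idx])
        if replacement is not None:
            output[tgt_idx] = replacement
    return output


# Made-up example: force the domain term "stent" to be kept as-is.
src = ["the", "stent", "was", "removed"]
tgt = ["le", "tube", "a", "été", "retiré"]
align = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4)]
result = apply_phrase_table(src, tgt, align, {"stent": "stent"})
print(" ".join(result))
```

This only covers one-to-one token replacements; multi-token phrases and subword units such as BPE pieces would need extra handling.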
Thanks @guillaumekln .
Where can I see all the available params associated with ‘train_alignments’, so I can use different types, etc.?
For phrase tables, if I’m reading it correctly:
- This functionality is in other repositories (with the ‘-phrase_table’ option), but it’s not yet implemented in OpenNMT-tf.
- To return the alignment history, what params do I need to set (assuming I’m using one of the out-of-the-box models like Transformer / NMTBig etc. from catalog.py)?
Appreciate your help !
I’m not sure which params you are referring to. The guided_alignment_* options are the only ones that are related to train_alignments.
The alignment history is already returned by the model; how you get it depends on how you use the model:
- For file translation, set the infer option with_alignments: hard to display alignments.
- For model server, “alignment” is already one of the returned fields.
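For file translation, the hard alignments can then be read back from the output. The sketch below assumes each translated line has the form `translation ||| alignments`, with the alignments in Pharaoh format; check this against the output of your OpenNMT-tf version, as the exact layout is an assumption here, and the function name is hypothetical.

```python
def split_translation_line(line):
    """Split 'translation ||| 0-0 1-1' into (translation, alignment pairs)."""
    text, sep, align = line.rstrip("\n").rpartition(" ||| ")
    if not sep:
        # No separator: the whole line is the translation.
        return align, []
    pairs = [tuple(int(i) for i in item.split("-")) for item in align.split()]
    return text, pairs


translation, pairs = split_translation_line("Bonjour le monde ||| 0-0 1-1 1-2\n")
print(translation)
print(pairs)
```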
Yes, I was referring to the same guided_alignment_* params. I was curious where in the project codebase they are set, and whether there are any others apart from guided_alignment_weight that I should be aware of.
For phrase tables, inference makes sense: use the infer option with_alignments: hard. I’m still confused about how to add the custom domain phrases that I have.
The full list of params is here:
Are you looking to achieve domain adaptation? If so, this has been discussed several times on the forum.
Thanks for the link.
Yes, I’m looking for domain adaptation (sorry if I did not mention it earlier).
I did find some really good posts in this forum on adaptive training, but they are all geared towards the OpenNMT-py repository. For the OpenNMT-tf repo I could not find many, so I wanted to get thoughts on whether the following is possible:
- Train a base model without any alignment file.
- Fine-tune the above model (domain adaptation) using domain-specific data with an alignment file and a new vocabulary.
For the fine-tuning of the model, I gather from the posts that the vocabulary cannot be changed (unless we use tricks like BPE etc.).
Also, does the alignment file need to be present even for the base model?
It’s the same approach for OpenNMT-tf. There are additional pointers in this section of the documentation, including how to change the vocabulary which is actually possible:
Alignment files are not specifically related to domain adaptation. Where did you read that? They are mostly helpful if you are training Transformer models but still want to retrieve alignment information for additional postprocessing.
Thanks. It makes more sense now.
No worries. It was a separate question, not related to domain adaptation.
I guess I can then add alignment files in the second step, fine-tuning (assuming I did not add them while training the base model in the first step), hoping that it will help the Transformer model capture domain structure better.
Follow-up question:
I managed to create the alignment file in the Pharaoh format. fast_align needed tokenized input, so I ran the training datasets through the onmt-tokenize-text module using a tokenizer configuration that includes:
segment_alphabet_change: true
I’m planning to run tokenization and BPE on the fly. My question is about the source and target vocabulary files that I need to pass in the training configuration file (e.g. target_words_vocabulary): do I generate them from the raw files or from the tokenized, BPE-processed files? I’m using the onmt-build-vocab module.
Presently I have this:
onmt-build-vocab --tokenizer OpenNMTTokenizer --tokenizer_config /home/ubuntu/mukund_onmt/token_config_bpe.yml --size 50000 --save_vocab onmt_bpe_vocab/src_vocab.txt un_train_tokenized.en
token_config_bpe.yml looks like below:
You should run the build-vocab script on the tokenized files, the ones that will be used during the training.
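To illustrate why: the model only ever sees the tokens produced by the tokenizer (including BPE pieces), so the vocabulary entries have to be those same tokens. The sketch below simply counts whitespace-separated tokens in tokenized text. It is not how onmt-build-vocab is implemented, and the sample lines, with `￭` standing in for a joiner marker, are made up.

```python
from collections import Counter


def build_vocab(tokenized_lines, size):
    """Return the `size` most frequent whitespace-separated tokens."""
    counts = Counter()
    for line in tokenized_lines:
        counts.update(line.split())
    return [token for token, _ in counts.most_common(size)]


# Text as the model would see it after tokenization and BPE (made-up sample).
lines = ["low￭ er low￭ est", "low￭ er"]
vocab = build_vocab(lines, 2)
print(vocab)
```

A vocabulary built from the raw text would instead contain full words such as "lower", which never appear in the tokenized training data.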
Thanks for the confirmation. I think I’m using the same to build the vocab.