I was wondering if anyone has experience with generating a glossary with OpenNMT?
My goal is to generate a glossary so I can fix it manually and then use it for training. I’m open to suggestions if there are better ways to do that out there!
Thanks for your suggestion. I’ve been playing around with fast_align and I’m really happy with the results!
I’m thinking of leveraging it a little more by training a custom model and using its evaluation mode to get the model’s prediction score on each aligned piece generated by fast_align. I wonder if you have ever tried that? I’m hoping to get even better results with this kind of filtering.
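In case it helps to picture it, here is a rough sketch of the filtering I have in mind, not something I have actually run yet; `score_fragment` is a hypothetical stand-in for however the model’s score would be read back, and the threshold is just illustrative:

```python
def filter_by_model_score(fragments, score_fragment, threshold=-2.0):
    """Keep only the aligned fragments the model itself scores highly.

    fragments      -- iterable of (source_phrase, target_phrase) pairs
                      extracted from the fast_align output
    score_fragment -- hypothetical callable returning the model's score
                      (e.g. a log-probability) for one pair
    threshold      -- illustrative cut-off, to be tuned
    """
    return [pair for pair in fragments if score_fragment(*pair) >= threshold]
```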
Hello Samuel, glad you found fast_align useful, and you certainly got further than I did. I had intended to do some experiments but got side-tracked by other things.
In the end, I used fast_align to provide the “Pharaoh”-like alignment.
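For reference, the Pharaoh-style output is just one line of `i-j` pairs per sentence pair (source index, target index). A minimal way to read it back, assuming that standard format:

```python
def parse_pharaoh(line):
    """Turn a fast_align line like "0-0 1-2 2-1" into (src, tgt) index pairs."""
    return [tuple(map(int, pair.split("-"))) for pair in line.split()]

# parse_pharaoh("0-0 1-2 2-1") -> [(0, 0), (1, 2), (2, 1)]
```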
From that I generate the 1-grams, and then I assign an “accuracy indicator” to each 1-gram based on its frequency and on its percentage of the source word’s frequency. (Make sure to keep track of the id of the original sentence.) Then I generate all the n-grams and only keep the ones whose patterns meet these requirements (there is a sketch of the logic after the list):
The first word was considered accurate.
The last word was considered accurate.
No two consecutive words were considered inaccurate.
There must be at least two consecutive words considered accurate between two words considered inaccurate.
The maximum minus the minimum of the target positions (word order) must be equal to the maximum minus the minimum of the source positions (word order).
The target positions (of the words), when reordered, must all be consecutive.
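To make that concrete, here is a minimal sketch of how the accuracy flags and the requirements above could be coded. The data structures and the `min_count` / `min_ratio` thresholds are illustrative only; as I say below, the exact tuning of what counts as accurate depends on your corpus:

```python
from collections import Counter

def accuracy_flags(aligned_pairs, min_count=5, min_ratio=0.5):
    """Flag each aligned (src_word, tgt_word) 1-gram as accurate when it is
    frequent enough and covers a large enough share of the source word's
    occurrences.  min_count / min_ratio are illustrative thresholds."""
    pair_freq = Counter(aligned_pairs)
    src_freq = Counter(src for src, _ in aligned_pairs)
    return {pair: count >= min_count and count / src_freq[pair[0]] >= min_ratio
            for pair, count in pair_freq.items()}

def keep_ngram(accurate, src_positions, tgt_positions):
    """Apply the pattern requirements to one candidate n-gram.

    accurate      -- one bool per source word of the n-gram (accuracy flag)
    src_positions -- source word indices covered by the n-gram
    tgt_positions -- target word indices aligned to those source words
    """
    # The first and the last word must have been judged accurate.
    if not (accurate[0] and accurate[-1]):
        return False

    # No two consecutive inaccurate words, and at least two accurate
    # words between any two inaccurate ones.
    last_bad = None
    for i, ok in enumerate(accurate):
        if not ok:
            if last_bad is not None and i - last_bad < 3:
                return False
            last_bad = i

    # The source and target spans must have the same width...
    if max(tgt_positions) - min(tgt_positions) != max(src_positions) - min(src_positions):
        return False

    # ...and the target positions, once reordered, must be consecutive.
    ordered = sorted(tgt_positions)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))
```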
Using this technique gives me perfect n-grams. Of course, you need to do some fine-tuning on the algorithm you use to determine what you consider accurate… but from the n-grams I have looked at and the results of the models, everything is perfect. I couldn’t be happier! My models are always right on, and even with a high beam search the suggestions are accurate.
So in the end, I didn’t have to train a model and rerun it on top to filter out the n-grams (see my post above), which would have added a lot of overhead to the training process.