Build a Glossary


Has anyone found a way to generate a glossary with OpenNMT?

My goal is to generate a glossary, fix it manually, and then use it for training. I’m open to suggestions if there are better ways to do that!

Did you try fast_align (GitHub: clab/fast_align), a simple, fast unsupervised word aligner?
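For anyone following along, fast_align reads one sentence pair per line in the form `source ||| target`, with both sides already tokenized. Below is a minimal Python sketch of preparing that input. The function names and the output path are hypothetical, and skipping pairs with an empty side is a precaution assumed here, not something taken from this thread.

```python
def to_fast_align_line(source_tokens, target_tokens):
    """Join one tokenized sentence pair as `source ||| target`."""
    return " ".join(source_tokens) + " ||| " + " ".join(target_tokens)


def write_fast_align_input(pairs, path):
    """Write sentence pairs to `path`, one per line.

    Pairs with an empty side are skipped (an assumption made for safety).
    Returns the original indices of the pairs that were written, so each
    alignment line can later be traced back to its sentence.
    """
    kept = []
    with open(path, "w", encoding="utf-8") as handle:
        for index, (source_tokens, target_tokens) in enumerate(pairs):
            if not source_tokens or not target_tokens:
                continue
            handle.write(to_fast_align_line(source_tokens, target_tokens) + "\n")
            kept.append(index)
    return kept


# Example: to_fast_align_line(["das", "Haus"], ["the", "house"])
# gives "das Haus ||| the house".
#
# The fast_align README then runs the aligner like this:
#   ./fast_align -i text.fr-en -d -o -v > forward.align
```

Keeping the returned indices around is useful later, since the alignment output has one line per input line and nothing else to identify the sentence.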

I have not, but I will give it a shot. Thanks!

I have a Wiktionary scraping script for dictionary data.

Thank you, but I need to build a glossary or custom dictionary based on my own data.

But I’ll keep your script in mind. It could come in handy in some situations!

Hello Terence,

Thanks for your suggestion. I’ve been playing around with fast_align and I’m really happy with the results!

I’m thinking of taking it a little further: training a custom model and using its evaluation step to get the model’s prediction score for each aligned piece generated by fast_align. Have you ever tried that? I’m hoping to get even better results with this kind of filtering.

Best regards,

Hello Samuel, glad you found fast_align useful, and you certainly got further than I did. I had intended to do some experiments but got side-tracked by other things.

Just to give some feedback:

In the end, I used fast_align to provide the “Pharaoh”-style alignment.
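For reference, the Pharaoh format writes each alignment link as `i-j`, meaning source token `i` is aligned to target token `j` (both zero-based), with links separated by spaces. A minimal sketch of reading one such line follows; the function names are hypothetical.

```python
def parse_pharaoh_line(line):
    """Parse a line such as "0-0 1-2 2-1" into (source, target) index pairs."""
    links = []
    for token in line.split():
        source_index, target_index = token.split("-")
        links.append((int(source_index), int(target_index)))
    return links


def aligned_word_pairs(source_tokens, target_tokens, links):
    """Turn index links into (source word, target word) pairs, i.e. 1-grams."""
    return [(source_tokens[i], target_tokens[j]) for i, j in links]


# Example:
# links = parse_pharaoh_line("0-0 1-2 2-1")
# aligned_word_pairs(["la", "maison", "bleue"], ["the", "blue", "house"], links)
# gives [("la", "the"), ("maison", "house"), ("bleue", "blue")].
```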

From that I generate the 1-grams, and then I assign an “accuracy indicator” to each 1-gram based on its frequency and on that frequency as a percentage of the source word’s frequency. (Make sure to keep track of the ID of the original sentence.) Then I generate all the n-grams and keep only the ones whose patterns meet these requirements:

  • The first word is considered accurate.
  • The last word is considered accurate.
  • No two consecutive words are considered inaccurate.
  • There must be at least two consecutive accurate words between any two inaccurate words.
  • The maximum minus the minimum of the target sequence (word order) must equal the maximum minus the minimum of the source sequence (word order).
  • The target sequence (of the words), once reordered, must be fully consecutive.
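The requirements above can be sketched in Python roughly as follows. This is one reading of the rules, not the exact code used here, and it makes a simplifying assumption: every source word in a candidate span has exactly one aligned target position (spans containing an unaligned word are rejected). The names `span_passes`, `candidate_ngrams`, and `max_len` are hypothetical.

```python
def span_passes(accurate, target_positions):
    """Check one contiguous source span against the six requirements.

    accurate: list of bool, one per source word in the span.
    target_positions: aligned target index per source word (None if unaligned).
    """
    if not accurate or len(accurate) != len(target_positions):
        return False
    if any(position is None for position in target_positions):
        return False
    # Rules 1 and 2: first and last words must be accurate.
    if not accurate[0] or not accurate[-1]:
        return False
    # Rules 3 and 4: any two inaccurate words need at least two accurate
    # words between them (a gap of zero means they are adjacent).
    inaccurate = [i for i, ok in enumerate(accurate) if not ok]
    for left, right in zip(inaccurate, inaccurate[1:]):
        if right - left - 1 < 2:
            return False
    # Rule 5: target extent equals source extent (span is contiguous,
    # so the source extent is simply its length minus one).
    if max(target_positions) - min(target_positions) != len(accurate) - 1:
        return False
    # Rule 6: sorted target positions are consecutive with no repeats.
    ordered = sorted(target_positions)
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


def candidate_ngrams(source_tokens, target_tokens, accurate, target_positions, max_len=4):
    """Return (source n-gram, target n-gram) pairs of 2..max_len words that pass."""
    results = []
    for start in range(len(source_tokens)):
        last_end = min(start + max_len, len(source_tokens))
        for end in range(start + 2, last_end + 1):
            positions = target_positions[start:end]
            if span_passes(accurate[start:end], positions):
                low, high = min(positions), max(positions)
                results.append((
                    " ".join(source_tokens[start:end]),
                    " ".join(target_tokens[low:high + 1]),
                ))
    return results
```

Under the one-link-per-word assumption, rule 6 already implies rule 5; both checks are kept so the code mirrors the list item by item.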

Using this technique gives me perfect n-grams. Of course, you need to fine-tune the algorithm you use to decide what you consider accurate, but from the n-grams I have looked at and the results of the models, everything looks perfect. I couldn’t be happier! My models are always right on, and even with a large beam size the suggestions are accurate.
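As a rough illustration of what deciding accuracy could look like, here is a minimal sketch that marks a 1-gram as accurate when the word pair is frequent enough and accounts for a large enough share of the source word’s aligned occurrences. The thresholds `min_count` and `min_share` are made-up placeholders, not the values used here, and the source word frequency is counted over alignment links, which is an assumption.

```python
from collections import Counter


def accurate_pairs(aligned_pairs, min_count=3, min_share=0.5):
    """Return the set of (source word, target word) pairs deemed accurate.

    aligned_pairs: iterable of (source word, target word) tuples collected
    from the word alignments of the whole corpus.
    """
    aligned_pairs = list(aligned_pairs)
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(source for source, _ in aligned_pairs)
    return {
        pair
        for pair, count in pair_counts.items()
        # Frequent enough, and dominant among the source word's alignments.
        if count >= min_count and count / source_counts[pair[0]] >= min_share
    }
```

Both thresholds would need tuning against a sample of manually checked pairs from your own data.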

So in the end, I didn’t have to train a model and run it on top of this to filter out the n-grams (see my post above), which would have added a lot of overhead to the training process.

Thanks again!


Good. Thanks for sharing, Samuel! Do you mean you used this glossary to augment your training data for an NMT model?

Yes sir. I used the n-grams in my training.