Build a Glossary

SamuelLacombe · August 30, 2021, 8:54pm

Hello,

I was wondering if anyone has found a way to generate a glossary with OpenNMT?

My goal is to generate a glossary, fix it manually, and then use it for training. I’m open to suggestions if there are better ways to do that out there!

tel34 · August 31, 2021, 8:14am

Did you try fast_align? GitHub - clab/fast_align: Simple, fast unsupervised word aligner

SamuelLacombe · August 31, 2021, 8:15am

I have not, but I will give it a shot, thanks!

argosopentech · September 19, 2021, 9:27pm

I have a Wiktionary scraping script for dictionary data.

onmt-models/generate-wiktionary-data at 9b51b1446421e859644725058c37d3354d9e3d8d · argosopentech/onmt-models · GitHub

SamuelLacombe · September 22, 2021, 2:57am

Thank you, but I need to build a glossary or custom dictionary based on my own data.

But I’ll keep your script in mind. It could be handy in some situations!

SamuelLacombe · December 6, 2021, 4:34pm

Hello Terence,

Thanks for your suggestion. I’ve been playing around with fast_align and I’m really happy with the results!

I’m thinking of taking it a little further by training a custom model and using its evaluation feature to get the model's prediction score for each aligned piece generated by fast_align. I wonder if you have ever tried that? I’m hoping to get even better results with this kind of filtering.

Best regards,
Samuel

tel34 · December 6, 2021, 7:24pm

Hello Samuel, glad you found fast_align useful, and you certainly got further than I did. I had intended to do some experiments but got side-tracked by other things.

SamuelLacombe · January 31, 2022, 4:47am

Just to give some feedback:

In the end, I used fast_align to produce the “Pharaoh”-style alignment.
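For anyone unfamiliar with that format: each output line holds space-separated `i-j` pairs, meaning source token `i` is aligned to target token `j` (both 0-indexed). Here is a minimal sketch of reading it in Python. The function names and the example sentence are made up for illustration, and it assumes whitespace-tokenized sentences.

```python
def parse_pharaoh(line):
    """Parse one Pharaoh-format line such as '0-0 1-2 2-1' into index pairs."""
    pairs = []
    for token in line.split():
        src_idx, tgt_idx = token.split("-")
        pairs.append((int(src_idx), int(tgt_idx)))
    return pairs


def aligned_word_pairs(src_sentence, tgt_sentence, alignment_line):
    """Turn index pairs into (source word, target word) pairs.

    Assumes both sentences are tokenized on whitespace, the same way
    they were when the aligner was run.
    """
    src = src_sentence.split()
    tgt = tgt_sentence.split()
    return [(src[i], tgt[j]) for i, j in parse_pharaoh(alignment_line)]


# Hypothetical example sentence pair and alignment line.
pairs = aligned_word_pairs("the red house", "la maison rouge", "0-0 1-2 2-1")
print(pairs)
```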

From that I generate the 1-grams, and then I assign an “accuracy indicator” to each 1-gram based on its frequency and on that frequency as a percentage of the source word's frequency. (Make sure to keep track of the ID of the original sentence.) Then I generate all the n-grams and keep only the ones whose patterns meet these requirements:

The first word is considered accurate.
The last word is considered accurate.
No two consecutive words are considered inaccurate.
There must be at least two consecutive accurate words between any two inaccurate words.
The maximum minus the minimum of the target positions (word order) must equal the maximum minus the minimum of the source positions (word order).
The target positions, once sorted, must all be consecutive.
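
The rules above could be sketched roughly as follows. This is only an illustration under some assumptions, not the exact code used here: the thresholds `min_count` and `min_ratio`, the `max_n` limit, and all function names are hypothetical, and each source word is assumed to have at most one aligned target word (spans containing unaligned source words are skipped).

```python
from collections import Counter


def accurate_pairs(word_pairs, min_count=2, min_ratio=0.5):
    """Return the (src, tgt) word pairs judged accurate.

    A pair qualifies if it is frequent enough and accounts for a large
    enough share of the source word's alignments. Both thresholds are
    illustrative assumptions that need tuning on real data.
    """
    pair_counts = Counter(word_pairs)
    src_counts = Counter(src for src, _ in word_pairs)
    return {
        pair
        for pair, n in pair_counts.items()
        if n >= min_count and n / src_counts[pair[0]] >= min_ratio
    }


def flags_ok(flags):
    """Check the accuracy pattern (True = accurate) of one n-gram."""
    if not flags[0] or not flags[-1]:
        return False
    bad = [k for k, ok in enumerate(flags) if not ok]
    # Two inaccurate words need at least two accurate words between them,
    # which also rules out two consecutive inaccurate words.
    return all(b - a >= 3 for a, b in zip(bad, bad[1:]))


def positions_ok(src_pos, tgt_pos):
    """Check the two word-order rules on source and target positions."""
    same_span = max(tgt_pos) - min(tgt_pos) == max(src_pos) - min(src_pos)
    ordered = sorted(tgt_pos)
    contiguous = all(b - a == 1 for a, b in zip(ordered, ordered[1:]))
    return same_span and contiguous


def extract_ngrams(src, tgt, links, accurate, max_n=4):
    """Collect n-gram pairs (n >= 2) from one aligned sentence pair.

    src, tgt: token lists; links: dict mapping a source index to its
    target index (assumed one link per source word).
    """
    results = []
    for start in range(len(src)):
        for end in range(start + 1, min(start + max_n, len(src))):
            src_pos = list(range(start, end + 1))
            if any(i not in links for i in src_pos):
                continue  # skip spans with unaligned source words
            tgt_pos = [links[i] for i in src_pos]
            flags = [(src[i], tgt[links[i]]) in accurate for i in src_pos]
            if flags_ok(flags) and positions_ok(src_pos, tgt_pos):
                results.append((
                    " ".join(src[i] for i in src_pos),
                    " ".join(tgt[j] for j in sorted(tgt_pos)),
                ))
    return results
```

In a real pipeline, `accurate_pairs` would be computed once over the word pairs of the whole corpus, and `extract_ngrams` would then run per sentence, with the sentence ID kept alongside each result.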

This technique gives me perfect n-grams. Of course, you need to fine-tune the algorithm you use to decide what counts as accurate, but from the n-grams I have looked at and the results of the models, everything is perfect. I couldn’t be happier! My models are always right on, and even with a large beam size the suggestions are accurate.

So in the end, I didn’t have to train a model and run it on top of the alignments to filter out the n-grams (see my post above), which would have added a lot of overhead to the training process.

Thanks again!

ymoslem · January 31, 2022, 1:52pm

Good. Thanks for sharing, Samuel! Do you mean you used this glossary to augment your training data for an NMT model?

SamuelLacombe · January 31, 2022, 8:55pm

Yes sir. I used the ngrams in my training.