Possible bug in `merge_vocabs` in `onmt/io/IO.py`?


(Christian Hadiwinoto) #1


I spotted a possible bug in merge_vocabs function (line 77-92 of onmt/io/IO.py). When we set the minimum frequency of the source and the target vocabularies to be > 1, the minimum frequency constraint will be lost, as merge_vocabs populates the vocab.freqs of the source and target vocabularies, containing the unfiltered vocabulary. See Torchtext package code (line 64-67 of text/vocab.py), where filtered-by-frequency vocabulary is stored in .itos but not in .freqs.

Would it be better to add a piece of code between line 88 and 89 of onmt/io/IO.py, such that merged frequency counts only contain words that exist in vocab.itos?

(Guillaume Klein) #2


You should certainly open an issue on the OpenNMT-py repo. There will be more developers there to discuss and eventually address this.