Hi,
I spotted a possible bug in merge_vocabs
function (line 77-92 of onmt/io/IO.py
). When we set the minimum frequency of the source and the target vocabularies to be > 1, the minimum frequency constraint will be lost, as merge_vocabs
populates the vocab.freqs
of the source and target vocabularies, containing the unfiltered vocabulary. See Torchtext package code (line 64-67 of text/vocab.py
), where filtered-by-frequency vocabulary is stored in .itos
but not in .freqs
.
Would it be better to add a piece of code between line 88 and 89 of onmt/io/IO.py
, such that merged
frequency counts only contain words that exist in vocab.itos
?