I’ve read the Sequence-Level Knowledge Distillation paper and want to implement the basic word-level distillation on the OpenNMT-py codebase. More precisely, I am using the average distribution of an ensemble of 10 baseline translators as the soft targets (the baseline translators are differently initialized, and the ensemble outperforms any single baseline). However, my current experimental results on IWSLT 2014 show that my distilled translator lags the baseline by about 0.8 BLEU, so I would like to ask about some implementation details of the distillation paper.
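To make my setup concrete, here is roughly the word-level loss I have in mind: a mixture of cross-entropy against the averaged ensemble distribution and cross-entropy against the gold word, as in the usual distillation formulation. This is a minimal NumPy sketch, not my actual OpenNMT-py code, and the function name and `alpha` weighting are my own choices:

```python
import numpy as np

def word_kd_loss(student_logits, teacher_probs, gold_ids, alpha=0.5):
    """Word-level distillation loss over a batch of target positions.

    student_logits: (positions, vocab) raw scores from the student model.
    teacher_probs:  (positions, vocab) averaged ensemble distributions.
    gold_ids:       (positions,) gold target word indices.
    alpha mixes the soft (teacher) and hard (gold) cross-entropy terms.
    """
    # Numerically stable log-softmax over the vocabulary.
    z = student_logits - student_logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))

    # Cross-entropy against the teacher's full distribution.
    soft_ce = -(teacher_probs * log_p).sum(axis=1).mean()
    # Standard NLL against the gold words.
    hard_ce = -log_p[np.arange(len(gold_ids)), gold_ids].mean()

    return alpha * soft_ce + (1.0 - alpha) * hard_ce
```

In my experiments I combine the two terms with equal weight; I would also be interested in what interpolation weight (if any) was used in the paper.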
One problem I faced is that the soft targets over the whole vocabulary are extremely high-dimensional. My current implementation first dumps the soft targets for the whole dataset to disk, then loads them and runs the distillation training, so storing the full-vocabulary distributions is not feasible. I have instead used the top 100 most probable words as an approximation. My question is whether I am going in the right direction. How is this problem handled in the distillation paper?
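For reference, this is how I truncate and renormalize the teacher distributions before dumping them, and how I compute the cross-entropy against the truncated targets at training time. Again a minimal NumPy sketch with names of my own (`topk_soft_targets`, `truncated_soft_ce`), just to show what I mean by the top-100 approximation:

```python
import numpy as np

def topk_soft_targets(teacher_probs, k=100):
    """Keep only the top-k teacher probabilities per position and renormalize,
    so each position stores k (index, prob) pairs instead of the full vocab."""
    idx = np.argsort(teacher_probs, axis=1)[:, -k:]        # top-k word indices
    vals = np.take_along_axis(teacher_probs, idx, axis=1)  # their probabilities
    vals = vals / vals.sum(axis=1, keepdims=True)          # renormalize to sum to 1
    return idx, vals

def truncated_soft_ce(student_log_probs, idx, vals):
    """Cross-entropy against the truncated teacher distribution: gather the
    student log-probs only at the stored top-k indices."""
    gathered = np.take_along_axis(student_log_probs, idx, axis=1)
    return -(vals * gathered).sum(axis=1).mean()
```

One thing I am unsure about is whether to renormalize the truncated probabilities (as above) or leave the truncated mass unassigned; clarification on that detail would also help.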