Comparing perplexity across different subword segmentations

Hello everyone,

I have a question regarding the comparison of subword perplexities across different segmentations.

I’ve encoded a technical-domain Q&A dataset (English-to-English) with different numbers of BPE merge operations (32k, 16k, 4k, 3k), i.e., four datasets in total, each with a different byte-pair encoding. A Transformer-base model is trained on each dataset (each split into train, validation and test sets).
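
For context, the segmentation step looks roughly like this. This is a minimal sketch assuming the subword-nmt package (https://github.com/rsennrich/subword-nmt); the file names are placeholders:

```python
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

for merges in (32000, 16000, 4000, 3000):
    codes_path = f"bpe_{merges}.codes"

    # Learn a separate set of merge operations for each configuration
    with open("train.txt", encoding="utf-8") as infile, \
         open(codes_path, "w", encoding="utf-8") as outfile:
        learn_bpe(infile, outfile, num_symbols=merges)

    # Apply the learned merges to produce one segmented copy of the data
    with open(codes_path, encoding="utf-8") as codes:
        bpe = BPE(codes)
    with open("train.txt", encoding="utf-8") as infile, \
         open(f"train.bpe{merges}.txt", "w", encoding="utf-8") as outfile:
        for line in infile:
            outfile.write(bpe.process_line(line))
```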

Focusing on comparing the perplexity of these models:

As noted here, the number of tokens in the denominator of the perplexity formula, |Y|, will vary for the same sentence depending on whether it is, e.g., a word-level or a character-level model. Consequently, the perplexity is affected by the segmentation of the data, and perplexities computed over different segmentations cannot be directly compared.
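
Concretely, with PPL(Y) = exp(-(1/|Y|) · Σ log p(y_i | y<i)), the same cumulative log-likelihood yields different perplexities under different |Y|. A toy illustration (all numbers hypothetical):

```python
import math

# Suppose each model assigned the same total log-likelihood to a
# sentence, but segmented it into a different number of tokens
total_log_likelihood = -12.0  # hypothetical cumulative log p(Y), in nats

for name, num_tokens in [("word-level (|Y| = 6) ", 6),
                         ("BPE 4k     (|Y| = 9) ", 9),
                         ("char-level (|Y| = 30)", 30)]:
    # Per-token perplexity: the denominator changes, so PPL changes
    ppl = math.exp(-total_log_likelihood / num_tokens)
    print(f"{name}: PPL = {ppl:.2f}")
```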

As such, I’m wondering how to go about comparing the translation perplexities of my models:

Guided by this post, I’m wondering whether it would make sense to use the translate.py script, providing the segmented test-split data via --src and the unsegmented data as the gold reference via --tgt.

If I understand it correctly, the gold perplexity would then be computed from the model’s own predictions by the following formula:

PPL = exp(loss / num_target_words)

where the loss would be the cumulative negative log-likelihood of the model’s own predictions and num_target_words would be the number of words/tokens in the unsegmented data provided by --tgt. In that case, would |Y| be the number of tokens in the unsegmented data?
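
To make the renormalization I have in mind concrete, here is a rough sketch (all values are hypothetical placeholders): accumulate the loss over the subword tokens that the model actually scores, but divide by the word count of the unsegmented references, so that |Y| is identical for all four models:

```python
import math

# Hypothetical per-subword-token log-probabilities for the whole test
# set, as a model would produce when scoring the gold target via --tgt
subword_log_probs = [-1.2, -0.4, -2.1, -0.8, -1.5, -0.3, -1.9]
loss = -sum(subword_log_probs)  # cumulative negative log-likelihood

# Unsegmented references (toy examples); |Y| is counted here, so the
# denominator does not depend on how the model's input was segmented
references = ["how do I reset the device ?",
              "check the configuration file ."]
num_target_words = sum(len(ref.split()) for ref in references)

ppl = math.exp(loss / num_target_words)
print(f"word-normalized PPL = {ppl:.2f}")
```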

Kind Regards,
Lukas