<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rotimi Ajigboye</title>
    <description>The latest articles on DEV Community by Rotimi Ajigboye (@rotimi_ajigboye).</description>
    <link>https://dev.to/rotimi_ajigboye</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3392161%2Fed64f726-df12-4d78-b95f-1fb81090f5f1.jpeg</url>
      <title>DEV Community: Rotimi Ajigboye</title>
      <link>https://dev.to/rotimi_ajigboye</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rotimi_ajigboye"/>
    <language>en</language>
    <item>
      <title>Perplexity as a determinant of text quality.</title>
      <dc:creator>Rotimi Ajigboye</dc:creator>
      <pubDate>Tue, 29 Jul 2025 21:43:59 +0000</pubDate>
      <link>https://dev.to/rotimi_ajigboye/perplexity-as-a-determinant-of-text-quality-2i5b</link>
      <guid>https://dev.to/rotimi_ajigboye/perplexity-as-a-determinant-of-text-quality-2i5b</guid>
      <description>&lt;p&gt;In Natural Language Processing, perplexity is a measure of how well a language model predicts a text sequence starting from the first token in the sequence. &lt;/p&gt;

&lt;p&gt;Consider a text sequence that starts with "Elephants". After the word "Elephants" there are many possible options for the next word in the sequence. A few examples are below:&lt;/p&gt;

&lt;p&gt;Elephants are...&lt;br&gt;
Elephants do...&lt;br&gt;
Elephants eat...&lt;br&gt;
Elephants weigh...&lt;/p&gt;

&lt;p&gt;Based on the data the language model was trained on, each candidate next word, such as "are", "do", "eat", or "weigh", is assigned a probability of being the next word in the sequence. The higher a word's probability, the more confident the language model is that it comes next.&lt;/p&gt;

&lt;p&gt;Now let us assume that a piece of text begins with "Elephants eat". Again, there are many possible next words, such as "grass", "vegetables", or "meat". Each option has a probability assigned to it, and "meat" would naturally be less probable than "grass" or "vegetables".&lt;/p&gt;

&lt;p&gt;Elephants eat grass - most probable&lt;br&gt;
Elephants eat vegetables - second most probable&lt;br&gt;
Elephants eat meat - least probable&lt;/p&gt;
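&lt;p&gt;To make this concrete, here is a minimal sketch of what such a next-word distribution might look like and how the most confident prediction is chosen. The words and probability values are purely illustrative, not taken from any real model.&lt;/p&gt;

```python
# Hypothetical probabilities a model might assign to the word that
# follows "Elephants" (illustrative values only; they sum to 1).
next_word_probs = {"are": 0.40, "eat": 0.25, "weigh": 0.20, "do": 0.10, "fly": 0.05}

# The model's most confident prediction is the highest-probability word.
most_likely = max(next_word_probs, key=next_word_probs.get)
print(most_likely)
```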

&lt;p&gt;Simply put, perplexity measures how probable a text sequence is to the model: it is the inverse of the sequence probability, normalized by the number of tokens. Mathematically it is calculated as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy80mlkgznikhrkpqfr4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy80mlkgznikhrkpqfr4.png" alt="Mathematical Calculation of Perplexity" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This shows an inverse relationship between perplexity and sequence probabilities, indicating that the more probable a sequence is, the less confusing it is to the model. The lower the perplexity of a text sequence, the more confident the model is in its prediction of every token in the sequence.&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
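&lt;p&gt;The formula above can be checked with a small worked example. Using made-up per-token probabilities (not from any real model), perplexity is the exponential of the average negative log-probability per token, which is the same as the inverse geometric mean of the token probabilities.&lt;/p&gt;

```python
import math

# Hypothetical probabilities a model assigns to each token of a
# three-token sequence (illustrative values only).
token_probs = [0.20, 0.10, 0.50]

# Perplexity = exp of the average negative log-probability per token.
n = len(token_probs)
perplexity = math.exp(-sum(math.log(p) for p in token_probs) / n)

# Equivalent form: the inverse geometric mean of the probabilities.
alt = math.prod(token_probs) ** (-1 / n)

print(round(perplexity, 2))
```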
&lt;h2&gt;
  
  
  Calculating the Perplexity of Texts in Python
&lt;/h2&gt;

&lt;p&gt;Import the required libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Initialize your language model and its corresponding tokenizer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Define a function to calculate perplexity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def Perplexity_Calc(text):
    input_ids = tokenizer.encode(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
        perplexity = torch.exp(loss)
        print(f"Perplexity score: {perplexity.item()}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Define your piece of text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;text = "The quick brown fox jumps over the lazy dog"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Calculate the perplexity of the defined text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Perplexity_Calc(text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgp9ys9r3cucmgytx8f8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzgp9ys9r3cucmgytx8f8.png" alt="Examples" width="451" height="267"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;p&gt;Now that we understand what perplexity is, how can it be used in practice?&lt;/p&gt;

&lt;p&gt;One interesting way I have used perplexity is as a selection criterion in an ensemble Optical Character Recognition (OCR) system.&lt;/p&gt;

&lt;p&gt;There are numerous OCR systems out there, each with its own strengths and weaknesses. Some OCR systems perform exceptionally well on clean, high-resolution scanned documents with structured layouts and consistent fonts. Others are designed to handle more complex or variable inputs, such as handwritten notes, forms, or low-quality images, where text may be irregular, skewed, or overlapping with other elements. Each OCR model tends to have its own biases and limitations depending on the type of training data and preprocessing used.&lt;/p&gt;

&lt;p&gt;In the example below, the same image is fed to four different OCR models and different results are obtained from each of the models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssc48lb97j5490uye9ti.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssc48lb97j5490uye9ti.jpg" alt="Performance of four different OCR models on the same image" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the results it is clear that Azure Cognitive Services and AWS Textract both produced the best results, while Tesseract and Google Vision performed poorly.&lt;/p&gt;

&lt;p&gt;Rather than relying on a single OCR model which could perform poorly in certain scenarios, I built an ensemble system that takes outputs from four different OCR models and uses perplexity to decide which one is likely the most accurate. Normally, evaluating the accuracy of OCR outputs requires comparing the outputs to the known original text. By using perplexity, the system can estimate which output makes the most sense linguistically, without ever seeing or knowing the ground truth (correct text). It simply picks the output with the lowest perplexity as the final result.&lt;/p&gt;
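&lt;p&gt;The selection step itself is simple. The sketch below shows the idea with a toy stand-in for the scorer; in the real system each candidate would be scored with the GPT-2 based perplexity function defined earlier. All names and the toy scorer are illustrative.&lt;/p&gt;

```python
def select_lowest_perplexity(candidates, perplexity_of):
    # Keep the OCR output the language model finds most fluent,
    # i.e. the one with the lowest perplexity score.
    return min(candidates, key=perplexity_of)

# Toy scorer standing in for a real language model: garbled text with
# many non-alphanumeric characters gets a higher (worse) score.
def toy_perplexity(text):
    noise = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return 10.0 + 5.0 * noise

ocr_outputs = [
    "The quick brown fox jumps over the lazy dog",
    "Tne qu!ck br0wn f#x jumps 0ver the 1azy d*g",
]
print(select_lowest_perplexity(ocr_outputs, toy_perplexity))
```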

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8trxqoev3evz3g932ks7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8trxqoev3evz3g932ks7.jpg" alt="Model Architecture" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The outcome is an ensemble that produces better results across a diverse set of input images than any of the individual models could achieve on their own.&lt;/p&gt;

&lt;p&gt;So that's one way I've used perplexity. How would you use perplexity in your own projects?&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
