Gabriel Anhaia

Logprobs in Production: 4 Things You Can Actually Do With Them


You ship a classifier. It picks one of four labels for every support ticket. It looks fine in evals at 92% accuracy. The dashboard goes green and you go to lunch. Two weeks later somebody points out that the wrong-label tickets are silently routing themselves to the wrong queue and a small fraction of refund requests are sitting in a billing inbox nobody reads. The model never said "I'm not sure." It just said refund with the same flat voice it used for the 91 other refunds that morning.

The thing is, the model knew. The probability it assigned to refund on that token was 0.34. The next token, billing, was 0.31. It was a coin flip dressed up as a decision. You just threw the coin-flip data away when you ignored the logprobs.

Logprobs are the one thing your inference call returns that nobody on your team is using. Four patterns below are worth wiring in this quarter. They each take a few lines of Python, and they each catch a class of failure your eval set will not.

Vendor support, April 2026

Before any of this is useful, check whether your provider exposes them. As of April 2026:

  • OpenAI Chat Completions supports logprobs=True and top_logprobs, per the OpenAI cookbook on using logprobs. The cookbook walkthrough uses values 0 to 5; check the current API reference for the live ceiling on whichever model snapshot you target.
  • Together AI supports logprobs=1 and returns tokens, token_logprobs, and top_logprobs, per their docs.
  • Groq does not currently support logprobs, logit_bias, or top_logprobs and returns 400 if you pass them, per the Groq OpenAI-compat page.
  • Anthropic's native Claude API does not expose token logprobs. Third-party wrappers and OpenAI-compat layers may show a logprobs: null field; treat that as not-supported and route logprob-dependent paths through a model that returns them. Sophia Willows has a good writeup on logprobs in practice that calls this gap out directly.

The patterns below assume you can hit a model that returns logprobs for at least the routing or classification surface where you want this signal. None of it requires logprobs on every call.
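
If you'd rather verify support at runtime than trust a docs page, a cheap probe is to send a throwaway one-token request with logprobs enabled and see what comes back. A minimal sketch, assuming an OpenAI-compatible client; the helper name is mine:

from openai import OpenAI


def supports_logprobs(client: OpenAI, model: str) -> bool:
    # Hypothetical helper: providers like Groq reject the params with a
    # 400, so any exception is treated as "not supported". A null
    # logprobs field (some compat layers) also counts as unsupported.
    try:
        r = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,
            logprobs=True,
            top_logprobs=1,
        )
    except Exception:
        return False
    lp = r.choices[0].logprobs
    return lp is not None and bool(lp.content)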

1. Confidence escalation on classifier outputs

The first pattern is the simplest. Your model classifies. You read the logprob of the first answer token, convert it to a probability, and escalate when it's below a threshold.

import math
from openai import OpenAI

client = OpenAI()

LABELS = ["refund", "billing", "shipping", "other"]

PROMPT = """Classify this ticket. Reply with exactly one word
from the list: refund, billing, shipping, other.

Ticket: {ticket}
Label:"""


def classify(ticket: str) -> tuple[str, float]:
    r = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{
            "role": "user",
            "content": PROMPT.format(ticket=ticket),
        }],
        max_tokens=4,
        temperature=0,   # deterministic pick; logprobs still expose the distribution
        logprobs=True,   # logprob of each sampled token
        top_logprobs=5,  # top alternatives per position (pattern 2 uses these)
    )
    choice = r.choices[0]
    label = choice.message.content.strip().lower()
    first_token = choice.logprobs.content[0]  # the label is the first token
    prob = math.exp(first_token.logprob)      # logprob -> probability
    return label, prob

The top_logprobs=5 is the diagnostic surface, even if you only act on the chosen token's logprob. You'll need it in two minutes for pattern two.

The escalation rule is plain control flow. Pick a threshold from your eval set, then route the unsure ones somewhere a human can look.

def route(ticket: str) -> str:
    label, p = classify(ticket)
    if p < 0.65:
        # send_to_human_queue is your queueing hook; wire it to whatever
        # review surface your team already has.
        send_to_human_queue(ticket, label, p)
        return "queued_for_review"
    return label

A reasonable starting threshold is 0.65 for four-way classification, 0.55 for binary. The right number is "wherever your false-positive curve elbow lives," and you find it by sweeping thresholds against a labeled set. The point is that 0.34 refund and 0.94 refund should not get the same downstream treatment. Today they do.
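
Finding that elbow doesn't need anything fancy. A sketch of the sweep, assuming a labeled set; eval_set and its (ticket, gold_label) tuple shape are placeholders:

def sweep_thresholds(eval_set):
    # eval_set: list of (ticket_text, gold_label) pairs -- placeholder shape.
    scored = [(classify(ticket), gold) for ticket, gold in eval_set]
    for t in [x / 100 for x in range(50, 96, 5)]:
        kept = [(pred, gold) for pred, gold in scored if pred[1] >= t]
        escalated = len(scored) - len(kept)
        correct = sum(pred[0] == gold for pred, gold in kept)
        acc = correct / len(kept) if kept else 0.0
        print(f"t={t:.2f}  acc_on_kept={acc:.3f}  escalated={escalated}/{len(scored)}")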

A note on calibration. Raw probabilities from instruction-tuned models are not calibrated to true frequencies, so a probability of 0.7 does not mean 70% accuracy. Treat the threshold as a tuning parameter, and re-fit it whenever you swap models.

2. Hedging detection from token entropy

Hedging looks different from low confidence. Instead of looking at the chosen token, you look at the distribution across the top alternatives at a load-bearing position.

A model that's hedging spreads its mass. The chosen token sits at 0.41, the runner-up at 0.36, the third at 0.18. That's high entropy. The model has a guess, but it does not have a strong one.

def token_entropy(top_logprobs) -> float:
    # Shannon entropy (in nats) over the top-k alternatives, renormalized
    # so the slice sums to 1.
    probs = [math.exp(t.logprob) for t in top_logprobs]
    s = sum(probs)
    norm = [p / s for p in probs]
    return -sum(p * math.log(p) for p in norm if p > 0)

The entropy is computed over the top-k slice, not the full vocabulary. That's a known approximation: true vocabulary entropy is not exposed by any commercial API. The approximation holds when top-5 mass is above 0.95. On a well-prompted classifier, that's the common case. Spot-check with sum(probs) in a notebook before you trust it on a new prompt shape.
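
With first_token from the classify call in pattern 1, the spot-check is two lines:

probs = [math.exp(t.logprob) for t in first_token.top_logprobs]
print(sum(probs))  # want > 0.95 before trusting top-5 entropy on this prompt shape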

For four-way classification read through top_logprobs=5, the ceiling is the entropy of a uniform-over-5 distribution, ln(5) ≈ 1.61, and a model torn evenly across all four labels sits at ln(4) ≈ 1.39. Anything above 1.20 is worth flagging.

def classify_with_entropy(ticket: str):
    r = client.chat.completions.create(
        model="gpt-4o-mini",  # cheap tier; pattern 4 reuses this as the first hop
        messages=[{
            "role": "user",
            "content": PROMPT.format(ticket=ticket),
        }],
        max_tokens=4,
        temperature=0,
        logprobs=True,
        top_logprobs=5,
    )
    first = r.choices[0].logprobs.content[0]
    e = token_entropy(first.top_logprobs)
    return r.choices[0].message.content.strip(), e

The metric pairs nicely with pattern 1. Confidence below 0.65 OR entropy above 1.20 routes to review. Low confidence flags the model that has no idea. High entropy flags the model that's torn between two options. You want both.

Log both numbers per call. When one of them shifts overnight, you've caught a prompt regression before your accuracy metric notices.
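
Both numbers come from the same call, so the combined gate is a small extension of route from pattern 1. A sketch; log_signal is a placeholder for whatever structured logging you already run:

def classify_full(ticket: str):
    r = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        messages=[{"role": "user", "content": PROMPT.format(ticket=ticket)}],
        max_tokens=4,
        temperature=0,
        logprobs=True,
        top_logprobs=5,
    )
    first = r.choices[0].logprobs.content[0]
    label = r.choices[0].message.content.strip().lower()
    return label, math.exp(first.logprob), token_entropy(first.top_logprobs)


def route_v2(ticket: str) -> str:
    label, p, e = classify_full(ticket)
    log_signal(ticket, label, confidence=p, entropy=e)  # placeholder logger
    if p < 0.65 or e > 1.20:
        send_to_human_queue(ticket, label, p)
        return "queued_for_review"
    return label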

3. Input perplexity as a jailbreak signal

This one is the contrarian use case, and it requires a base or completion-style model surface that scores the input as a sequence. The setup: you score the user's input under a model and compute its perplexity. Anomalously low perplexity on a free-form query is a weak but real signal of templated jailbreak attempts and copy-pasted prompt injections. Low perplexity means the input is suspiciously close to text the model saw at training time.

The intuition. Real users write messy, personal queries. "where the heck is my package, ordered tuesday." Those have perplexity in the 30-120 range on a typical English corpus model. A jailbreak prompt that's been pasted from a Discord channel into a thousand chatbots has been seen in some form across the web. The model recognizes its rhythm. Perplexity drops below 15.

This is not a security control. It's an anomaly detector. Pair it with a real moderation pass; do not replace one.

import math


def perplexity(prompt: str, model="gpt-3.5-turbo-instruct") -> float:
    r = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=0,  # generate nothing...
        echo=True,     # ...but echo the prompt back with its logprobs
        logprobs=0,
    )
    lp = r.choices[0].logprobs.token_logprobs
    lp = [x for x in lp if x is not None]  # the first token has no logprob
    if not lp:
        return float("inf")
    avg = sum(lp) / len(lp)
    return math.exp(-avg)  # perplexity = exp(-mean log-likelihood)

The echo=True, max_tokens=0 trick lets a completion-style endpoint score the prompt itself. OpenAI keeps gpt-3.5-turbo-instruct available for this kind of use; if it goes away, swap to a Together AI base model that supports completions. The Chat Completions endpoint will not score input tokens — it only returns logprobs for output.

Pick a low threshold (perplexity below 12 on a tokenized English query is suspicious) and feed it into a wider risk score along with rate-limit signals, IP reputation, and your usual moderation pipeline. Short, common phrases like "thank you" will trip it, so the false-positive rate is too high to use this alone. Multi-sentence inputs are where it earns its slot.
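
What feeding it into a wider risk score can look like. Everything here is illustrative: the weights are made up, and rate_limit_pressure and moderation_flag_score are placeholder hooks for signals you already collect:

def risk_score(prompt: str, client_ip: str) -> float:
    # Illustrative weights only; fit them to your own incident data.
    score = 0.0
    if len(prompt.split()) > 20 and perplexity(prompt) < 12:
        score += 0.3  # multi-sentence input that reads like a template
    score += 0.4 * rate_limit_pressure(client_ip)    # placeholder hook, 0..1
    score += 0.3 * moderation_flag_score(prompt)     # placeholder hook, 0..1
    return score  # escalate above a tuned cutoff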

If perplexity scoring is not available on your stack, skip this rung. The other three carry their weight without it.

4. Entropy-based routing to a bigger model

The cost-saver of the four. You run cheap-and-fast for everything, measure entropy on the response, and re-run the high-entropy ones on a bigger model.

def classify_with_entropy_big(ticket: str):
    # Same call as classify_with_entropy, pointed at the larger model.
    r = client.chat.completions.create(
        model="gpt-4o-2024-11-20",  # big tier
        messages=[{
            "role": "user",
            "content": PROMPT.format(ticket=ticket),
        }],
        max_tokens=4,
        temperature=0,
        logprobs=True,
        top_logprobs=5,
    )
    first = r.choices[0].logprobs.content[0]
    e = token_entropy(first.top_logprobs)
    return r.choices[0].message.content.strip(), e


def two_tier(ticket: str) -> str:
    label, e = classify_with_entropy(ticket)  # cheap first pass on gpt-4o-mini
    if e < 0.8:
        return label  # confident enough; keep the cheap answer
    label2, _ = classify_with_entropy_big(ticket)  # re-run only the torn ones
    return label2

classify_with_entropy_big is the same function as classify_with_entropy, just pointed at the larger model (gpt-4o-mini in the cheap version, gpt-4o-2024-11-20 in the big one). The bigger model often resolves the ambiguity the smaller one was hedging on. When it doesn't, or when the two models disagree outright, that's the strongest signal yet that the input deserves a human; a sketch of that check follows below.
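
A minimal sketch of that escalation branch, reusing the send_to_human_queue hook from pattern 1; the big-model entropy cutoff of 1.20 is a starting guess to tune:

def two_tier_with_review(ticket: str) -> str:
    label, e = classify_with_entropy(ticket)  # cheap first pass
    if e < 0.8:
        return label
    label2, e2 = classify_with_entropy_big(ticket)
    # Both models torn, or the two tiers disagree: escalate to a human.
    if e2 > 1.20 or label2.lower() != label.lower():
        send_to_human_queue(ticket, label2, e2)
        return "queued_for_review"
    return label2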

The economics. If 12% of traffic hits the fallback at 5x the cost, your average cost is 0.88 * 1 + 0.12 * 5 = 1.48 units, against a flat 5-unit cost for routing everything to the big model. That's ~70% cheaper, and on the slice that matters you still get the better answer. The exact savings depend on your traffic mix and the price gap; calculate yours before committing.
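
The arithmetic generalizes to a two-liner you can run with your own traffic mix and price gap:

def blended_cost(fallback_rate: float, price_ratio: float) -> float:
    # Average cost in units of the cheap model's price.
    return (1 - fallback_rate) * 1.0 + fallback_rate * price_ratio

print(blended_cost(0.12, 5))  # 1.48, vs a flat 5.0 for big-model-only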

If you've got classifier or routing traffic in prod and you're sending it all to one model, you're paying for the big model on the easy 80%.

What to wire first

Pattern 1 is a one-day change with immediate signal. Pattern 2 sits on the same call and earns its keep the first time you have a prompt regression. Pattern 4 is a multi-day project but the one your finance team will notice. Pattern 3 is the "if you have time" rung. Useful, but skip if your stack doesn't support it.

Wire pattern 1 first. The rest will earn their slot when you watch the dashboards for a week.


If this was useful

Confidence signals, calibration, and the eval traces that catch silent regressions are the day-job topics in the LLM Observability Pocket Guide. It walks through how to thread logprobs into traces, how to set thresholds that survive a model upgrade, and where they fit alongside golden sets and judge models. If your team is shipping LLM features faster than you can verify them, it's the book for that gap.

