- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You run a content moderation classifier. The same fifty memes show up in your queue every hour because they are trending across every account. The same support intent ("password reset") fires from a thousand chats a day. Your CI eval suite reruns the same prompt set every time someone touches a config file. Each of those calls hits the model, pays for the tokens, and returns the same answer it returned an hour ago.
This is not what Anthropic's prompt caching is for. Prompt caching trims the cost of the input on a cache hit. You still pay for the output, and the model still runs. Useful when the prefix is long and shared across many calls with different tails. Different problem.
Response caching is the cheaper sibling. You hash the entire request: model, full prompt, tool set, sampling params. If you have seen it before, you return the stored response from Redis without calling the model at all. Zero tokens. Zero latency past the Redis round trip. The catch is that the hit only counts when the request is genuinely identical, which rules out a lot of LLM workloads. The ones it does fit well are the ones this post is about.
When the 80/20 actually shows up
The pattern works when your input distribution is heavy-tailed and your output for a given input is supposed to be deterministic. A short list:
- Classification. Intent detection, content moderation, sentiment scoring, language ID. Same input string, same answer. If the same email subject lands in your queue ten thousand times, you should pay for it once.
- Idempotent agent tools. "Summarise this PDF," "extract the JSON from this scrape." Pure functions. Caching them is the same shape as memoising any other pure function.
- Eval reruns. CI runs your eval set on every PR. The prompts have not changed; only the model version or system prompt has. Key the cache by both and you re-run only what differs.
- RAG fallbacks. The retrieved context is identical for the same question for the next five minutes. Cache the synthesis step, not the retrieval.
Where it does not earn its keep:
- Open-ended generation. "Write me a poem about the sea." Every user wants a different poem. Cache hit rate is rounding error, and you would not want a hit if you got one.
- Anything stateful. Multi-turn chat where the conversation is the input. The hash space is unbounded, and the next message will not match.
- Anything personalised. The user id is in the prompt; the cache key is per user; you bought yourself a giant Redis bill and a hit rate close to zero.
If you cannot draw the input distribution and point at a fat head, response caching will not save you money. If you can, it usually does.
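One way to check before you build anything: pull a day of request payloads out of your logs and measure how much of the traffic the most frequent inputs cover. A minimal sketch, assuming you can export the payloads as a list of strings (the log-export step is yours to fill in):

```python
from collections import Counter

def head_share(requests: list[str], top_n: int = 50) -> float:
    """Fraction of total traffic covered by the top_n most frequent inputs."""
    counts = Counter(requests)
    top = sum(count for _, count in counts.most_common(top_n))
    return top / len(requests) if requests else 0.0

# A head_share above ~0.5 is the fat head this post is about;
# a number near zero means response caching will not move your bill.
```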
The keying strategy is the whole design
The hash is the part you have to get right. Every field you include is a promise that two requests sharing those values will produce the same response; every field you leave out is a bet that it does not affect the output. Get it wrong and you serve a stale answer for a request that is not actually the same.
The fields that must be in the key:
- The model id, including the version suffix. `model-vX` and `model-vY` are not the same function, even when only the patch number moves.
- The full message list, serialised in a stable order. JSON with sorted keys; do not rely on dict insertion order across Python versions.
- The system prompt.
- The tool definitions, if you pass any. A tool change is a behaviour change.
- The sampling parameters: `temperature`, `top_p`, `max_tokens`, and `seed` if your provider supports it.
The fields that must not be in the key, or the hit rate goes to zero:
- The request id. The trace id. Anything per-call.
- The wall clock. Today's date in the system prompt, if you are tempted, will tank your hit rate.
- The user id, unless the response is genuinely user-specific (in which case you might be solving the wrong problem with this tool).
The other rule: only cache when the call is supposed to be deterministic. That means temperature=0, and ideally seed set to a fixed value. Caching at temperature=0.7 will store one of many valid answers and serve it forever. If the user sees the same "creative" answer twice, it stops feeling creative.
The 70-line implementation
Redis as the store. SHA-256 as the hash. SETNX to guard against cache stampedes. That is the case where ten parallel workers all miss the same key at the same time and all call the model. You want one of them to win the call and the rest to wait for the result.
```python
import hashlib
import json
import time
from dataclasses import dataclass
from typing import Any, Callable

import redis

r = redis.Redis(decode_responses=True)

CACHE_TTL = 24 * 60 * 60  # cached responses live for a day
LOCK_TTL = 30             # a lock lives only as long as a worst-case model call
WAIT_POLL = 0.1           # how often a waiting worker re-checks for the result
WAIT_MAX = 20.0           # how long a waiting worker polls before giving up
```
Two TTLs. The result lives a day; the lock lives only as long as a worst-case model call. Tune both to your workload.
The key builder is the load-bearing function. Stable JSON, sorted keys, every field that affects the response.
```python
def cache_key(
    model: str,
    messages: list[dict],
    system: str | None,
    tools: list[dict] | None,
    temperature: float,
    max_tokens: int,
    seed: int | None,
) -> str:
    # Only deterministic calls are cacheable; signal "do not cache" otherwise.
    if temperature != 0:
        return ""
    payload = {
        "model": model,
        "messages": messages,
        "system": system or "",
        "tools": tools or [],
        "temperature": temperature,
        "max_tokens": max_tokens,
        "seed": seed,
    }
    # Stable serialisation: sorted keys, no whitespace, so equal requests hash equally.
    raw = json.dumps(
        payload, sort_keys=True, separators=(",", ":")
    )
    digest = hashlib.sha256(raw.encode()).hexdigest()
    return f"llmcache:v1:{digest}"
```
A non-zero `temperature` returns an empty string, and the caller treats that as "do not cache." Better to skip than to lie.
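A quick sanity check of the key builder; the model id and prompt here are illustrative, not a recommendation:

```python
messages = [{"role": "user", "content": "Classify the intent: 'reset my password'"}]

k1 = cache_key("claude-sonnet-4-5", messages, system="You are an intent classifier.",
               tools=None, temperature=0.0, max_tokens=64, seed=7)
k2 = cache_key("claude-sonnet-4-5", messages, system="You are an intent classifier.",
               tools=None, temperature=0.0, max_tokens=64, seed=7)
k3 = cache_key("claude-sonnet-4-5", messages, system="You are an intent classifier.",
               tools=None, temperature=0.7, max_tokens=64, seed=7)

assert k1 == k2   # identical requests produce identical keys
assert k3 == ""   # non-zero temperature: the caller skips the cache entirely
```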
The lookup-or-call function with the SETNX gate:
```python
@dataclass
class CacheStats:
    hits: int = 0
    misses: int = 0
    waits: int = 0


def cached_call(
    key: str,
    call_model: Callable[[], dict[str, Any]],
    stats: CacheStats,
) -> dict[str, Any]:
    # No key means the request is not cacheable; call through.
    if not key:
        return call_model()

    # Path 1: a hit comes straight back from Redis.
    cached = r.get(key)
    if cached is not None:
        stats.hits += 1
        return json.loads(cached)

    # Path 2: win the lock, call the model, write the result.
    lock_key = f"{key}:lock"
    if r.set(lock_key, "1", nx=True, ex=LOCK_TTL):
        try:
            result = call_model()
            r.setex(
                key, CACHE_TTL, json.dumps(result)
            )
            stats.misses += 1
            return result
        finally:
            r.delete(lock_key)

    # Path 3: someone else holds the lock; poll for their result.
    deadline = time.time() + WAIT_MAX
    while time.time() < deadline:
        time.sleep(WAIT_POLL)
        cached = r.get(key)
        if cached is not None:
            stats.waits += 1
            return json.loads(cached)

    # The wait timed out; fall through and call the model ourselves.
    stats.misses += 1
    return call_model()
```
The function has three paths. A hit comes straight back from Redis. The winner of the lock calls the model and writes the result. Everyone else waits for that result with a deadline so a dead worker does not block forever. If the wait times out, the loser falls through and calls the model itself. That is cheaper than stalling every caller behind a single dead lock.
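A quick way to see those paths in action, assuming a Redis instance is reachable with the defaults used above; `fake_model_call` is a stub standing in for the SDK call:

```python
def fake_model_call():
    # Stands in for the real SDK call while exercising the cache paths.
    return {"text": "spam", "usage": {"input": 12, "output": 1}}

stats = CacheStats()
key = "llmcache:v1:smoketest"
r.delete(key)  # start clean so the first call is a genuine miss

first = cached_call(key, fake_model_call, stats)   # miss: runs the stub, writes Redis
second = cached_call(key, fake_model_call, stats)  # hit: served straight from Redis
assert first == second
assert stats.misses == 1 and stats.hits == 1
```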
The wrapper for the actual SDK call:
```python
def model_call_factory(client, **req):
    def call():
        resp = client.messages.create(**req)
        # Store only the fields the rest of the code reads, not the whole SDK object.
        return {
            "text": resp.content[0].text,
            "usage": {
                "input": resp.usage.input_tokens,
                "output": resp.usage.output_tokens,
            },
            "model": resp.model,
            "stop_reason": resp.stop_reason,
        }
    return call
```
You serialise only what you need. Storing the entire SDK response object is tempting and a bad idea. The schema changes between SDK versions, and you do not want a deserialisation crash on a cache hit six months from now. Pick the fields your code actually reads.
Putting it together:
```python
STATS = CacheStats()  # one shared counter; swap for your metrics client if you have one


def ask(client, **req) -> dict[str, Any]:
    key = cache_key(
        model=req["model"],
        messages=req["messages"],
        system=req.get("system"),
        tools=req.get("tools"),
        temperature=req.get("temperature", 1.0),
        max_tokens=req["max_tokens"],
        seed=req.get("seed"),
    )
    return cached_call(
        key, model_call_factory(client, **req), STATS
    )
```
Seventy lines. One Redis dependency. A hit rate you can read off STATS and graph next to your model spend.
The cost math, with hedges
Pricing changes; the Anthropic pricing page is the only source of truth, and rates move enough that any number a blog post hard-codes will be wrong by the time you read it. Re-check before you forecast.
What does not change is the algebra. If your raw-call cost is C per request and your cache hit rate is h, your effective cost per logical request is C * (1 - h) + redis_cost. Redis is cheap enough on the call-path that you can treat it as a rounding error against a model call. So a 60% hit rate gives you 40% of the bill. An 80% hit rate gives you 20% of the bill. The fat head of a classifier's input distribution is what makes the second number realistic for some workloads; chat workloads are open-ended enough that the head never gets fat, so 80% stays unreachable.
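The same algebra as a function, so you can plug in your own numbers once you have looked up current pricing; the unit cost below is a placeholder, not a rate:

```python
def effective_cost(raw_cost_per_call: float, hit_rate: float,
                   redis_cost_per_call: float = 0.0) -> float:
    """Cost per logical request once a share of calls is served from cache."""
    return raw_cost_per_call * (1 - hit_rate) + redis_cost_per_call

# 1.0 unit per raw call at a 60% hit rate -> 0.4 units; at 80% -> 0.2 units.
# Substitute your real per-call cost from the provider's pricing page.
```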
The honest accounting still has to subtract the cost of being wrong. Every cache hit is one missed opportunity for the model to give a fresh answer. If your model gets better between the cache write and the cache read, your users see the old answer. The TTL is your knob: a one-day TTL on a moderation cache is fine, but on a "what is the latest news" cache it ships yesterday's headlines and quietly tanks user trust.
What to instrument
Three numbers, on a dashboard, before you call it shipped:
- Hit rate. `hits / (hits + misses)` over a rolling window. If it is below 30%, the cache is paying for itself but barely; below 10% you are running Redis for sport.
- Stampede rate. `waits / misses`. A healthy number is small: a few percent on a hot key. If it climbs, your `LOCK_TTL` is too long, or your model call is timing out under the lock and leaving a stale lock in place.
- Eviction or expiry rate. How often you write a key that gets evicted before any read. High eviction means your TTL is too long for your Redis size; the cache is doing more work than it is paying back.
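One way these might be computed from `CacheStats` plus Redis's own INFO counters; note the expiry and eviction numbers from INFO are server-wide, so they only approximate the rate for this key prefix:

```python
def report_cache_metrics(stats: CacheStats) -> dict[str, float]:
    total = stats.hits + stats.misses
    hit_rate = stats.hits / total if total else 0.0
    stampede_rate = stats.waits / stats.misses if stats.misses else 0.0

    info = r.info("stats")  # server-wide counters, not per-prefix
    return {
        "hit_rate": hit_rate,
        "stampede_rate": stampede_rate,
        "expired_keys": float(info.get("expired_keys", 0)),
        "evicted_keys": float(info.get("evicted_keys", 0)),
    }

# Ship this dict to whatever dashboard you already run, on a timer or per batch.
```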
The LLM Observability Pocket Guide has a chapter on this exact dashboard — what to attach to your traces so a hit-rate regression shows up before the bill does.
What to try on Monday
Pick one workload from the list at the top of the post — the moderation classifier, the intent router, the eval rerun job — and wrap its model call in cached_call. Ship the three numbers from the previous section to whatever dashboard you already use. If the hit rate climbs above 30% in a day, write an RFC for the rest of the team. If it does not, the request was never deterministic in the first place, and your next move is Anthropic's prompt caching or a smaller model rather than this pattern.
If this was useful
The LLM Observability Pocket Guide covers the rest of the cost-and-cache stack: how to key requests so the hit rate is real, how to wire stampede protection without leaving stale locks, and which signals to attach to your traces so a 60% hit rate does not quietly drift to 6% the week your prompt template changes.
