Ravi Patel

Posted on Jun 9 • Originally published at ssimplifi.com

Prompt cache fingerprinting pitfalls: the discipline that makes exact-match caching actually hit

#ai #caching #fingerprinting #llminfrastructure

The promised hit rate of an exact-match LLM cache is 5-15% on real production traffic. Most teams that deploy one see hit rates near zero for the first few weeks and assume caching doesn't work for their workload. It almost always works; the cache is just being defeated by trivial request variations that fingerprint differently even though they should hit the same key. This post is the discipline that closes that gap — the seven normalisation pitfalls that break naive cache implementations, with the fix patterns that hold up under production traffic.

The parent guide on AI API caching covers the cache layers and economics; this article goes one level deeper into the fingerprinting discipline that makes Layer 1 (exact-match) actually work.

What fingerprinting is supposed to do

An exact-match cache stores responses keyed by a deterministic identifier — almost always a SHA-256 hash over a canonicalised representation of the request. When a new request arrives, you compute the same hash; if the key exists, return the cached response. The cache is provably correct because the fingerprint guarantees byte-equivalence at the input.

The fingerprint is supposed to capture everything that affects the response and exclude everything that doesn't. The two boundaries are where most teams get into trouble. Including too little misses real cache hits; including too much misses cache hits that should land. Including the wrong things (timestamps, request IDs, user metadata) splits the cache into shards of one entry each.

The first principle: two requests that would produce the same response should fingerprint to the same hash. Everything below is in service of that single rule.

Pitfall 1 — Non-deterministic JSON serialisation

The most common bug. Python's json.dumps doesn't guarantee field ordering by default. JavaScript's JSON.stringify orders object keys by insertion order, which depends on how the object was constructed. Two requests with identical content but different field-insertion order serialise to different strings and hash to different keys.

# Two semantically-identical requests
req_a = {"model": "claude-sonnet", "messages": [...], "temperature": 0.7}
req_b = {"temperature": 0.7, "messages": [...], "model": "claude-sonnet"}

# Naive serialisation hashes them differently
hash_a = hashlib.sha256(json.dumps(req_a).encode()).hexdigest()
hash_b = hashlib.sha256(json.dumps(req_b).encode()).hexdigest()
assert hash_a != hash_b  # cache miss on what should be a hit

The fix: always pass sort_keys=True to json.dumps. In JavaScript use a canonical-JSON library or explicitly sort keys before stringifying. Treat this as non-negotiable across every codepath that computes a cache fingerprint.

canonical = json.dumps(req, sort_keys=True, separators=(",", ":"))
fingerprint = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

The separators argument removes any whitespace inserted by default — another source of inconsistency between Python versions and serialiser configurations.

Pitfall 2 — Optional fields appearing inconsistently

Most LLM SDK clients send only the fields the caller explicitly set. The OpenAI SDK doesn't include temperature: 1.0 if the caller doesn't pass it, even though 1.0 is the implicit default. One request has {"model": "gpt-5-4", "messages": [...]}; another has {"model": "gpt-5-4", "messages": [...], "temperature": 1.0}. Same effective request to the model; different fingerprints.

req_a = {"model": "gpt-5-4", "messages": [...]}                    # temperature omitted
req_b = {"model": "gpt-5-4", "messages": [...], "temperature": 1.0} # temperature explicit
# Both produce the same model output, but they hash differently

The fix: before fingerprinting, normalise to a canonical form by applying defaults for every relevant field. If temperature is unset, set it to 1.0. If top_p is unset, set it to 1.0. If max_tokens is unset, set it to your default (commonly 4096 in OpenAI, varies per provider). The fingerprint runs against the post-normalisation request.

Document the defaults table somewhere visible — the discipline is fragile when defaults are spread across multiple files.

Pitfall 3 — Including non-functional fields

OpenAI requests can include a user field for abuse-detection. Some applications attach a metadata object with internal tracking data. Many libraries auto-inject a request ID or timestamp. None of these change the model's output, but if any of them land in the fingerprint, every request fingerprints uniquely and the cache hit rate collapses to zero.

# Hash includes a request ID. Every request is unique.
req = {
    "model": "claude-sonnet",
    "messages": [...],
    "_request_id": "req_abc123xyz",  # caller-supplied tracking
    "user": "user_456",
    "metadata": {"trace_id": "..."},
}
# Same prompt re-issued five times → five different fingerprints → 0% hit rate

The fix: maintain an explicit allowlist of fields that go into the fingerprint. Anything not on the allowlist is excluded. The allowlist for chat completions typically looks like:

FINGERPRINT_FIELDS = {
    "model", "messages", "temperature", "top_p", "max_tokens",
    "stop", "tools", "tool_choice", "response_format",
    # NOT: user, _request_id, metadata, idempotency_key, stream
}

Allowlist over denylist. Denylists are fragile — a new SDK version adds a metadata field you didn't anticipate, suddenly the cache splits. Allowlists fail closed (new fields are ignored until you explicitly add them).

Pitfall 4 — Tools array unordered

Function-calling / tool-use requests include a tools array. The model doesn't care about the order of tools — [A, B] and [B, A] produce the same model behaviour because the model sees the full toolset regardless. But the JSON serialisation differs by order, so the fingerprints differ.

req_a = {"messages": [...], "tools": [tool_search, tool_calculator]}
req_b = {"messages": [...], "tools": [tool_calculator, tool_search]}
# Same effective request; different fingerprints

The fix: before fingerprinting, sort the tools array by tool name (or by a canonical identifier per tool). Same applies to the stop array — if stop is a list of strings, sort it. Anything that's a set-shaped data structure but represented as an array needs deterministic ordering.

def canonicalise(req: dict) -> dict:
    out = dict(req)
    if "tools" in out:
        out["tools"] = sorted(out["tools"], key=lambda t: t["function"]["name"])
    if isinstance(out.get("stop"), list):
        out["stop"] = sorted(out["stop"])
    return out

Pitfall 5 — Streaming flag included in the fingerprint

A common subtle bug. The stream parameter doesn't change the model's content — the same prompt produces the same tokens whether you stream them or buffer them into a single response. If the fingerprint includes stream, every streaming call hashes differently from every non-streaming call, and the cache splits into a streaming half and a non-streaming half. Half-empty halves mean half the hit rate.

req_streaming = {"messages": [...], "stream": True}
req_buffered = {"messages": [...], "stream": False}
# Same content; should hash the same

The fix: exclude stream from the fingerprint. Always serve cached responses as non-streaming JSON regardless of the request's stream flag — serving a fake stream from a cache is operationally messy. Same rule applies to stream_options and similar streaming-control fields.

This also fixes a related bug: serving a cached response as a stream that was originally captured as a stream means storing the full SSE event log per cache entry, which bloats storage 2-3x for no benefit.

Pitfall 6 — Whitespace and trailing newlines

Real production traffic accumulates trailing whitespace in user messages. A frontend that does userInput.trim() strips it; another that does userInput leaves it. Same intent; different bytes; different fingerprints. Same applies to "internal whitespace" — "the quick fox" vs "the quick fox" look the same to a human but differ at the byte level.

The judgment call: do you treat whitespace as semantically meaningful? For most LLM workloads it isn't — the model produces the same response to "hello world\n\n\n" as to "hello world". Aggressive normalisation collapses trailing whitespace + collapses runs of internal whitespace to single spaces.

import re

def normalise_message_content(content: str) -> str:
    # Strip leading/trailing whitespace
    content = content.strip()
    # Collapse internal runs of whitespace to single spaces
    content = re.sub(r"\s+", " ", content)
    return content

The trade: workloads where whitespace IS semantic (code generation where indentation matters, formatted-output tasks) need the conservative version (no normalisation) or a per-task-type setting. The right default depends on your workload mix.

For most teams, normalising whitespace lifts hit rate by 2-5 percentage points. The risk is on the code-generation and structured-output slice; that's where to validate.

Pitfall 7 — Extension fields leaking into the hash

Some gateways and SDKs attach extension fields to requests for their own purposes. Prism uses a _prism_cache_control marker in some scenarios; LangChain attaches _lc_serialized payloads when serialising chains; vendor-specific SDKs sometimes inject _anthropic_metadata or similar.

If these extensions don't affect the upstream model call, they don't belong in the fingerprint. If they do affect it (a cache_control block telling the provider to engage prompt caching, for instance), they affect billing but not the response content — still arguably shouldn't be in the fingerprint, since the response is identical with or without it.

The fix: filter extension fields (typically anything prefixed with _) before fingerprinting. Same canonicalisation pass that handles tool-sorting and default-application:

def canonicalise(req: dict) -> dict:
    out = {k: v for k, v in req.items() if not k.startswith("_")}
    # ... rest of normalisation
    return out

For nested structures (the _prism_cache_control marker on a message, for instance), apply the same filter recursively.

The composed canonicaliser

The pattern that holds up in production puts all of the normalisations in one place:

def fingerprint_request(req: dict) -> str:
    canonical = canonicalise(req)
    serialised = json.dumps(canonical, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(serialised.encode("utf-8")).hexdigest()

def canonicalise(req: dict) -> dict:
    # 1. Allowlist filter
    out = {k: v for k, v in req.items() if k in FINGERPRINT_FIELDS}

    # 2. Apply defaults
    out.setdefault("temperature", 1.0)
    out.setdefault("top_p", 1.0)
    # max_tokens default varies by use case; pick one and document it

    # 3. Canonicalise messages
    out["messages"] = [
        {"role": m["role"], "content": normalise_message_content(m["content"])}
        for m in out.get("messages", [])
    ]

    # 4. Sort set-shaped arrays
    if "tools" in out:
        out["tools"] = sorted(out["tools"], key=lambda t: t["function"]["name"])
    if isinstance(out.get("stop"), list):
        out["stop"] = sorted(out["stop"])

    return out

One function, one place to update, deterministic across all codepaths that ever hash a request. The discipline is the abstraction: every cache write and every cache lookup goes through fingerprint_request. If two callers don't share the same function, they don't share the same cache.

How to know it's working

The signature of correct fingerprinting in production:

Hit rate climbs from near-zero to the 5-15% expected range within a few days of cache warm-up. Workloads with deterministic patterns (cron, evaluation runs) climb fastest.
Per-fingerprint storage doesn't grow unboundedly. If your cache is storing one entry per request — total cache size grows linearly with traffic — the fingerprint is over-specific.
Cache hits never return the wrong response. If they do, the fingerprint is under-specific (something that affects the response isn't in the hash, so different responses share a key). Sample-validate by hand on the first day post-deployment.
Stable across SDK upgrades. If an OpenAI SDK upgrade changes default behaviour and the cache hit rate drops, your canonicaliser missed a new default. Audit and fix.

The discipline pays off because it's the difference between a cache that pays for itself in a week and a cache that's overhead with no return. Most production teams that "tried caching and it didn't work" hit one of the seven pitfalls above. The fixes are mechanical; the result is the hit rate that the literature promises.

How Prism handles it

Prism's services/cache.py runs the canonicalisation above on every request — allowlisted fields, sorted tools, normalised whitespace, stripped extensions, default-applied parameters. The fingerprint runs against the canonicalised request, so cache writes and lookups stay aligned across SDK quirks and customer-side variation.

The discipline that bit us during the v1.1 cache build (and motivated this article): the _prism_cache_control extension marker was originally included in the fingerprint, which split the cache between requests that had it and requests that didn't. The fix was a one-line filter in the canonicaliser; the recovery in hit rate was about 4 percentage points. Small bug, real impact — exactly the shape these pitfalls take in real systems.

For the full caching framework, see the parent AI API caching guide. For the related glossary entry, see cache fingerprinting. For when to combine exact-match with semantic + provider-native passthrough, see exact vs semantic caching.

FAQ

Should I hash the system prompt as part of the fingerprint?

Yes — the system prompt changes the model's response, so it has to be part of the fingerprint. Two requests with different system prompts but identical user messages should fingerprint differently (and they do, since messages[0] differs). The only edge case is when an application transforms system prompts dynamically per request (e.g. injecting a timestamp), which makes the system prompt look "stable" semantically but byte-different. Either lift the dynamic content out of the system prompt or accept the lower hit rate.

What about idempotency keys?

Idempotency keys are caller-supplied metadata for the caller's own deduplication; they don't affect the model response. Exclude from the fingerprint via the allowlist. The cache layer is itself an idempotency mechanism — same fingerprint, same response, by definition.

Does the user field in OpenAI's API need to be in the fingerprint?

No. The user field is metadata for OpenAI's abuse-detection systems; it doesn't change the response content. Exclude.

Does the model's seed parameter belong in the fingerprint?

Yes if you set it. The seed is used to make outputs more reproducible across requests; different seeds with the same prompt can produce different responses. Include in the allowlist.

What if my application uses chat history that varies session-by-session?

The fingerprint captures the full messages array, so two requests with different chat histories fingerprint differently — by design. The implication is that exact-match cache hits on multi-turn chat are rare (the conversation state is unique to the user/session). Semantic caching catches more of this slice; see exact vs semantic caching.

How do I migrate an existing cache when I fix a fingerprinting bug?

You usually don't. The old entries become unreachable (the new fingerprint computes differently), and they age out via TTL. Cache turnover is fast enough that the transition is invisible within a day or two. If TTL is long enough that stale entries linger, do a one-shot purge after the fingerprinting fix lands.

The discipline above is what turns the literature's "5-15% exact-match hit rate" into actual production reality. The AI API caching guide covers the full layered strategy; the savings calculator lets you model what hit rate translates to in dollars on your workload.

DEV Community