Ravi Patel

Posted on Jun 13 • Originally published at ssimplifi.com

Cache invalidation strategies for LLM APIs: TTL, prompt-version, semantic threshold

#llmcache #cacheinvalidation #ttl #promptversioning

Phil Karlton's famous line — "There are only two hard things in computer science: cache invalidation and naming things" — applies to LLM caches with extra weight, because the consequence of a stale response isn't just a wrong number on a screen. It's a customer being told something untrue with confidence, attributed to your product. Production LLM cache invalidation rests on four strategies stacked together: TTL by workload class (the cheap default), prompt-version keying (the model-update story), semantic-threshold tuning (the false-positive control), and explicit purge (the emergency lever). This post walks through each strategy, when it applies, and the trade-offs that make some combinations work better than others. Written for engineers operating LLM caches in production, not for theorists.

The parent guide AI API caching covers the cache layers and the economics. This article goes one level into the invalidation discipline that keeps those caches from poisoning your application.

What we're actually trying to prevent

Three failure modes that invalidation has to handle:

1. Stale-but-still-true content. Your refund policy changed on Tuesday. The cache holds Monday's answer. Users asking about the policy on Wednesday still get Monday's answer. Not catastrophically wrong, but materially stale.

2. Stale-and-now-false content. Your support bot cached a response that referenced an integration that no longer exists. Users follow the cached instructions, hit broken endpoints, blame the product. Worse than stale — actively misleading.

3. False-positive semantic matches. Two semantically-distinct prompts embed close enough to cross the cache threshold, and the cache serves Response A to a user who asked Question B. The wrong-answer-confidently failure mode, particularly insidious because the cache never realises it's wrong.

The four strategies below address these failure modes in overlapping ways. Most production deployments use all four; the question is where each one fits.

Strategy 1 — TTL by workload class

The cheap default, and the right answer for the majority of LLM cache entries.

The mechanic: every cache write attaches an expiration timestamp. Reads after the expiration miss and re-populate from the model. The TTL value is the dial — short for time-sensitive content, long for stable content.

The discipline that matters: TTL by workload class, not a single global default. Different traffic shapes need different freshness guarantees.

Workload class	Suggested TTL	Reasoning
Real-time / market-sensitive (pricing, stock data, current events)	1–5 minutes	Truly time-sensitive; cache hits past this window risk being stale-and-wrong
User-personalised (session-context-heavy)	15 minutes – 1 hour	Personalisation invalidates faster than knowledge content; bounded session staleness is the right framing
Operational chat (support FAQ, help content)	1–6 hours	Source-of-truth content (the FAQ itself) updates on a slow cadence; cache can outlast individual conversations
Documentation / knowledge base	6–24 hours	Reference content that updates daily at most; cache amortises substantial volume per write
Stable definitional content (glossary, terminology)	24–72 hours	Updates measured in weeks; cache turnover is the dominant cost-saver

The pattern that holds up in production: per-project or per-feature TTL configuration. A single project's "answer support questions" feature gets a 4-hour TTL; the same project's "fetch live order status" feature gets a 60-second TTL. The cache backend (Redis) supports per-key TTL natively; the configuration lives in the application layer.
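A minimal sketch of per-feature TTL configuration, assuming an in-memory dictionary stands in for the cache backend. The feature names, `TTL_SECONDS`, and `TTLCache` are illustrative and not part of any real API; the TTL values are the ones quoted above.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Illustrative per-feature TTLs in seconds; the feature names are hypothetical.
TTL_SECONDS: Dict[str, int] = {
    "support_answers": 4 * 60 * 60,  # 4 hours
    "live_order_status": 60,         # 60 seconds
}
DEFAULT_TTL_SECONDS = 60 * 60  # conservative 1-hour default


class TTLCache:
    """In-memory stand-in for a backend with per-key expiry (e.g. Redis)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def set(self, feature: str, key: str, value: Any) -> None:
        # The TTL is chosen by the application layer, per feature.
        ttl = TTL_SECONDS.get(feature, DEFAULT_TTL_SECONDS)
        self._entries[key] = (self._clock() + ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: treat as a miss
            return None
        return value
```

The clock is injectable so expiry can be exercised in tests without sleeping. Against a real Redis backend, the same lookup table would feed the per-key expiry argument on write.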

Default TTL is rarely the right answer for any specific workload, but it's the right answer for the first workload. Ship with a conservative default (1 hour is the Prism default), then tune per-workload as patterns emerge.

Strategy 2 — Prompt-version keying

The strategy that addresses model + prompt updates without an explicit purge step.

The problem: you ship a new system prompt on Tuesday. The cache holds responses generated against the old system prompt. New requests dispatch with the new system prompt; if the cache fingerprint doesn't include the system prompt, they hit stale entries and the model's behaviour change is invisible.

The mechanic: the cache fingerprint (covered in prompt cache fingerprinting pitfalls) includes the system prompt by default. When the system prompt changes, the fingerprint changes; the new requests miss the old cache and populate new entries. Old entries age out via TTL.

This is "implicit versioning" — the prompt content is the version key. It works perfectly for the common case (system prompt changes mean cache turnover).

The edge case where this fails: the system prompt is templated with stable structure but variable injected content (e.g. a user-specific name, a current timestamp). The fingerprint changes per request even when the underlying behaviour is identical. Hit rate collapses.
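Both behaviours can be seen in a small sketch. It assumes the fingerprint is a SHA-256 over the JSON-serialised request; `fingerprint`, `build_request`, and the model name are hypothetical stand-ins, not the real fingerprinting code.

```python
import hashlib
import json


def fingerprint(request: dict) -> str:
    """SHA-256 over the JSON-serialised request, system prompt included."""
    payload = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_request(system_prompt: str, user_prompt: str) -> dict:
    return {
        "model": "example-model",  # hypothetical model name
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }


question = "How do I reset my password?"

# Case 1: the system prompt itself changes, so the key changes (desired turnover).
old = fingerprint(build_request("You are a support assistant.", question))
new = fingerprint(build_request("You are a concise support assistant.", question))
prompt_change_invalidates = old != new

# Case 2: same template, different injected name, so the key also changes
# even though the behaviour is identical (this is the hit-rate collapse).
template = "You are a support assistant. The user's name is {name}."
alice = fingerprint(build_request(template.format(name="Alice"), question))
bob = fingerprint(build_request(template.format(name="Bob"), question))
personalisation_splits_cache = alice != bob
```

In the second case every user gets a private cache slice for the same question, which is why the hit rate falls.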

The fix for the edge case: explicit prompt versioning. Instead of including the full system prompt in the fingerprint, include a version identifier:

import hashlib
import json

def fingerprint_with_version(request: dict, prompt_version: str) -> str:
    # canonicalise() is the request-normalisation helper from the fingerprinting
    # step; it is assumed to return a fresh copy so the caller's request is not mutated
    canonical = canonicalise(request)
    # Replace the system message content with the version identifier
    # so the cache fingerprint is stable across personalised system prompts
    messages = canonical.get("messages") or []
    if messages and messages[0].get("role") == "system":
        messages[0] = {"role": "system", "content": f"__v={prompt_version}"}
    return hashlib.sha256(json.dumps(canonical, sort_keys=True).encode()).hexdigest()

When the underlying system prompt template changes, increment the version. The cache fingerprint moves; old entries age out via TTL; new entries populate against the new version.

The discipline: the version identifier has to actually update when the prompt template updates. Tie it to a build-time constant (e.g. a git commit hash for the prompt-templates file) so it can't drift silently.
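One way to get a version identifier that cannot drift is to derive it from the template content itself, which is the same idea as keying on the git commit hash of the prompt-templates file. A minimal sketch; the function names and the 12-character truncation are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def prompt_version_from_text(template_text: str, length: int = 12) -> str:
    """Derive a short, deterministic version id from the template content."""
    # Normalise line endings so the same template hashes identically across platforms.
    normalised = template_text.replace("\r\n", "\n")
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:length]


def prompt_version_from_file(path: str) -> str:
    """Version a prompt-templates file; the path is a hypothetical build input."""
    return prompt_version_from_text(Path(path).read_text(encoding="utf-8"))
```

Computing this once at build time and passing the result as `prompt_version` means any edit to the template moves the fingerprint without a manual bump.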

Strategy 3 — Semantic threshold tuning

The invalidation strategy for the Layer 2 (semantic) cache specifically.

The mechanic: the semantic cache returns a hit when cosine similarity between the new request's embedding and a stored embedding exceeds a threshold. Threshold is the dial that controls the trade between hit rate and false-positive rate. Higher threshold = fewer hits but higher correctness; lower threshold = more hits with more risk of returning the wrong response.

Why this is an invalidation strategy: raising the threshold invalidates (in the sense of "no longer serves from") entries that would have matched at the old threshold. It doesn't delete the entries; it just stops returning them. Cache hit rate drops; correctness rises.
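A toy sketch of the lookup rule, assuming a linear scan over stored embeddings in place of a real vector index. `semantic_lookup` and the two-dimensional vectors are illustrative only, and the comparison here treats a score equal to the threshold as a hit.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(
    query: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float = 0.95,
) -> Optional[str]:
    """Return the best-matching cached response, or None if it falls below threshold."""
    best_score, best_response = -1.0, None
    for embedding, response in entries:
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None


entries = [([1.0, 0.0], "Response A"), ([0.0, 1.0], "Response B")]
query = [1.0, 0.3]  # cosine similarity to the first entry is about 0.958

served_at_095 = semantic_lookup(query, entries, threshold=0.95)
served_at_097 = semantic_lookup(query, entries, threshold=0.97)
```

Raising the threshold from 0.95 to 0.97 turns the same query from a hit into a miss while `entries` is left untouched, which is the sense in which the threshold invalidates.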

The discipline: sampled validation per threshold setting.

Run the cache at threshold T (start at 0.95)
Periodically pull 100 random hits
Have a human (or a stronger LLM-as-judge) verify whether the cached response was appropriate to the new prompt
Compute false-positive rate
If FP rate <2% → consider lowering threshold to recover more hits
If FP rate >5% → raise threshold

This is the active form of invalidation. The cache entries don't go away; you change the rule for when they count.
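The validation loop can be sketched as follows, assuming each sampled hit has already been judged appropriate or not by a human or an LLM judge. The function names are hypothetical, and the "hold" outcome for rates between 2% and 5% is an interpretation of the gap between the two rules, not something the rules state.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_hits(hits: Sequence[T], k: int = 100, seed: Optional[int] = None) -> List[T]:
    """Pull up to k random cache hits for review."""
    rng = random.Random(seed)
    return rng.sample(list(hits), min(k, len(hits)))


def false_positive_rate(judgements: Sequence[bool]) -> float:
    """judgements[i] is True when sampled hit i was appropriate for its prompt."""
    if not judgements:
        raise ValueError("need at least one judged hit")
    return sum(1 for ok in judgements if not ok) / len(judgements)


def recommend(fp_rate: float, low: float = 0.02, high: float = 0.05) -> str:
    if fp_rate > high:
        return "raise"
    if fp_rate < low:
        return "consider_lowering"
    return "hold"  # assumed behaviour for the band between the two rules
```

With only 100 samples the estimate is noisy, so a single pass near either boundary is weak evidence on its own.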

The threshold doesn't have to be global. Per-project threshold (Prism's X-Prism-Cache-Threshold header on Pro+) lets narrow-domain workloads run aggressive thresholds (e.g. 0.92 for a single-product FAQ chatbot) while broad-domain workloads stay conservative (0.96+ for general-purpose chat).

The deeper threshold tuning discipline is covered in exact vs semantic caching for LLMs.

Strategy 4 — Explicit purge

The emergency lever. Used when something material changes and you don't want to wait for TTL to roll over.

Scenarios where explicit purge is the right call:

The source-of-truth content changed mid-day and you want the cache to reflect it immediately
A bad prompt deployment populated the cache with wrong answers; you want to flush before users notice
A user reported a wrong cached response; you want to evict the specific entry before further hits
Regulatory or compliance reason for clearing customer-specific data on demand (GDPR right-to-deletion adjacent)

The mechanic: the cache backend supports key-pattern-based deletion. Redis: SCAN for matching keys + DEL. Vector DBs: namespace-scoped delete or per-vector delete.
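For the Redis case, a minimal sketch of pattern-based deletion. It assumes `client` behaves like a redis-py client, using only its `scan_iter` and `delete` methods; the function name and batch size are illustrative.

```python
from typing import Any, List


def purge_by_pattern(client: Any, pattern: str, batch_size: int = 500) -> int:
    """Delete every key matching `pattern`; returns the number of keys removed."""
    deleted = 0
    batch: List[Any] = []
    # scan_iter walks the keyspace incrementally with SCAN rather than
    # the blocking KEYS command.
    for key in client.scan_iter(match=pattern, count=batch_size):
        batch.append(key)
        if len(batch) >= batch_size:
            deleted += client.delete(*batch)
            batch.clear()
    if batch:
        deleted += client.delete(*batch)
    return deleted
```

Deleting in batches keeps each round trip small. A call might look like `purge_by_pattern(client, "llmcache:acme:*")`, where the key prefix is a hypothetical naming scheme.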

Production patterns:

Tag-based eviction. Cache entries carry tags at write time (e.g. feature=support-bot, project=acme). The eviction operation purges all entries matching a tag. Cleanest for "purge everything for Project Acme" or "purge everything for the support-bot feature."
Per-fingerprint eviction. Single-entry delete by cache key. Useful for "this one cached response was wrong; remove it."
Wholesale flush. FLUSHDB or equivalent. Nuclear option; rarely the right answer in production because you lose the warming benefit of every entry, not just the bad ones.

The trade with explicit purge: it requires application-layer awareness of what to purge. Tag the entries at write-time; key the purge by the same tags. Most production deployments use explicit purge sparingly because TTL handles most cases automatically.
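Tagging at write time and purging by the same tags can be sketched with an in-memory index. `TaggedCache` is a hypothetical illustration of the bookkeeping, not a specific backend's API.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Optional, Set


class TaggedCache:
    """Cache with a reverse index from tag to keys, for per-tag eviction."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._tags_by_key: Dict[str, Set[str]] = {}
        self._keys_by_tag: Dict[str, Set[str]] = defaultdict(set)

    def set(self, key: str, value: Any, tags: Iterable[str] = ()) -> None:
        self.delete(key)  # drop stale tag links if the key is being overwritten
        self._values[key] = value
        self._tags_by_key[key] = set(tags)
        for tag in self._tags_by_key[key]:
            self._keys_by_tag[tag].add(key)

    def get(self, key: str) -> Optional[Any]:
        return self._values.get(key)

    def delete(self, key: str) -> bool:
        """Per-fingerprint eviction: remove a single entry by key."""
        if key not in self._values:
            return False
        del self._values[key]
        for tag in self._tags_by_key.pop(key):
            self._keys_by_tag[tag].discard(key)
        return True

    def purge_tag(self, tag: str) -> int:
        """Tag-based eviction: remove every entry carrying `tag`."""
        keys = list(self._keys_by_tag.get(tag, ()))
        removed = sum(1 for key in keys if self.delete(key))
        self._keys_by_tag.pop(tag, None)
        return removed
```

The reverse index is what makes a per-tag purge cheap: the cost scales with the number of tagged entries rather than with the size of the whole cache.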

How the four strategies stack

Each strategy addresses a different failure mode; production deployments run all four together.

Failure mode	Primary strategy	Secondary
Stale-but-still-true (source content updated on a known cadence)	TTL by workload class	Explicit purge if cadence is uncertain
Stale-and-now-false (deployment changed prompt or model behaviour)	Prompt-version keying	TTL as the backstop
False-positive semantic match (cache returns wrong content)	Semantic threshold tuning	Explicit purge of the specific bad entry
Compliance-driven deletion (GDPR, user data removal)	Explicit purge by tag	n/a

The pattern: TTL is the default; prompt-version handles model/prompt changes; threshold-tuning handles the false-positive risk specific to semantic; explicit purge handles emergencies and the compliance edge cases. Skipping any one of them creates a gap.

Two anti-patterns that look like invalidation but aren't

1. "Set TTL to 5 minutes everywhere." Defensible-sounding, harmful in practice. A 5-minute TTL means most cache entries never get a second hit (most requests aren't repeated within 5 minutes), so the cache hit rate collapses to near-zero. The cost of caching infrastructure is constant; the savings drop proportionally. Net result: paying for cache infra without getting the savings. Default TTL should match workload class, not anxiety.

2. "Purge the cache on every deployment." Common in startup deployments because it feels safe. The downside is that every deploy invalidates a fully-warmed cache; the hit rate goes to zero and takes hours-to-days to recover. If your prompts didn't change in the deploy, the cache was still correct. Use prompt-version keying instead, so entries turn over only when the prompts change, not on every git push.

Both anti-patterns substitute apparent safety for actual cache effectiveness. The four strategies above let you have both.
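The effect of a too-short TTL can be put in rough numbers with a back-of-envelope model. It assumes each distinct prompt arrives as a Poisson process and that the TTL is fixed at write time rather than refreshed on reads. Under those assumptions each entry costs one miss and then serves an expected rate × TTL hits before expiring, so the steady-state hit rate is (rate × TTL) / (1 + rate × TTL). The example arrival rate is illustrative, not measured.

```python
def expected_hit_rate(requests_per_hour: float, ttl_seconds: float) -> float:
    """Steady-state hit rate for one prompt under Poisson arrivals and a fixed TTL."""
    if requests_per_hour < 0 or ttl_seconds < 0:
        raise ValueError("rate and TTL must be non-negative")
    expected_hits_per_entry = requests_per_hour * ttl_seconds / 3600.0
    return expected_hits_per_entry / (1.0 + expected_hits_per_entry)


# A prompt that recurs on average once every two hours (0.5 requests/hour):
five_minute_ttl = expected_hit_rate(0.5, 5 * 60)     # about 0.04
four_hour_ttl = expected_hit_rate(0.5, 4 * 60 * 60)  # about 0.67
```

For this hypothetical prompt, moving from a 5-minute to a 4-hour TTL lifts the modelled hit rate from roughly 4% to roughly 67%. Real traffic is burstier than Poisson, so treat the output as a direction, not a forecast.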

Operational discipline

The patterns that hold up over time:

Per-key cache observability. Every cache entry should be inspectable: key, value, write timestamp, access count, last access. Prism's /dashboard/cache inspector surfaces this for Pro+ accounts. Without per-key visibility, debugging "why is the cache returning that?" is guesswork.

False-positive sampling cadence. Run a sampled validation pass on Layer 2 hits weekly during initial deployment, monthly once you're stable. Track false-positive rate as a first-class metric alongside hit rate.

Prompt-version increment discipline. Tie the version identifier to a build-time constant. Lint against changes to the prompt template without an accompanying version bump.

TTL revision cadence. Audit per-workload TTL quarterly. As workload patterns evolve (e.g. a previously-stable knowledge base starts updating more frequently), the TTL should follow.

Purge audit log. Every explicit purge should write an audit entry: who, when, what pattern, what triggered. Useful for post-mortems when "the cache started returning wrong things last Tuesday" investigations come up later.
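A purge audit entry can be as small as one JSON line per purge. The sketch below is one possible shape; the field names and `record_purge` are assumptions, and where the line gets written (file, log pipeline, database) is left to the application.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class PurgeAuditEntry:
    actor: str         # who ran the purge
    pattern: str       # what key pattern or tag was purged
    reason: str        # what triggered it
    keys_deleted: int  # how many entries were removed
    purged_at: str     # when, as an ISO 8601 UTC timestamp


def record_purge(
    actor: str,
    pattern: str,
    reason: str,
    keys_deleted: int,
    now: Optional[datetime] = None,
) -> str:
    """Serialise one purge as a JSON line ready to append to an audit log."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    entry = PurgeAuditEntry(actor, pattern, reason, keys_deleted, timestamp)
    return json.dumps(asdict(entry), sort_keys=True)
```

Recording the number of keys deleted alongside the pattern makes it easier to spot a purge that matched far more, or far fewer, entries than intended.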

How Prism implements invalidation

Prism's cache invalidation runs the four-strategy stack:

Default TTL: 1 hour. Per-project configurable on Pro+ (X-Prism-Cache-TTL header, range 60s–30d on Team tier; 60s–7d on Free + BYOK once v1.9 ships).
Prompt-version keying via fingerprint. The system prompt content is part of the SHA-256 fingerprint by default, so prompt changes invalidate the relevant cache slice automatically. Customers who use templated system prompts with injected variables can adopt explicit version keying via the X-Prism-Cache-Version header (rolling out in v1.9 alongside BYOK).
Semantic threshold: default 0.95. Per-project tuning on Pro+ via X-Prism-Cache-Threshold header. The cache inspector at /dashboard/cache surfaces hit-rate-at-threshold curves so customers can model the impact of tuning before committing.
Explicit purge: the cache inspector supports per-key delete + per-tag eviction. Pro+ accounts can purge their own project's cache from the dashboard.


Decision framework

If you're standing up cache invalidation discipline on a real workload:

Start with a 1-hour default TTL. Adjust per workload class once patterns emerge.
Include the full system prompt in your fingerprint. It's the right default; explicit versioning is an edge-case escape hatch.
Default semantic threshold to 0.95. Don't tune by intuition — validate by sample.
Wire explicit purge before you need it. Per-tag eviction is the most-useful primitive; build it once, use it sparingly.
Make the cache inspectable. Per-key visibility is what turns "the cache returned that?" from a 30-minute investigation into a 30-second one.
Run sampled false-positive validation weekly during ramp-up. Treat false-positive rate as a primary metric alongside hit rate.

The invalidation discipline compounds with the rest of the caching stack. A cache that hits aggressively and invalidates correctly is the production shape that delivers the 30-60% bill reduction the literature promises; either property alone doesn't.

Where to go next

For the parent caching framework: AI API caching. For the fingerprinting discipline that makes Layer 1 hits land: prompt cache fingerprinting pitfalls. For the Layer-2 threshold tuning detail: exact vs semantic caching for LLMs. For backend infrastructure choice: Redis vs vector cache for LLM responses.

For modelling caching impact on your workload: cache hit rate estimator.

FAQ

What's the right TTL if I'm just getting started?

1 hour as the default. It strikes a defensible balance: long enough for the cache to compound hits, short enough that staleness is bounded. Tune per workload once a few weeks of production traffic show which slices need shorter or longer TTLs.

Should I purge the cache when I deploy a new model version?

Only if the new model's behaviour differs from the old in ways that matter. If you're upgrading Claude Sonnet 4 to Claude Sonnet 4.7, the responses are similar enough that existing cache entries are typically still acceptable. If you're switching from Claude to GPT, the responses differ structurally and a purge is the right call. The principle: purge when the answers change, not when the model changes.

Does TTL apply to both Layer 1 (exact) and Layer 2 (semantic)?

Yes — both layers support TTL natively. Some teams set Layer 2 TTL higher than Layer 1 because semantic entries are more valuable per-entry (each catches more variations). Prism defaults to 1 hour on both with per-layer tuning available.

How do I handle GDPR right-to-deletion in the cache?

Tag cache entries with the user ID (or hash thereof) at write time; explicit purge by user-ID tag when a deletion request comes in. The cache is downstream of the request data, so cleaning it up on deletion requests is a hygiene item — not legally separate from the broader application data cleanup.

What's the operational cost of a wholesale cache flush?

Real but recoverable. After flush, every request misses and hits the provider. Hit rate climbs back as the cache re-warms; for a busy workload this takes hours; for a slow workload, days. Use wholesale flush only when you've established that the cache is genuinely poisoned and can't be selectively purged.

Can I version the cache by prompt template hash automatically?

Yes — and that's actually the cleanest pattern. Compute a hash of your prompt-template file at build time; include the hash as the implicit version key. When the template file changes, the hash changes, the cache fingerprints differ, old entries age out via TTL. No manual version increment, no risk of drift.

Should the semantic-cache threshold change over time?

It might, as your traffic mix evolves. A workload that was safe at threshold 0.93 with narrow-domain users may need to move to 0.95 as you onboard broader user cohorts. Audit threshold quarterly against false-positive rate; tune when the data justifies.

Cache invalidation is hard because it has to balance four objectives at once: freshness, correctness, hit rate, and operational simplicity. The four strategies stack to address each. Read the AI API caching guide for the layered cache strategy these invalidation patterns sit inside.
