The bill was $41,200 in March. The team had been staring at the dashboard for two weeks, trying to figure out which feature flag to revert. None of them, it turned out. The fix was a Postgres migration with six lines of SQL and a twenty-line patch to one handler. (Figures shared with me by the team; I have not independently audited their billing.)
This is a pattern teams keep rediscovering and nobody bothers to write down properly: a small answer cache table in your existing Postgres, keyed by a hash of the prompt, with a TTL. You short-circuit obvious repeats before they ever leave your VPC. The April bill landed at $19,800, and the team traced the drop to a single migration.
The pattern is well-documented at this point. ProjectDiscovery published a detailed teardown showing a 59% LLM cost reduction from prompt caching, and AWS has a reference architecture for response caching that is essentially the same shape: hash, lookup, return-or-call. None of it requires Redis. None of it requires a vector database. You almost certainly already run Postgres.
The shape of the waste
Look at any production LLM workload for a week and you will find three classes of request:
1. Truly novel queries that need a model call.
2. Near-duplicates of yesterday's queries, with whitespace or punctuation drift.
3. Exact byte-identical repeats, usually from retries, page reloads, polling, or a cron job that forgot it ran.
Class 3 is the embarrassing one. Run a GROUP BY prompt_hash over a week of your request log and you will likely find a double-digit percentage of calls are exact-match retries from a frontend that re-sends on connection wobble. Another slice will be nightly jobs that re-summarize the same documents because nobody wired up a "did this change" check.
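If you want to see this in your own data, a query of roughly this shape will do it. The llm_request_log table and its prompt_hash and created_at columns are placeholders; substitute whatever your request log actually calls them.

```sql
-- Top repeated prompts over the last week. Anything with calls > 1 is a
-- request you paid for more than once.
SELECT prompt_hash, count(*) AS calls
FROM llm_request_log
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY prompt_hash
HAVING count(*) > 1
ORDER BY calls DESC
LIMIT 50;
```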
Class 2 is the one most teams reach for fancy semantic-cache tooling to solve. You don't need to. Class 3 alone is often half the bill, and a Postgres table with a hash column kills it cold.
The migration
Six lines that do real work, plus whatever boilerplate your migration tool adds:
```sql
CREATE TABLE llm_response_cache (
    prompt_hash BYTEA PRIMARY KEY,
    response    JSONB NOT NULL,
    model       TEXT NOT NULL,
    expires_at  TIMESTAMPTZ NOT NULL
);

CREATE INDEX ON llm_response_cache (expires_at);
```
That is the whole thing. prompt_hash is the SHA-256 of the canonicalized prompt plus the model name plus any parameters that affect output (temperature, top_p, system prompt version). JSONB keeps the full response payload, including token counts, so you can replay it through your normal logging path. expires_at is your TTL — most teams settle on 24 hours for chat features and 7-30 days for batch summarization.
The secondary index on expires_at exists for one reason: a nightly DELETE FROM llm_response_cache WHERE expires_at < NOW() that keeps the table from growing forever. Run it as a cron job, run it via pg_cron, run it from your scheduler. It does not matter. Just run it.
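If you already run the pg_cron extension, scheduling the sweep is one statement; the job name and time below are arbitrary, and any external scheduler issuing the same DELETE works just as well.

```sql
-- Nightly sweep of expired cache rows (03:15 server time).
SELECT cron.schedule(
  'llm-cache-sweep',
  '15 3 * * *',
  $$DELETE FROM llm_response_cache WHERE expires_at < NOW()$$
);
```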
A few production notes that will save you a debugging session:
- Hash the canonical form of the prompt. Strip trailing whitespace, normalize line endings, lowercase if your model is case-insensitive for your domain. Get this wrong and you will sit at a 0% cache hit rate for two days while your frontend silently appends a hidden newline.
- Include the model name in the hash. The same prompt on gpt-4o and gpt-4o-mini returns different things. You do not want to serve a mini response on a full-model request.
- Include any system-prompt version. If you change the system prompt and forget to bump the version, you will serve last week's behavior to this week's users.
- Do not include the user ID in the hash unless your responses are user-scoped. The whole point is that two users asking the same question hit the same cache entry.
The handler change
The Python below is the actual shape of the patch — about thirty lines. It assumes you have an asyncpg pool and a function call_model(prompt, model) that returns a dict.
```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(hours=24)

def prompt_hash(prompt: str, model: str, sys_prompt_v: str) -> bytes:
    # Canonicalize so whitespace drift does not produce a new cache key.
    canon = prompt.strip().replace("\r\n", "\n")
    # Fold anything else that changes the output (temperature, top_p, ...)
    # into this key if those vary per request.
    key = f"{model}|{sys_prompt_v}|{canon}".encode("utf-8")
    return hashlib.sha256(key).digest()
```
The canonicalization is the load-bearing part: strip whitespace, normalize line endings, and bake the model name and system-prompt version into the key. Get any of those wrong and your hits go to zero on the next deploy. Now the handler:
```python
async def cached_complete(pool, prompt, model, sys_prompt_v):
    h = prompt_hash(prompt, model, sys_prompt_v)
    async with pool.acquire() as conn:
        # Look up a non-expired entry for this exact hash.
        row = await conn.fetchrow(
            "SELECT response FROM llm_response_cache "
            "WHERE prompt_hash = $1 AND expires_at > NOW()",
            h,
        )
        if row is not None:
            # Hit: return the stored payload and flag it as cached.
            return json.loads(row["response"]), True

        # Miss: call the model, then upsert so a concurrent request for
        # the same prompt cannot make the insert fail.
        result = await call_model(prompt, model)
        await conn.execute(
            "INSERT INTO llm_response_cache "
            "(prompt_hash, response, model, expires_at) "
            "VALUES ($1, $2, $3, $4) "
            "ON CONFLICT (prompt_hash) DO UPDATE "
            "SET response = EXCLUDED.response, "
            "expires_at = EXCLUDED.expires_at",
            h,
            json.dumps(result),
            model,
            datetime.now(timezone.utc) + CACHE_TTL,
        )
        return result, False
```
The boolean return tells the caller whether it was a cache hit. You want that, because it goes straight into a metric. Without instrumentation, you will not know if your cache is working. You will only know that the bill is or is not lower, and by then it is too late to fix.
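At the call site that looks something like the sketch below. SYS_PROMPT_V and the metrics client are stand-ins for whatever prompt versioning and instrumentation you already have.

```python
# Hypothetical caller: record the hit/miss outcome the moment you get it.
result, hit = await cached_complete(pool, prompt, "gpt-4o", SYS_PROMPT_V)
metrics.counter("llm.cache.hit" if hit else "llm.cache.miss").inc()
```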
The savings shape
Here is what a realistic savings curve looks like. The numbers are snapshots the same team reported back to me at each TTL setting, on a B2B chat product running roughly 2.5M model calls per month before the migration.
| Cache strategy | Hit rate | Monthly LLM cost | Notes |
|---|---|---|---|
| None | 0% | ~$41K | Baseline |
| Exact-hash, 1h TTL | 18% | ~$34K | Easy to reason about |
| Exact-hash, 24h TTL | 41% | ~$24K | Common sweet spot |
| Exact-hash, 7d TTL | 53% | ~$19K | Risk: stale answers |
Source: snapshots shared by one B2B chat team at ~2.5M calls/month, 2025. Team-reported numbers, not an independent benchmark.
A 41% hit rate at 24h is normal for a chat product where users iterate on queries within a session. Batch summarization workloads frequently hit 80% or more, because the same documents get re-processed on every run. Redis published similar numbers (70%+ hit rates on high-repetition workloads), but they sell a cache, so you'd expect that. The shape holds even when you do it in the database you already pay for.
The other thing to notice: the curve flattens fast. Going from 24h to 7d gets you another ~$5K but starts producing user-visible staleness in chat. Most teams settle on 24h for interactive features, 7d for summarization, and a manual DELETE button for the support team to invalidate by hash when something needs to change.
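Invalidation does not need tooling. The two statements below cover the targeted case and the blunt one; $1 stands in for the digest your application computed for the entry you want to evict.

```sql
-- Targeted: drop one cached answer by its hash.
DELETE FROM llm_response_cache WHERE prompt_hash = $1;

-- Blunt: flush everything, e.g. after a system-prompt change that forgot
-- to bump the version baked into the hash.
TRUNCATE llm_response_cache;
```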
What this is not
A few things this pattern is not, before someone in the comments writes them.
It is not semantic caching. Semantic caching matches near-duplicates by embedding similarity, which is a real technique with real upside, and a real failure mode where it confidently serves the wrong answer to a question that was almost the right shape. Exact-hash caching can only be wrong if you forget to bump the hash inputs when the system prompt changes, which is why instrumentation matters.
It is not prompt-prefix caching as offered by Anthropic and OpenAI. Those work at the model layer and reduce the cost of the prefix tokens on every call. They compose with this pattern. You can have both. Prefix caching cuts your per-call cost; response caching skips the call entirely.
It is not a substitute for rate limiting, idempotency keys, or request deduplication at the API gateway. Those are still your job. The cache catches what slips through the rest of the stack.
Instrumenting the thing
You shipped it. Now you need to know it is working. Three counters and a histogram are enough.
```python
# pseudocode — wire to whatever you use
metrics.counter("llm.cache.hit").inc()
metrics.counter("llm.cache.miss").inc()
metrics.counter("llm.cache.write").inc()
metrics.histogram("llm.cache.lookup_ms").observe(lookup_ms)
```
Plot hit rate as hits / (hits + misses). Alert if it drops below your baseline by more than 10 points over an hour — that usually means a deploy changed the hash inputs and your cache is silently full of orphaned rows. Sneak a per-request ID into the system prompt to help with debugging, for example, and you will lose two days of cache hits before anyone notices.
Add one Postgres metric: pg_relation_size('llm_response_cache'). If it grows monotonically, your TTL cleanup is broken. Fix it before someone pages you about disk.
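The check is a single query; pg_size_pretty just makes the number readable.

```sql
-- Heap size of the cache table. If this only ever goes up, the expiry
-- sweep is not running (or autovacuum never reclaims the deleted rows).
SELECT pg_size_pretty(pg_relation_size('llm_response_cache'));
```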
When this is the wrong move
If your prompts are heavily user-personalized and never repeat across users, the hit rate will be near zero and you are better off investing in prompt-prefix caching, smaller models for cheap paths, or a router. If your responses must be fresh per request — financial advice, real-time data lookups, anything with a regulatory clock — the cache is a liability, not an asset. If you are at single-digit thousands of calls per month, the engineering time to build this exceeds the savings.
For everyone else: pull a week of your request log, group by prompt hash, and ship the migration before the next bill cycle closes.
If this was useful
The Database Playbook covers when Postgres is the right place for caches, queues, and audit tables — the patterns that keep showing up in production because the database you already run is good enough for problems people keep buying new tools to solve. The LLM Observability Pocket Guide is the companion: what to put on a model-call span, how to spot a regressing cache, and how to alert on cost before it surprises the CFO.