- Book: LLM Observability Pocket Guide
- Also by me: AI Agents Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Picture the bill that triggers the conversation: roughly $14,820 in a single week, illustrative but not far from what teams actually see. A customer-support copilot, embedding the same FAQ context into every prompt, paying for the same input tokens around 41,000 times across the seven days. The provider had just shipped automatic prompt caching, but suppose the deployment is self-hosted vLLM for compliance reasons. No automatic caching available. The first instinct is to reach for GPTCache. The second instinct, after looking at GPTCache's dependency graph, is to ask if 100 lines of Python would do.
It does. On a typical workload (chat-style copilot, RAG-over-docs, internal tooling), a tiny custom cache pays back the half-day it takes to write within a week. Sometimes within a day.
This post is the 100-line cache. The keying scheme, the TTL, the eviction policy, the semantic-similarity fallback, and the math on what it saves. Provider-agnostic on purpose: it sits in front of OpenAI, Anthropic, vLLM, Ollama, anything that takes prompt-in / text-out.
What the providers already give you
Before you write anything, know what's free. OpenAI's automatic prompt caching kicks in at 1,024 tokens, hits in 128-token increments, and gives you 50% off cached input tokens with zero code changes. Anthropic's prompt caching is opt-in via cache breakpoints, charges 1.25× base on cache write for the 5-minute tier and 0.1× on cache reads, and as of February 2026 isolates caches per workspace.
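If you're on Anthropic and just want the opt-in, it's one content-block flag. A minimal sketch against the anthropic Python SDK; the model string is a placeholder, and LONG_FAQ_CONTEXT and user_message stand in for your own values:

import anthropic

client = anthropic.Anthropic()

# Mark the long, stable block as cacheable; only the user turn varies call to call.
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; use whatever model you actually run
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_FAQ_CONTEXT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_message}],
)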
Both are excellent for the prefix-stable case: long system prompt, short variable user message. Neither helps when:
- Your inputs are short (under 1,024 tokens for OpenAI; the cache never engages).
- Two requests are semantically identical but byte-different ("how do I reset my password" vs "i forgot my password how do i reset").
- You're on a self-hosted model with no caching layer.
- You want the response itself cached, not just the input prefix. Different problem.
That last one is the big one. Provider prompt caching saves you input-token cost. A response cache saves you the entire call. On a workload where 30%+ of queries are duplicates or near-duplicates (a 2025 vendor analysis from Maxim put that number around 31% for typical LLM apps), that is the bigger win.
The cache, part one: exact-match with TTL and LRU
Here is the first half. Hash key over (prompt, system, model, temperature, max_tokens), in-memory store with a size cap, TTL, LRU eviction:
from __future__ import annotations

import hashlib
import json
import time
from collections import OrderedDict
from dataclasses import dataclass
from threading import Lock
from typing import Callable, Optional


@dataclass(frozen=True)
class CacheKey:
    prompt: str
    system: str
    model: str
    temperature: float
    max_tokens: int

    def hash(self) -> str:
        payload = json.dumps(
            {
                "p": self.prompt,
                "s": self.system,
                "m": self.model,
                "t": round(self.temperature, 4),
                "mt": self.max_tokens,
            },
            sort_keys=True,
            ensure_ascii=False,
        ).encode("utf-8")
        return hashlib.blake2b(payload, digest_size=16).hexdigest()


@dataclass
class CacheEntry:
    response: str
    embedding: Optional[list[float]]
    created_at: float
    hits: int = 0


class LLMCache:
    def __init__(
        self,
        max_size: int = 5000,
        ttl_seconds: int = 3600,
    ):
        self._store: OrderedDict[str, CacheEntry] = OrderedDict()
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._lock = Lock()

    def _is_fresh(self, entry: CacheEntry) -> bool:
        return (time.time() - entry.created_at) < self._ttl

    def get_exact(self, key: CacheKey) -> Optional[str]:
        h = key.hash()
        with self._lock:
            entry = self._store.get(h)
            if entry and self._is_fresh(entry):
                self._store.move_to_end(h)
                entry.hits += 1
                return entry.response
            if entry:
                del self._store[h]
            return None

    def put(
        self,
        key: CacheKey,
        response: str,
        embedding: Optional[list[float]] = None,
    ) -> None:
        h = key.hash()
        with self._lock:
            self._store[h] = CacheEntry(
                response=response,
                embedding=embedding,
                created_at=time.time(),
            )
            self._store.move_to_end(h)
            while len(self._store) > self._max_size:
                self._store.popitem(last=False)
Five things worth pointing out before the second half lands.
The key includes temperature because two calls at temperature 0 and temperature 0.7 to the same prompt are not the same call. Caching the second as the first will hand a deterministic answer to a place that asked for variability. Round to four decimals so floating-point noise doesn't fragment the cache.
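Concretely, with SYSTEM_PROMPT standing in for your own system prompt:

# Same prompt, different sampling settings: different cache slots.
base = dict(prompt="how do I reset my password", system=SYSTEM_PROMPT,
            model="gpt-4o", max_tokens=256)
greedy = CacheKey(temperature=0.0, **base)
sampled = CacheKey(temperature=0.7, **base)
assert greedy.hash() != sampled.hash()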
The key does not include the request ID, the user ID, or any timestamp. Those would shatter the cache on every call. If your responses are user-specific (e.g., the system prompt encodes user data), the user-specific data lives inside system and is already hashed. If you have responses that must never be shared across users for compliance reasons, namespace by user: make prompt carry a user prefix, or run a per-user cache instance.
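One way to do the namespacing, if you need it; user_scoped_key is an illustrative helper, not part of the cache itself:

def user_scoped_key(user_id: str, prompt: str, system: str,
                    model: str, temperature: float, max_tokens: int) -> CacheKey:
    # Prefix the prompt with the user id so responses never leak across users.
    return CacheKey(
        prompt=f"user:{user_id}\n{prompt}",
        system=system,
        model=model,
        temperature=temperature,
        max_tokens=max_tokens,
    )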
OrderedDict plus move_to_end is a 4-line LRU. No need for cachetools, functools.lru_cache (which doesn't do TTL cleanly), or Redis. If you outgrow process-local, swap the dict for Redis with TTL and you're done.
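If that day comes, the swap is mechanical. A rough sketch assuming redis-py, exact-match only (the semantic layer stays in-process); RedisLLMCache is illustrative, not part of the 100 lines:

import redis

class RedisLLMCache:
    """Exact-match layer backed by Redis; Redis owns TTL and eviction
    (set maxmemory-policy allkeys-lru on the server)."""

    def __init__(self, ttl_seconds: int = 3600, url: str = "redis://localhost:6379/0"):
        self._r = redis.Redis.from_url(url)
        self._ttl = ttl_seconds

    def get_exact(self, key: CacheKey) -> Optional[str]:
        raw = self._r.get(f"llmcache:{key.hash()}")
        return raw.decode("utf-8") if raw is not None else None

    def put(self, key: CacheKey, response: str) -> None:
        self._r.setex(f"llmcache:{key.hash()}", self._ttl, response)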
blake2b with a 16-byte digest gives you a 32-character hex key with negligible collision risk and is faster than sha256 on most CPUs. The byte count matters when you have millions of entries.
The Lock is intentional. The OrderedDict.move_to_end plus popitem sequence is not thread-safe on its own. You will use this from a Flask or FastAPI handler, and concurrent requests for the same key are exactly the case where it'll race.
The cache, part two: semantic-similarity fallback
The exact-match cache catches the duplicates. The semantic fallback catches the near-duplicates: the same question asked with different phrasing. This is where the meaningful uplift comes from on chat-style workloads.
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


class SemanticLLMCache(LLMCache):
    def __init__(
        self,
        embedder: Callable[[str], list[float]],
        similarity_threshold: float = 0.93,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self._embedder = embedder
        self._threshold = similarity_threshold

    def get(self, key: CacheKey) -> Optional[str]:
        hit = self.get_exact(key)
        if hit is not None:
            return hit
        query_emb = self._embedder(key.prompt)
        best_score = 0.0
        best_resp: Optional[str] = None
        with self._lock:
            for entry in self._store.values():
                if not self._is_fresh(entry) or entry.embedding is None:
                    continue
                score = cosine(query_emb, entry.embedding)
                if score > best_score:
                    best_score, best_resp = score, entry.response
        if best_score >= self._threshold:
            return best_resp
        return None

    def put_with_embedding(
        self, key: CacheKey, response: str
    ) -> None:
        emb = self._embedder(key.prompt)
        self.put(key, response, embedding=emb)


def cached_call(
    cache: SemanticLLMCache,
    key: CacheKey,
    call_llm: Callable[[CacheKey], str],
) -> str:
    hit = cache.get(key)
    if hit is not None:
        return hit
    response = call_llm(key)
    cache.put_with_embedding(key, response)
    return response
That's the whole thing. Let's pick at the parts that matter.
The 0.93 threshold is not a guess. Lower than 0.9 and you start serving "how do I cancel" answers to "how do I pause": semantically close in embedding space, operationally catastrophic. Higher than 0.96 and the cache barely helps; only typo-level variants hit. 0.92 to 0.94 is the band where most teams land after measuring on their own corpus. Your number lives or dies on the embedding model and the domain. Measure before committing.
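Measuring is cheap. Pull pairs of queries from your logs, label them same-intent or not, and sweep the threshold. A sketch that reuses the cosine helper from above; sweep_thresholds, labeled_pairs, and embed are names I'm introducing for illustration:

def sweep_thresholds(
    labeled_pairs: list[tuple[str, str, bool]],  # (query_a, query_b, same_intent)
    embed: Callable[[str], list[float]],
    thresholds: tuple[float, ...] = (0.90, 0.92, 0.93, 0.94, 0.96),
) -> None:
    scored = [(cosine(embed(a), embed(b)), same) for a, b, same in labeled_pairs]
    for t in thresholds:
        true_hits = sum(1 for s, same in scored if s >= t and same)
        false_hits = sum(1 for s, same in scored if s >= t and not same)
        total_same = sum(1 for _, same in scored if same) or 1
        print(f"{t:.2f}: recall={true_hits / total_same:.0%}  false_hits={false_hits}")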
Linear scan is fine to start. A 5,000-entry cache with 1,536-dim embeddings cosine-scans in under 5ms on a single core once the stored vectors sit in a numpy matrix; the pure-Python loop above is markedly slower, so vectorize the scan before anything else. If you push past 50,000 entries and even the vectorized scan hurts, swap to FAISS or a small in-process HNSW index. Don't pre-optimize.
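A sketch of that vectorized scan, assuming numpy; best_semantic_match is a helper name I'm introducing, meant to replace the per-entry loop inside get():

import numpy as np

def best_semantic_match(
    query_emb: list[float],
    entries: list[CacheEntry],
) -> tuple[float, Optional[str]]:
    # One matvec over the whole store instead of a Python loop per entry.
    candidates = [e for e in entries if e.embedding is not None]
    if not candidates:
        return 0.0, None
    mat = np.asarray([e.embedding for e in candidates], dtype=np.float32)
    q = np.asarray(query_emb, dtype=np.float32)
    sims = (mat @ q) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q) + 1e-12)
    i = int(np.argmax(sims))
    return float(sims[i]), candidates[i].response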
Embeddings only on put_with_embedding. You're paying a cheap embedding call (think text-embedding-3-small at roughly $0.02/M tokens per OpenAI's pricing page as of early 2026) to save expensive completion calls. The economics lopside hard: an embedding for a typical query is roughly 0.3% the cost of the completion it might replace.
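The embedder the cache takes is just a Callable[[str], list[float]]. A thin wrapper over the OpenAI embeddings endpoint, assuming the official openai SDK, looks like this:

from openai import OpenAI

_openai = OpenAI()

def openai_embedder(text: str) -> list[float]:
    # Matches the embedder signature SemanticLLMCache expects.
    resp = _openai.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding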
The fallback only fires after exact-match misses. Always cheap path first. The semantic search is the expensive path; it should be skipped whenever the exact-match layer hits, which on most workloads accounts for the majority of cache hits.
What it actually saves
Illustrative figures based on public GPT-4o pricing as of early 2026 ($2.50/M input, $10/M output); your numbers will vary with model, traffic shape, and pricing changes. A customer-support copilot, 8,000 conversations a week, average 6 turns (so ~48,000 calls), average input 1,400 tokens, output 280 tokens. Per-call cost runs about $0.0063, so without caching the weekly completion bill lands near $302.
Drop in the exact-match cache. On a workload like this, exact-match hit rate often runs around 22% (duplicates from common questions, retries, refresh hits). Cost drops to about $236/week. Cache costs nothing extra; the lookup is in-memory.
Add the semantic layer at threshold 0.93. Combined hit rate climbs to roughly 41%. Cost drops to about $178/week. Embedding cost added: about $4/week. Net savings versus baseline: roughly $120/week on this scenario.
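The arithmetic, so you can rerun it with your own traffic shape and prices:

calls_per_week = 8_000 * 6                                   # conversations x avg turns
cost_per_call = 1_400 / 1e6 * 2.50 + 280 / 1e6 * 10.00       # GPT-4o input + output
baseline = calls_per_week * cost_per_call                     # ~$302/week

exact_only = baseline * (1 - 0.22)                            # ~$236/week
with_semantic = baseline * (1 - 0.41)                         # ~$178/week
embedding_overhead = 4.0                                      # ~$4/week, conservative
net_savings = baseline - with_semantic - embedding_overhead   # ~$120/week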
The shape that matters: the cache pays for the half-day of writing it inside the first week, then keeps paying. The savings scale linearly with traffic. At 41,000 calls/week the same pattern projects to around $100/week in saved completion cost; on heavier workloads with longer prompts (the FAQ-context-stuffed case from the opening), the savings climb into the four-figure-per-week range fast.
What this cache won't do
Honest accounting. This cache assumes responses are deterministic enough at the same temperature to be reusable. If you're running at temperature 0.9 to get creative variation, caching is the wrong tool. Cache only the deterministic-by-design calls: classifications, extractions, FAQ-style lookups, structured-output calls.
It also doesn't handle invalidation when your underlying knowledge changes. If the cache says "our refund window is 30 days" and your policy moves to 14 days, the cache will keep serving the stale answer until the TTL expires. Set the TTL to match the volatility of your domain: 1 hour for support content that moves weekly, 24 hours for stable product docs, 5 minutes for anything that touches user state.
And it does not replace proper LLM observability. You still want traces on your LLM calls, hit-rate metrics on the cache itself, and alarms on cache size and eviction rate.
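For the hit-rate number specifically, you don't need a metrics stack on day one. A counter around the call-site wrapper is enough to start; cache_stats and instrumented_cached_call are names I'm introducing here, and in production you'd swap the Counter for your metrics client:

from collections import Counter

cache_stats: Counter[str] = Counter()

def instrumented_cached_call(
    cache: SemanticLLMCache,
    key: CacheKey,
    call_llm: Callable[[CacheKey], str],
) -> str:
    hit = cache.get(key)
    cache_stats["hit" if hit is not None else "miss"] += 1
    if hit is not None:
        return hit
    response = call_llm(key)
    cache.put_with_embedding(key, response)
    return response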
Drop-in pattern
The full integration into a typical LLM call site is one line:
response = cached_call(cache, key, lambda k: client.complete(...))
That's the contract. Everything above is implementation detail. The cache instance lives at module level, the embedder is a thin wrapper around your embedding provider, and call_llm is whatever function you were already using. If you have GPTCache or LangChain caching wired up and it's working for you, leave it. If you're staring at a four-figure weekly bill and thinking about a vendor evaluation, write the 100 lines first.
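Spelled out, with the names from this post (openai_embedder from the sketch above, SYSTEM_PROMPT and client standing in for whatever you already have):

# Module-level wiring: one cache per process, reused across requests.
cache = SemanticLLMCache(
    embedder=openai_embedder,      # any Callable[[str], list[float]]
    similarity_threshold=0.93,
    max_size=5_000,
    ttl_seconds=3_600,
)

def answer(question: str) -> str:
    key = CacheKey(
        prompt=question,
        system=SYSTEM_PROMPT,
        model="gpt-4o",
        temperature=0.0,
        max_tokens=512,
    )
    return cached_call(cache, key, lambda k: client.complete(...))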
If this was useful
Caching is the cheapest LLM-cost intervention you can ship. The LLM Observability Pocket Guide covers what to put on a cache span, how to track hit rate without skewing the latency histogram, and the failure modes that look like cost wins but are actually correctness regressions. The AI Agents Pocket Guide has the agent-specific section: when caching tool calls is safe, when it's a foot-gun, and the agent patterns where cache-hit telemetry doubles as quality signal.

