Designing a semantic cache layer for cost and latency optimization in LLM systems.
Most LLM cost isn’t spent on novelty.
It’s spent on repetition: requests that are semantically identical but syntactically different.
PromptCache was built to eliminate that redundancy.
The Invisible Cost Leak in LLM Systems
If you’re running an LLM in production, you are almost certainly paying for this:
- "How do I reset my password?"
- "I forgot my password, what do I do?"
- "Steps to reset account password?"
- "Help me change password"
Different strings.
Same intent.
Same answer.
Different billable request.
Traditional caching doesn't help because:
"How do I reset my password?" != "Steps to reset account password?"
Exact match fails.
But meaning hasn't changed.
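To make the failure concrete: classic memoization keys the cache on the raw string, so every paraphrase is a miss. A minimal sketch:

```python
exact_cache = {}

def exact_get_or_set(prompt, compute):
    # Classic exact-match memoization: the key is the literal string,
    # so "reset my password" and "how do I change my password?" are
    # treated as completely unrelated entries.
    if prompt not in exact_cache:
        exact_cache[prompt] = compute(prompt)
    return exact_cache[prompt]
```

Every rephrasing of the same question triggers a fresh (billable) computation.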
That's where semantic caching comes in.
The Theory: Why This Works
LLMs don't understand text as strings.
They convert text into vectors (embeddings).
Two sentences with similar meaning produce vectors that are close together in high-dimensional space.
Example (simplified):
"Reset my password"
↓
[0.12, -0.87, 0.44, ...]
"How do I change my password?"
↓
[0.11, -0.89, 0.41, ...]
These vectors are very close.
So instead of asking:
"Have I seen this exact string before?"
We ask:
"Have I seen something semantically similar before?"
If the similarity is high enough, we reuse the answer.
That's semantic caching.
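"Close" is usually measured with cosine similarity. A minimal sketch using the toy three-dimensional vectors above (real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

reset_pw = [0.12, -0.87, 0.44]   # "Reset my password"
change_pw = [0.11, -0.89, 0.41]  # "How do I change my password?"

print(round(cosine_similarity(reset_pw, change_pw), 4))  # → 0.9993
```

A similarity that close to 1.0 is well above a typical reuse threshold, so these two prompts would share a cache entry.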
How It Works in Practice
When a request comes in:
```
User Prompt
  ↓
Embedding
  ↓
Vector search in Redis
  ↓
High similarity?
  ├─ Yes → Return cached response
  └─ No  → Call LLM and store result
```
You're adding a semantic memoization layer in front of your LLM.
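The flow above can be sketched as a minimal in-memory version. The `embed` and `call_llm` parameters are hypothetical stand-ins for a real embedder and LLM client, and the linear scan stands in for a proper vector index like Redis:

```python
import math

THRESHOLD = 0.92
_store = []  # list of (embedding, response) pairs

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def get_or_set(prompt, embed, call_llm):
    vector = embed(prompt)  # User Prompt -> Embedding
    # Vector search: find the nearest stored embedding (linear scan here).
    best = max(_store, key=lambda entry: _cosine(vector, entry[0]), default=None)
    if best is not None and _cosine(vector, best[0]) >= THRESHOLD:
        return best[1], True  # High similarity -> return cached response
    response = call_llm(prompt)  # Otherwise call the LLM...
    _store.append((vector, response))  # ...and store the result
    return response, False
```

Same shape as the diagram: embed, search, compare against a threshold, and only fall through to the LLM on a miss.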
Real Results
In a support-heavy workload with repetitive queries:
- ~60% cache hit rate
- ~50% reduction in token usage
- ~40% lower API spend
Results vary by workload density and repetition patterns, but in structured environments, the impact is immediate.
Example Implementation
Here's a simplified example using Redis vector search:
```python
from promptcache import SemanticCache
from promptcache.backends.redis_vector import RedisVectorBackend
from promptcache.embedders.openai import OpenAIEmbedder
from promptcache.types import CacheMeta

embedder = OpenAIEmbedder(model="text-embedding-3-small")

backend = RedisVectorBackend(
    url="redis://localhost:6379/0",
    dim=embedder.dim,
)

cache = SemanticCache(
    backend=backend,
    embedder=embedder,
    namespace="support-bot",
    threshold=0.92,
)

meta = CacheMeta(
    model="gpt-4.1-mini",
    system_prompt="You are a helpful support assistant.",
)

result = cache.get_or_set(
    prompt="How can I change my password?",
    llm_call=my_llm_call,
    extract_text=lambda r: r.output_text,
    meta=meta,
)

print(result.cache_hit)
```
That's it.
No orchestration framework required.
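One note on the snippet above: `my_llm_call` is left undefined. All the cache needs is a callable whose return value works with the `extract_text` lambda. An offline stub (hypothetical, purely to illustrate the contract) could look like:

```python
from dataclasses import dataclass

@dataclass
class FakeResponse:
    output_text: str  # matches the extract_text lambda: r.output_text

def my_llm_call(prompt: str) -> FakeResponse:
    # Hypothetical stand-in for a real LLM client call. In production this
    # would hit your provider's API; it runs only on a cache miss.
    return FakeResponse(output_text=f"(stubbed answer for: {prompt})")
```

In production you'd swap the stub's body for your actual client call and return the provider's response object directly.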
If you want to try this approach, I packaged it up here:
Install:
```shell
pip install promptcache-ai
```
When This Works Best
Semantic caching is powerful when:
- Prompts are repetitive
- Temperature is low
- Answers are stable
- Volume is high
It won't help much for:
- Highly personalized prompts
- Creative writing
- Rapidly changing context
In those cases, novelty dominates repetition, and caching provides diminishing returns.
The Bigger Insight
Most LLM systems are fundamentally stateless.
They recompute answers even when nothing meaningful has changed.
Semantic caching introduces selective memory, reusing intelligence only when it is economically justified.
Instead of optimizing prompts endlessly, sometimes the smarter move is optimizing infrastructure.
If you're building LLM systems in production, semantic caching is one of the highest-leverage optimizations you can add.
But optimizing cost raised a more uncomfortable question:
What guarantees that a cache hit is actually correct?
In the next article, we examine how high hit rates can silently mask semantic errors, and why PromptCache evolved beyond threshold tuning.
Intelligence is expensive.
Memory is cheap.
Use both wisely.
Top comments (2)
Prompt caching is one of those optimizations that sounds obvious in retrospect but took me embarrassingly long to implement properly. The part that surprised me most was how much of our traffic was near-duplicate after we started logging and analyzing prompts. We had assumed each user request was unique, but it turned out about 30% were semantically equivalent queries phrased differently. One wrinkle I ran into: TTL management is trickier than it looks. We set cache TTL too long initially and started serving stale responses after we updated our system prompt. Now we version our system prompts and use that as part of the cache key. Curious if your implementation handles prompt versioning or if you just flush on deploy.
Great point! TTL alone isn't sufficient once behavior changes. Here, cache identity isn't just the prompt embedding. It's effectively `(namespace, model, ctx_hash, embedder)`, where `ctx_hash` is a deterministic hash over the instruction context (system prompt, tool schema, and so on). That means if we change the system prompt or tool schema, the logical cache partition changes automatically. No global flush required, and no risk of serving responses generated under a different instruction contract.
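A deterministic context hash of that shape can be sketched like this (illustrative only, not PromptCache's actual implementation; this version covers just the system prompt and tool schema):

```python
import hashlib
import json

def ctx_hash(system_prompt: str, tool_schema: dict) -> str:
    # Serialize deterministically (sorted keys, fixed separators) so the
    # same instruction context always yields the same digest.
    payload = json.dumps(
        {"system_prompt": system_prompt, "tool_schema": tool_schema},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

# Changing the system prompt changes the logical cache partition:
key_v1 = ("support-bot", "gpt-4.1-mini", ctx_hash("You are helpful.", {}), "text-embedding-3-small")
key_v2 = ("support-bot", "gpt-4.1-mini", ctx_hash("You are terse.", {}), "text-embedding-3-small")
```

Because the hash is part of the cache key, entries written under the old system prompt simply stop matching once the prompt changes.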
TTL is used purely for lifecycle control (memory pressure, economic windowing), not correctness guarantees.
In practice I've found that TTL controls the staleness horizon, while context hashing controls behavioral equivalence.
Without the second, semantic similarity alone can't guarantee safe reuse.
Versioning the system prompt explicitly, like you're doing, achieves the same isolation just at a more visible layer.
This is exactly where semantic caching becomes infrastructure engineering rather than just an optimization.