A benchmark-driven look at semantic cache safety and intent isolation.
In the previous article, "Stop Paying Twice for the Same LLM Answer," I introduced PromptCache as a semantic caching layer designed to reduce LLM cost and latency.
The premise was simple: if two prompts are semantically similar, we shouldn't pay for the answer twice. The results were compelling: high cache hit rates, significant cost reduction, lower latency. But after deploying and stress-testing the design, a more important question emerged:
What guarantees that a cache hit is actually correct?
Reducing cost is easy.
Ensuring safe reuse is harder.
This article documents the experiment that reshaped PromptCache's architecture, and why intent isolation became a non-negotiable design constraint.
Measuring Semantic Cache Safety in LLM Systems
Most semantic caches follow a simple pattern:
- Embed the prompt
- Retrieve the nearest cached embedding
- If similarity ≥ threshold, reuse the answer
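A minimal sketch of that pattern in Python (names and threshold are illustrative; embeddings are assumed to be L2-normalized unit vectors):

```python
import numpy as np

def lookup(cache, query_vec, threshold=0.85):
    """Naive semantic cache lookup: return the cached answer whose
    embedding is closest to the query, if it clears the threshold."""
    best_sim, best_answer = -1.0, None
    for entry in cache:
        # Cosine similarity reduces to a dot product for unit vectors
        sim = float(np.dot(entry["vec"], query_vec))
        if sim > best_sim:
            best_sim, best_answer = sim, entry["answer"]
    return best_answer if best_sim >= threshold else None
```

Note that nothing here asks *what kind* of prompt produced each cached entry; the threshold is the only gate.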
This works well for performance.
But it assumes something that isn't guaranteed:
That semantic similarity implies safe reuse.
To test that assumption, I built a controlled benchmark.
The Question
Can cosine similarity thresholding alone guarantee safe reuse? Or do we need structural isolation between tasks? To answer this, I defined a metric:
Unsafe Hit
A cache hit is unsafe if the returned answer belongs to a different task (intent) than the incoming request. This measures semantic collision, not embedding quality.
What Is Intent Isolation?
Intent isolation means:
Partition the semantic cache by task boundary before performing similarity search.
Instead of searching across all cached entries, we search only within the matching task. Similarity becomes a refinement step, not a boundary mechanism.

Without isolation, similarity search spans all tasks in one shared embedding space.
With isolation, search is restricted to the matching intent partition.
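A sketch of the isolated variant, assuming the cache is keyed as a dict from intent_id to its own list of entries (structure and names are illustrative):

```python
import numpy as np

def lookup_isolated(cache, query_vec, intent_id, threshold=0.85):
    """Intent-isolated lookup: restrict the candidate set to the
    matching intent partition BEFORE running similarity search."""
    partition = cache.get(intent_id, [])  # unknown intent -> no candidates
    best_sim, best_answer = -1.0, None
    for entry in partition:
        sim = float(np.dot(entry["vec"], query_vec))
        if sim > best_sim:
            best_sim, best_answer = sim, entry["answer"]
    return best_answer if best_sim >= threshold else None
```

The partition lookup is the boundary; the threshold only refines matches inside it.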
Experimental Setup
I evaluated semantic caching across:
- Embedding model: all-MiniLM-L6-v2
- Backends:
- In-memory brute-force cosine
- Redis (HNSW via RediSearch)
- Workloads:
- Support queries
- RAG-style retrieval questions
- Creative prompts
- Threshold sweep: 0.82 → 0.92
- ~2400 requests per configuration
Two configurations were tested:
1. No Intent Isolation
All prompts shared the same semantic space.
2. Intent Isolation Enabled
Cache entries were partitioned by intent_id.
Each configuration was evaluated across identical prompt sequences to ensure comparability.
Unsafe hits were computed by comparing stored intent_id against request intent.
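The metric itself is simple to compute. A sketch, assuming each request is logged as a tuple of (was it a hit, the stored entry's intent_id, the request's intent):

```python
def unsafe_hit_rate(results):
    """results: list of (hit: bool, stored_intent, request_intent).
    An unsafe hit is a cache hit whose stored intent differs from
    the intent of the incoming request."""
    hits = [r for r in results if r[0]]
    if not hits:
        return 0.0
    unsafe = sum(1 for _, stored, requested in hits if stored != requested)
    return unsafe / len(hits)
```

Misses are excluded by construction: the metric judges only what the cache actually returned.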
Result 1 - Hit Rate Looked Excellent
Without intent isolation:
- Hit rate: ~97-99%
With intent isolation:
- Hit rate: 13-38% depending on threshold

At first glance, the non-isolated configuration looks superior. But this metric is incomplete.
Result 2 - Unsafe Hit Rate Reveals the Problem
Without intent isolation:
- Unsafe hit rate: ~95-100%
With intent isolation:
- Unsafe hit rate: 0%

This pattern was consistent across support, RAG, and creative workloads. In other words:
Almost every "successful" cache hit without isolation was incorrect.
This is not a marginal effect. It is structural cross-contamination.
Why Similarity Is Not Enough
Embedding similarity measures geometric proximity in vector space.
Intent boundaries are categorical.
Cosine similarity answers:
"Are these prompts semantically related?"
It does not answer:
"Are these prompts operationally interchangeable?"
Semantic closeness is continuous; task equivalence is discrete. Threshold tuning cannot convert a continuous metric into a categorical guarantee, but partitioning can.
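A toy illustration of the gap, using hand-made unit vectors rather than real embeddings (the numbers are purely illustrative):

```python
import numpy as np

# Two prompts mentioning the same topic but belonging to different tasks.
support_vec  = np.array([0.8, 0.6])      # intent: "support"
creative_vec = np.array([0.78, 0.6258])  # intent: "creative"

# Cosine similarity (dot product of unit vectors) lands well above 0.92,
# the strictest threshold in the sweep -- yet the intents still differ.
sim = float(np.dot(support_vec, creative_vec))
print(sim)
```

No threshold between 0.82 and 0.92 can separate this pair, because the boundary that matters is categorical, not geometric.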
Result 3 - Backend Did Not Affect Correctness
Both Redis (HNSW) and the in-memory backend produced identical:
- Hit rate curves
- Unsafe hit rate curves
This was expected: both implemented cosine nearest-neighbor search with identical threshold logic. Correctness was dominated by key structure, not by the vector store implementation.
Backend choice affects:
- Persistence
- Multi-process access
- Scalability
- Latency under load
But correctness properties should not depend on storage details!
Cost Savings Followed Hit Rate
In this benchmark, each miss triggered a full LLM call with similar token usage.
As a result, cost_savings ≈ hit_rate, which confirms internal consistency. But cost reduction is meaningless if reuse is unsafe: correctness precedes optimization.
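Under that uniform-cost assumption, the relationship falls out directly (numbers hypothetical):

```python
# Hypothetical numbers: every miss triggers one full LLM call,
# and every call costs roughly the same.
requests = 2400
hit_rate = 0.30
cost_per_call = 0.002  # USD per LLM call, hypothetical

baseline_cost = requests * cost_per_call                 # no cache
cached_cost = requests * (1 - hit_rate) * cost_per_call  # misses only
savings_fraction = 1 - cached_cost / baseline_cost       # equals hit_rate
```

The per-request cost cancels out, which is exactly why a cache that inflates its hit rate also inflates its reported savings.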
Production Implications
If you rely solely on similarity thresholding:
- You will inflate hit rates
- You will inflate cost savings
- You may silently reuse incorrect answers
This is particularly dangerous in:
- Multi-tenant systems
- Support bots
- RAG pipelines
- Tool-driven workflows
The correct architectural pattern is:
Partition first.
Threshold second.
Similarity is a refinement mechanism, not a safety boundary.
Limitations
This was a controlled benchmark.
- Dataset size was modest (~2-3k prompts)
- Workloads were synthetic but structured
- Extreme-scale recall behavior was not evaluated
- Concurrency stress was not measured
The goal was to isolate semantic collision behavior, not benchmark vector database scalability.
Future work should explore:
- Larger datasets
- Cross-model embedding drift
- Concurrency stress testing
- Partial response reuse
Core Insight
The dominant factor in semantic cache correctness is not:
- The embedding model
- The vector database
- The similarity threshold
It is key design.
Intent isolation is not an optimization.
It is a safety requirement.
Final Takeaway
A 98% cache hit rate looks impressive.
But without structural boundaries, it may be misleading.
If your semantic cache shows:
- 98% hit rate
- 98% cost savings
Ask yourself: how many of those hits are actually correct?
Optimization without isolation is probabilistic reuse. If you're building LLM infrastructure, this is not an academic nuance; it is a production concern.
Similarity optimizes reuse.
Isolation guarantees correctness.