A benchmark-driven look at semantic cache safety and intent isolation.
In the previous article, "Stop Paying Twice for the Same LLM Answer," I introduced PromptCache as a semantic caching layer designed to reduce LLM cost and latency.
The premise was simple: if two prompts are semantically similar, we shouldn't pay for the answer twice. The results were compelling: high cache hit rates, significant cost reduction, lower latency. But after deploying and stress-testing the design, a more important question emerged:
What guarantees that a cache hit is actually correct?
Reducing cost is easy.
Ensuring safe reuse is harder.
This article documents the experiment that reshaped PromptCache's architecture, and why intent isolation became a non-negotiable design constraint.
Measuring Semantic Cache Safety in LLM Systems
Most semantic caches follow a simple pattern:
- Embed the prompt
- Retrieve the nearest cached embedding
- If similarity ≥ threshold, reuse the answer
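A minimal sketch of that pattern in Python (names and threshold are illustrative; embeddings are assumed to be L2-normalized unit vectors):

```python
import numpy as np

def lookup(cache, query_vec, threshold=0.85):
    """Naive semantic cache lookup: return the cached answer whose
    embedding is closest to the query, if it clears the threshold."""
    best_sim, best_answer = -1.0, None
    for entry in cache:
        # Cosine similarity reduces to a dot product for unit vectors
        sim = float(np.dot(entry["vec"], query_vec))
        if sim > best_sim:
            best_sim, best_answer = sim, entry["answer"]
    return best_answer if best_sim >= threshold else None
```

Note that nothing here asks *what kind* of prompt produced each cached entry; the threshold is the only gate.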
This works well for performance.
But it assumes something that isn't guaranteed:
That semantic similarity implies safe reuse.
To test that assumption, I built a controlled benchmark.
The Question
Can cosine similarity thresholding alone guarantee safe reuse? Or do we need structural isolation between tasks? To answer this, I defined a metric:
Unsafe Hit
A cache hit is unsafe if the returned answer belongs to a different task (intent) than the incoming request. This measures semantic collision, not embedding quality.
What Is Intent Isolation?
Intent isolation means:
Partition the semantic cache by task boundary before performing similarity search.
Instead of searching across all cached entries, we search only within the matching task. Similarity becomes a refinement step, not a boundary mechanism.

Without isolation, similarity search spans all tasks in one shared embedding space.
With isolation, search is restricted to the matching intent partition.
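A sketch of the isolated variant, assuming the cache is keyed as a dict from intent_id to its own list of entries (structure and names are illustrative):

```python
import numpy as np

def lookup_isolated(cache, query_vec, intent_id, threshold=0.85):
    """Intent-isolated lookup: restrict the candidate set to the
    matching intent partition BEFORE running similarity search."""
    partition = cache.get(intent_id, [])  # unknown intent -> no candidates
    best_sim, best_answer = -1.0, None
    for entry in partition:
        sim = float(np.dot(entry["vec"], query_vec))
        if sim > best_sim:
            best_sim, best_answer = sim, entry["answer"]
    return best_answer if best_sim >= threshold else None
```

The partition lookup is the boundary; the threshold only refines matches inside it.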
Experimental Setup
I evaluated semantic caching across:
- Embedding model: all-MiniLM-L6-v2
- Backends:
- In-memory brute-force cosine
- Redis (HNSW via RediSearch)
- Workloads:
- Support queries
- RAG-style retrieval questions
- Creative prompts
- Threshold sweep: 0.82 → 0.92
- ~2400 requests per configuration
Two configurations were tested:
1. No Intent Isolation
All prompts shared the same semantic space.
2. Intent Isolation Enabled
Cache entries were partitioned by intent_id.
Each configuration was evaluated across identical prompt sequences to ensure comparability.
Unsafe hits were computed by comparing stored intent_id against request intent.
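The metric itself is simple to compute. A sketch, assuming each request is logged as a tuple of (was it a hit, the stored entry's intent_id, the request's intent):

```python
def unsafe_hit_rate(results):
    """results: list of (hit: bool, stored_intent, request_intent).
    An unsafe hit is a cache hit whose stored intent differs from
    the intent of the incoming request."""
    hits = [r for r in results if r[0]]
    if not hits:
        return 0.0
    unsafe = sum(1 for _, stored, requested in hits if stored != requested)
    return unsafe / len(hits)
```

Misses are excluded by construction: the metric judges only what the cache actually returned.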
Result 1 - Hit Rate Looked Excellent
Without intent isolation:
- Hit rate: ~97-99%
With intent isolation:
- Hit rate: 13-38% depending on threshold

At first glance, the non-isolated configuration looks superior. But this metric is incomplete.
Result 2 - Unsafe Hit Rate Reveals the Problem
Without intent isolation:
- Unsafe hit rate: ~95-100%
With intent isolation:
- Unsafe hit rate: 0%

This pattern was consistent across support, RAG, and creative workloads. In other words:
Almost every "successful" cache hit without isolation was incorrect.
This is not a marginal effect. It is structural cross-contamination.
Why Similarity Is Not Enough
Embedding similarity measures geometric proximity in vector space.
Intent boundaries are categorical.
Cosine similarity answers:
"Are these prompts semantically related?"
It does not answer:
"Are these prompts operationally interchangeable?"
Semantic closeness is continuous; task equivalence is discrete. Threshold tuning cannot convert a continuous metric into a categorical guarantee, but partitioning can.
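A toy illustration of the gap, using hand-made unit vectors rather than real embeddings (the numbers are purely illustrative):

```python
import numpy as np

# Two prompts mentioning the same topic but belonging to different tasks.
support_vec  = np.array([0.8, 0.6])      # intent: "support"
creative_vec = np.array([0.78, 0.6258])  # intent: "creative"

# Cosine similarity (dot product of unit vectors) lands well above 0.92,
# the strictest threshold in the sweep -- yet the intents still differ.
sim = float(np.dot(support_vec, creative_vec))
print(sim)
```

No threshold between 0.82 and 0.92 can separate this pair, because the boundary that matters is categorical, not geometric.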
Result 3 - Backend Did Not Affect Correctness
Both Redis (HNSW) and the in-memory backend produced identical:
- Hit rate curves
- Unsafe hit rate curves
This was expected: both implemented cosine nearest-neighbor search with identical threshold logic. Correctness was dominated by key structure, not by the vector store implementation.
Backend choice affects:
- Persistence
- Multi-process access
- Scalability
- Latency under load
But correctness properties should not depend on storage details!
Cost Savings Followed Hit Rate
In this benchmark, each miss triggered a full LLM call with similar token usage.
As a result, cost_savings ≈ hit_rate, which confirms internal consistency. But cost reduction is meaningless if reuse is unsafe: correctness precedes optimization.
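Under that uniform-cost assumption, the relationship falls out directly (numbers hypothetical):

```python
# Hypothetical numbers: every miss triggers one full LLM call,
# and every call costs roughly the same.
requests = 2400
hit_rate = 0.30
cost_per_call = 0.002  # USD per LLM call, hypothetical

baseline_cost = requests * cost_per_call                 # no cache
cached_cost = requests * (1 - hit_rate) * cost_per_call  # misses only
savings_fraction = 1 - cached_cost / baseline_cost       # equals hit_rate
```

The per-request cost cancels out, which is exactly why a cache that inflates its hit rate also inflates its reported savings.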
Production Implications
If you rely solely on similarity thresholding:
- You will inflate hit rates
- You will inflate cost savings
- You may silently reuse incorrect answers
This is particularly dangerous in:
- Multi-tenant systems
- Support bots
- RAG pipelines
- Tool-driven workflows
The correct architectural pattern is:
Partition first.
Threshold second.
Similarity is a refinement mechanism, not a safety boundary.
Limitations
This was a controlled benchmark.
- Dataset size was modest (~2-3k prompts)
- Workloads were synthetic but structured
- Extreme-scale recall behavior was not evaluated
- Concurrency stress was not measured
The goal was to isolate semantic collision behavior, not benchmark vector database scalability.
Future work should explore:
- Larger datasets
- Cross-model embedding drift
- Concurrency stress testing
- Partial response reuse
Core Insight
The dominant factor in semantic cache correctness is not:
- The embedding model
- The vector database
- The similarity threshold
It is key design.
Intent isolation is not an optimization.
It is a safety requirement.
Final Takeaway
A 98% cache hit rate looks impressive.
But without structural boundaries, it may be misleading.
If your semantic cache shows:
- 98% hit rate
- 98% cost savings
Ask yourself: how many of those hits are actually correct?
Optimization without isolation is probabilistic reuse. If you're building LLM infrastructure, this is not an academic nuance; it is a production concern.
Similarity optimizes reuse.
Isolation guarantees correctness.