
Tasos Nikolaou

PromptCache Part II: When High Cache Hit Rates Become Dangerous

A benchmark-driven look at semantic cache safety and intent isolation.

In the previous article, "Stop Paying Twice for the Same LLM Answer," I introduced PromptCache as a semantic caching layer designed to reduce LLM cost and latency.

The premise was simple: if two prompts are semantically similar, we shouldn't pay for the answer twice. The results were compelling: high cache hit rates, significant cost reduction, lower latency. But after deploying and stress-testing the design, a more important question emerged:

What guarantees that a cache hit is actually correct?

Reducing cost is easy.
Ensuring safe reuse is harder.

This article documents the experiment that reshaped PromptCache's architecture, and why intent isolation became a non-negotiable design constraint.


Measuring Semantic Cache Safety in LLM Systems

Most semantic caches follow a simple pattern:

  1. Embed the prompt
  2. Retrieve the nearest cached embedding
  3. If similarity ≄ threshold then reuse the answer
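The three steps above can be sketched as a minimal, brute-force cache. This is an illustrative implementation, not PromptCache's actual code; the class and method names are hypothetical:

```python
# Hypothetical minimal implementation of the three-step pattern above:
# embed -> nearest neighbor -> threshold check. Brute-force, in-memory.
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class NaiveSemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer)

    def store(self, embedding, answer):
        self.entries.append((np.asarray(embedding, dtype=float), answer))

    def lookup(self, embedding):
        # Step 2: retrieve the nearest cached embedding.
        best_sim, best_answer = -1.0, None
        for emb, answer in self.entries:
            sim = cosine(embedding, emb)
            if sim > best_sim:
                best_sim, best_answer = sim, answer
        # Step 3: reuse only if similarity clears the threshold.
        return best_answer if best_sim >= self.threshold else None
```

Note that `lookup` searches every cached entry regardless of what task produced it; that single shared search space is exactly what the rest of this article examines.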

This works well for performance.
But it assumes something that isn't guaranteed:

That semantic similarity implies safe reuse.

To test that assumption, I built a controlled benchmark.


The Question

Can cosine similarity thresholding alone guarantee safe reuse? Or do we need structural isolation between tasks? To answer this, I defined a metric:

Unsafe Hit

A cache hit is unsafe if the returned answer belongs to a different task (intent) than the incoming request. This measures semantic collision, not embedding quality.
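The metric is straightforward to compute from hit logs. A minimal sketch, assuming each hit is recorded as a `(request_intent, cached_intent)` pair (a hypothetical logging format, not the benchmark's actual schema):

```python
def unsafe_hit_rate(hits):
    """hits: (request_intent, cached_intent) pairs, one per cache hit.

    A hit is unsafe when the cached answer's intent differs from the
    request's intent, i.e. a semantic collision across task boundaries.
    """
    if not hits:
        return 0.0
    unsafe = sum(1 for request_intent, cached_intent in hits
                 if request_intent != cached_intent)
    return unsafe / len(hits)
```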


What Is Intent Isolation?

Intent isolation means:

Partition the semantic cache by task boundary before performing similarity search.

Instead of searching across all cached entries, we search only within the matching task. Similarity becomes a refinement step, not a boundary mechanism.

Figure 0 - Semantic Cache Search Space.
Without isolation, similarity search spans all tasks in one shared embedding space.

With isolation, search is restricted to the matching intent partition.
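A sketch of partitioned lookup, assuming the same brute-force cosine search but keyed by an `intent_id` (names here are illustrative, not PromptCache's API):

```python
# Sketch of intent-isolated lookup: partition first, threshold second.
from collections import defaultdict
import numpy as np

class IntentIsolatedCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold
        self.partitions = defaultdict(list)  # intent_id -> [(embedding, answer)]

    def store(self, intent_id, embedding, answer):
        self.partitions[intent_id].append(
            (np.asarray(embedding, dtype=float), answer))

    def lookup(self, intent_id, embedding):
        # Similarity search runs only inside the matching partition,
        # so a hit can never cross an intent boundary.
        emb_q = np.asarray(embedding, dtype=float)
        best_sim, best_answer = -1.0, None
        for emb, answer in self.partitions[intent_id]:
            sim = float(np.dot(emb_q, emb) /
                        (np.linalg.norm(emb_q) * np.linalg.norm(emb)))
            if sim > best_sim:
                best_sim, best_answer = sim, answer
        return best_answer if best_sim >= self.threshold else None
```

Even an identical embedding stored under a different intent cannot be returned, because it is never in the searched partition.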


Experimental Setup

I evaluated semantic caching across:

  • Embedding model: all-MiniLM-L6-v2
  • Backends:
    • In-memory brute-force cosine
    • Redis (HNSW via RediSearch)
  • Workloads:
    • Support queries
    • RAG-style retrieval questions
    • Creative prompts
  • Threshold sweep: 0.82 -> 0.92
  • ~2400 requests per configuration

Two configurations were tested:

1. No Intent Isolation

All prompts shared the same semantic space.

2. Intent Isolation Enabled

Cache entries were partitioned by intent_id.

Each configuration was evaluated across identical prompt sequences to ensure comparability.
Unsafe hits were computed by comparing stored intent_id against request intent.
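The comparison can be reproduced in miniature with a toy replay harness: feed one request sequence through both configurations and measure hit rate and unsafe hit rate. This is a simplified sketch, not the actual benchmark code:

```python
# Toy replay harness: same requests, with and without intent isolation.
import numpy as np

def cos(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def run(requests, threshold, isolate):
    """requests: (intent_id, embedding, answer) triples.
    Returns (hit_rate, unsafe_hit_rate)."""
    entries = []  # (intent_id, embedding, answer)
    hits = unsafe = 0
    for intent, emb, answer in requests:
        pool = [e for e in entries if not isolate or e[0] == intent]
        best = max(pool, key=lambda e: cos(emb, e[1]), default=None)
        if best is not None and cos(emb, best[1]) >= threshold:
            hits += 1
            unsafe += best[0] != intent  # hit served from a foreign intent
        else:
            entries.append((intent, emb, answer))  # miss: call LLM, cache it
    return hits / len(requests), (unsafe / hits if hits else 0.0)
```

With two near-identical embeddings under different intents, the non-isolated run scores a hit that is 100% unsafe, while the isolated run simply misses; that is the pattern the full benchmark exhibits at scale.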


Result 1 - Hit Rate Looked Excellent

Without intent isolation:

  • Hit rate: ~97-99%

With intent isolation:

  • Hit rate: 13-38%, depending on threshold

Figure 1 - Hit Rate vs Threshold. Without intent isolation, semantic caching achieves ~98% hit rate. Enabling intent partitioning significantly reduces reuse density.

At first glance, the non-isolated configuration looks superior. But this metric is incomplete.


Result 2 - Unsafe Hit Rate Reveals the Problem

Without intent isolation:

  • Unsafe hit rate: ~95-100%

With intent isolation:

  • Unsafe hit rate: 0%

Figure 2 - Unsafe Hit Rate vs Threshold. Similarity thresholding alone does not prevent cross-intent reuse. Nearly all cache hits become unsafe without intent partitioning.

This pattern was consistent across support, RAG, and creative workloads. In other words:

Almost every "successful" cache hit without isolation was incorrect.

This is not a marginal effect. It is structural cross-contamination.


Why Similarity Is Not Enough

Embedding similarity measures geometric proximity in vector space.
Intent boundaries are categorical.

Cosine similarity answers:

"Are these prompts semantically related?"

It does not answer:

"Are these prompts operationally interchangeable?"

Semantic closeness is continuous; task equivalence is discrete. Threshold tuning cannot convert a continuous metric into a categorical guarantee, but partitioning can.


Result 3 - Backend Did Not Affect Correctness

Both Redis (HNSW) and the in-memory backend produced identical:

  • Hit rate curves
  • Unsafe hit rate curves

This was expected: both backends implement cosine nearest-neighbor search with identical threshold logic. Correctness was dominated by key structure, not by the vector store implementation.

Backend choice affects:

  • Persistence
  • Multi-process access
  • Scalability
  • Latency under load

Correctness properties, however, should not depend on storage details.


Cost Savings Followed Hit Rate

In this benchmark, each miss triggered a full LLM call with similar token usage.

As a result, cost_savings ā‰ˆ hit_rate, which confirms internal consistency. But cost reduction is meaningless if reuse is unsafe: correctness precedes optimization.
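Under a toy cost model where every miss pays one full LLM call and hits are free (an assumption that roughly held in this benchmark, since misses had similar token usage), savings track hit rate exactly:

```python
# Toy cost model: a miss pays one full LLM call, a hit is (near-)free.
def cost_savings(hit_rate, miss_cost=1.0, hit_cost=0.0):
    baseline = miss_cost  # every request pays a full call without caching
    with_cache = hit_rate * hit_cost + (1 - hit_rate) * miss_cost
    return 1 - with_cache / baseline

# With free hits, savings equal the hit rate: cost_savings(0.98) -> 0.98
```

Which is precisely why a 98% savings figure says nothing on its own: it inherits whatever correctness the hit rate had.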


Production Implications

If you rely solely on similarity thresholding:

  • You will inflate hit rates
  • You will inflate cost savings
  • You may silently reuse incorrect answers

This is particularly dangerous in:

  • Multi-tenant systems
  • Support bots
  • RAG pipelines
  • Tool-driven workflows

The correct architectural pattern is:

Partition first.
Threshold second.

Similarity is a refinement mechanism, not a safety boundary.


Limitations

This was a controlled benchmark.

  • Dataset size was modest (~2-3k prompts)
  • Workloads were synthetic but structured
  • Extreme-scale recall behavior was not evaluated
  • Concurrency stress was not measured

The goal was to isolate semantic collision behavior, not benchmark vector database scalability.

Future work should explore:

  • Larger datasets
  • Cross-model embedding drift
  • Concurrency stress testing
  • Partial response reuse

Core Insight

The dominant factor in semantic cache correctness is not:

  • The embedding model
  • The vector database
  • The similarity threshold

It is key design.
Intent isolation is not an optimization.
It is a safety requirement.


Final Takeaway

A 98% cache hit rate looks impressive.
But without structural boundaries, it may be misleading.

If your semantic cache shows:

  • 98% hit rate
  • 98% cost savings

Ask yourself: how many of those hits are actually correct?
Optimization without isolation is probabilistic reuse. If you're building LLM infrastructure, this is not an academic nuance; it is a production concern.


Similarity optimizes reuse.
Isolation guarantees correctness.
