Kristian Ivanov

Posted on • Originally published at betterdb.com

I Tested 28 Query Pairs to See if Semantic Caches Actually Lie to Users. The Result Surprised Me

I had a clean theory about how RAG caches silently corrupt your answers. Then I built one, ran the numbers, and the actual failure mode was the opposite of what I expected.

Let's face it

Most of us building RAG pipelines treat the LLM call as atomic. Query comes in, embed, retrieve, generate, return. If we cache anything, we slap a Redis in front of it, key by the query string, call it a day, and move on to the next ticket.

Then someone tells you "we should use semantic caching, embeddings will catch the paraphrases" and you nod, set a similarity threshold of 0.95 because that sounds reasonable, and ship it.

Last weekend I decided to actually build the thing end-to-end and see what shakes out. The setup: a public RAG chatbot trained on the docs of three RESP-compatible databases - Valkey, Redis, and Dragonfly - with full caching infrastructure underneath and live observability on every turn. The whole thing is at chat.betterdb.com and it shows you live data for each query, as well as the aggregates. It is also public at github.com/BetterDB-inc/playground-chat if you want to poke at it.

I went into the weekend with a theory I was pretty sure about: semantic caches silently lie to your users. Two queries one word apart, like "Does Dragonfly support WAIT?" vs "Does Valkey support WAIT?", would collide in embedding space, both above the 0.95 threshold, and the cache would confidently serve the wrong answer to whoever asked second. Cache poisoning. Confident wrong answers. Drama.

By Sunday night I had the chatbot running, the cache instrumented, and a script measuring real similarity numbers. The data said something else. And the something-else turns out to be more interesting than my theory.

The Weekend Setup

The architecture I shipped is a three-tier cache:

Tier 1: Exact-match KV cache. Keyed by the normalized query string. Sub-millisecond. No embeddings. If the query has been seen verbatim before, return the cached answer and stop.

Tier 2: Semantic cache. Only consulted on tier 1 miss. Embed the query (text-embedding-3-small), compare against stored embeddings, return the cached answer if cosine similarity is above threshold.

Tier 3: The actual RAG pipeline. Only on tier 2 miss. Retrieve from a Valkey-backed vector store, generate an answer, store the result back into both caches.
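
In code, the lookup order is roughly this - a minimal sketch assuming an in-memory dict for tier 1 and a plain list of stored embeddings for tier 2 (the production version sits on Valkey; the `embed` and `run_pipeline` callables and all names here are illustrative, not the library API):

```python
import numpy as np

def normalize(query: str) -> str:
    # Tier 1 key: lowercase and collapse whitespace. Real normalization may do more.
    return " ".join(query.lower().split())

def lookup(query, exact_cache, semantic_cache, embed, run_pipeline, threshold=0.85):
    """exact_cache: {normalized_query: answer}
    semantic_cache: list of (unit-norm embedding, answer) tuples
    embed / run_pipeline: callables supplied by the caller"""
    # Tier 1: byte-identical (after normalization) queries. No embedding call.
    key = normalize(query)
    if key in exact_cache:
        return exact_cache[key], "tier1"

    # Tier 2: embed the query and compare against everything already cached.
    q = np.asarray(embed(query))                 # e.g. text-embedding-3-small
    q = q / np.linalg.norm(q)
    for emb, answer in semantic_cache:
        if float(np.dot(q, emb)) >= threshold:   # cosine similarity on unit vectors
            return answer, "tier2"

    # Tier 3: full retrieval + generation, then write back into both caches.
    answer = run_pipeline(query)
    exact_cache[key] = answer
    semantic_cache.append((q, answer))
    return answer, "tier3"
```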

The reason this stacks instead of collapsing into a single semantic cache is that real traffic has two completely different shapes of repetition.

Shape one: machines retrying the same call with the same arguments. Polling. Retries on transient errors. Agents calling the same tool with the same parameters in a deterministic loop. Copy-pasted questions from documentation. Selecting pre-defined suggestions from the chat itself. These produce byte-identical query strings and the cache hit on them should be trivially cheap.

Shape two: humans rephrasing the same intent in different words. "What is Valkey?" vs "Tell me about Valkey." vs "Explain Valkey to me." Same answer, three different strings. These need semantic matching to catch.

Tier 1 catches shape one cheaply. Tier 2 catches shape two when the embedding model cooperates. Tier 3 only runs when both have legitimately failed.

After a few hundred queries, the chatbot was running at ~71% combined hit rate, with tier 1 catching most of those. Tier 1 hits return in under 5ms, tier 2 in roughly 100ms (the embedding call dominates), and a full miss that falls through to tier 3 takes 2-10s depending on the model. At chatbot scale, the dollar savings are unimpressive - fractions of a cent - but the latency compounds into something users actually notice.

So far so good. The architecture worked. Then I went hunting for the cache poisoning case.

What I Expected vs What I Actually Measured

I wrote a script that fed text-embedding-3-small 28 query pairs and measured cosine similarity on each. Twenty-four were entity-swap pairs (the same question template with Redis, Valkey, or Dragonfly swapped in). Four were paraphrase pairs (same question, different wording, no entity swap).
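
The script itself is nothing fancy - roughly the following, using the OpenAI Python client (the pairs shown are a sample of the 28, not the full set):

```python
import numpy as np
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def pair_similarity(a: str, b: str, model: str = "text-embedding-3-small") -> float:
    # Embed both queries in one call and return their cosine similarity.
    resp = client.embeddings.create(model=model, input=[a, b])
    u, v = (np.array(d.embedding) for d in resp.data)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

pairs = [
    # entity swap: same question, different database
    ("Does Dragonfly support WAIT?", "Does Valkey support WAIT?"),
    # paraphrase: same question, different wording
    ("What is Valkey?", "Tell me about Valkey."),
]

for a, b in pairs:
    print(f"{pair_similarity(a, b):.4f}  {a!r} vs {b!r}")
```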

My hypothesis: the entity swaps would land in the dangerous zone - high enough similarity (0.92+) that a typical 0.95 threshold would treat them as the same query, even though they need different answers.

What actually happened:

Cosine similarity, text-embedding-3-small:

```
ENTITY SWAPS (different question, swap Redis/Valkey/Dragonfly)
0.1507  "What is Redis?" vs "What is Valkey?"
0.6264  "How do I configure persistence in Redis?" vs "...in Valkey?"
0.6320  "Does Dragonfly support WAIT?" vs "Does Valkey support WAIT?"
0.7391  "How do I enable AOF persistence in Valkey?" vs "...in Dragonfly?"  (highest)

PARAPHRASES (same question, different wording)
0.8139  "What is Valkey?" vs "Explain Valkey to me."
0.8442  "What is Valkey?" vs "Tell me about Valkey."
0.8830  "How do I install Redis?" vs "What's the installation process for Redis?"
0.9145  "How does Redis cluster work?" vs "Can you explain how Redis clustering works?"
```

Entity swaps maxed out at 0.74. Paraphrases topped out at 0.91.

Entity swaps clustered between 0.44 and 0.74. Not a single fork-disambiguation pair crossed even a generous 0.85 threshold. "Does Dragonfly support WAIT?" vs "Does Valkey support WAIT?" landed at 0.6320. "How do I configure persistence in Redis?" vs "How do I configure persistence in Valkey?" landed at 0.6264. The highest entity-swap pair in my whole set - "How do I enable AOF persistence in Valkey?" vs "How do I enable AOF persistence in Dragonfly?" - only hit 0.7391.

In other words: text-embedding-3-small is good at this. The model has learned that swapping a database name produces a meaningfully different query. My dramatic cache-poisoning hook doesn't exist on this model.

Paraphrase pairs landed between 0.81 and 0.91. This is where it gets interesting. "What is Valkey?" vs "Tell me about Valkey." - same intent, different words, the exact case semantic caching is supposed to catch - sits at 0.8442. "How does Redis cluster work?" vs "Can you explain how Redis clustering works?" sits at 0.9145.

The actual situation is the opposite of what I expected to find. Entity swaps are safe. Paraphrases are the borderline case.

The Real Failure Mode (And Why It Matters in Production)

If you set your threshold at the commonly-recommended 0.95, here's what your cache actually does on this data:

  • Misses every entity swap. Good. Those need different answers.
  • Misses most legitimate paraphrases. Bad. "Tell me about Valkey" at 0.84 won't hit the cache for "What is Valkey?" - even though the answer is the same.

If you drop the threshold to 0.85 to catch paraphrases:

  • Still misses entity swaps. Good.
  • Catches some paraphrases. Better.
  • Borderline pairs at 0.81-0.83 still miss, so some legitimate paraphrases still fall through to tier 3.

There is no threshold setting that catches all paraphrases without false positives, but the spread is narrower and lower than I thought going in. The threshold you pick determines which paraphrases you give up on, not whether you accidentally serve wrong-fork answers.

This changes the production story. The dangerous failure mode isn't "your cache is lying to your users." The dangerous failure mode is "your cache is missing more legitimate hits than you realize, and you're paying for tier 3 retrievals you shouldn't be."

That maps directly to what I was watching on the chatbot's live dashboard over the weekend. The hits were happening on tier 1 most of the time. Tier 2 was the layer that was supposed to catch the human paraphrases - and it was working, but with a wider miss band than I'd assumed when I picked the threshold.

Five Ways to Address the Borderline Band

The interesting band is 0.80-0.92. Below 0.80, the cache misses anyway. Above 0.92, a sane threshold lets it through. The question is what to do about the legitimate paraphrases that fall in between.

1. Lower the threshold. Drop from 0.95 to 0.85 and you catch more paraphrases. You also start risking false positives on whatever lives in that band - which on text-embedding-3-small is mostly fine for entity disambiguation but might not hold for queries with subtler differences.

2. Per-namespace caches. Partition by detected entity. A query mentioning Dragonfly only checks against cached Dragonfly queries. Useful if your domain has a small set of high-value entities. Requires named-entity recognition (a model call) and fails on queries that don't name the entity.

3. Query rewriting before embedding. Use a small fast model to canonicalize the query - resolve pronouns, expand abbreviations, name implicit subjects - before computing the embedding. This pulls borderline paraphrases up into hit range by stripping surface variation.

4. Ensemble matching. Compute similarity across multiple embedding spaces and require agreement. Catches some borderline hits the single embedding misses. Doubles your embedding cost.

5. LLM as judge. On a tier 2 candidate match in the borderline band, ask a small fast model whether the cached answer actually applies to the new query. The judge sees both queries plus the cached answer and returns yes or no.

The fifth option is the one that actually changes the shape of the problem. The first four just swap one tradeoff for another. LLM-as-judge handles intent-level distinctions that embedding similarity flattens. Cost is one cheap model call (~200-400ms, sub-cent on a 4o-mini or Haiku class model) on every borderline candidate. That's still dramatically cheaper than a tier 3 miss.
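
Here's a minimal sketch of the judge call, assuming a 4o-mini-class model through the OpenAI client - the prompt wording, function name, and model choice are illustrative, not a built-in of the caching libraries:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are validating a semantic cache. Given a NEW question, a CACHED question, "
    "and the CACHED answer, reply with exactly 'yes' if the cached answer fully and "
    "correctly answers the new question, otherwise reply 'no'."
)

def cached_answer_applies(new_query: str, cached_query: str, cached_answer: str,
                          model: str = "gpt-4o-mini") -> bool:
    # Only invoked for tier 2 candidates whose similarity falls in the borderline
    # band (roughly 0.80-0.92); clear hits and clear misses skip the judge entirely.
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": (
                f"NEW question: {new_query}\n"
                f"CACHED question: {cached_query}\n"
                f"CACHED answer: {cached_answer}"
            )},
        ],
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")
```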

For low-stakes caches, just pick a threshold and accept the miss rate. For caches where missing a legitimate paraphrase costs you (every uncaught paraphrase is a tier 3 retrieval you didn't need to pay for), wire up a judge for the borderline band.

What I Wish I'd Instrumented From Day One

The instrumentation is what made the whole weekend worth it. Without it, the cache would have been a black box and I'd still be telling the dramatic-but-wrong story I started with.

Minimum viable instrumentation:

  • Per-turn hit/miss for each tier, with similarity score on tier 2
  • The cached query that matched (or almost matched), so you can audit borderline cases
  • Latency saved per hit, computed against rolling average miss latency
  • Tokens and dollars saved per hit, against the per-model rate card
  • Distribution of similarity scores on near-misses, so you can see what's sitting just below your threshold

Expose all of it as OpenTelemetry spans and Prometheus metrics. You get tracing, alerting, and dashboards for free.
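
A sketch of what that can look like with the standard Python clients - the span attributes and metric names below are illustrative, not a convention the libraries ship with:

```python
from typing import Optional

from opentelemetry import trace                    # pip install opentelemetry-api
from prometheus_client import Counter, Histogram   # pip install prometheus-client

tracer = trace.get_tracer("rag.cache")

CACHE_HITS = Counter("cache_hits_total", "Cache hits by tier", ["tier"])
NEAR_MISS_SIMILARITY = Histogram(
    "cache_near_miss_similarity",
    "Similarity of the best tier-2 candidate on a miss",
    buckets=[0.70, 0.75, 0.80, 0.82, 0.84, 0.86, 0.88, 0.90, 0.92, 0.95],
)

def record_lookup(tier: str, hit: bool, best_similarity: Optional[float],
                  matched_query: Optional[str], latency_saved_ms: float) -> None:
    # One span per cache lookup, with enough attributes to audit borderline cases.
    with tracer.start_as_current_span("cache.lookup") as span:
        span.set_attribute("cache.tier", tier)
        span.set_attribute("cache.hit", hit)
        if best_similarity is not None:
            span.set_attribute("cache.similarity", best_similarity)
        if matched_query is not None:
            span.set_attribute("cache.matched_query", matched_query)
        span.set_attribute("cache.latency_saved_ms", latency_saved_ms)
    if hit:
        CACHE_HITS.labels(tier=tier).inc()
    elif best_similarity is not None:
        # The near-miss distribution is the metric that drives threshold decisions.
        NEAR_MISS_SIMILARITY.observe(best_similarity)
```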

The interesting metric is not the hit rate. Hit rate is a vanity metric. The interesting metric is the distribution of similarity scores on near-misses. If you see a fat cluster of queries sitting at 0.82-0.84 just below your 0.85 threshold, those are paraphrases you're paying full price for when you shouldn't be. Lower the threshold or wire up a judge for that band.

Also: instrument tool call results separately from LLM responses. They have different repetition profiles. Tool calls repeat near-identically because agents are deterministic in argument generation. LLM responses repeat by paraphrase because users aren't. Caching them with the same configuration is leaving wins on the table.

The chatbot dashboard shows all of this live as you ask questions. Watching it in real time was the part of the weekend that flipped my mental model - the data made the original theory untenable in a way that just running queries through a notebook wouldn't have.

The Session State Curveball

There's a third thing worth caching that doesn't fit cleanly into either tier: agent session state. Conversation history, partial reasoning trace, intermediate scratchpad. Not really a cache in the usual sense - it's more a checkpoint. But it shares enough infrastructure that it usually lives in the same store.

The relevant detail: session state changes the cache key.

"What is Valkey?" asked in turn 1 of a fresh session is not the same query as "What is Valkey?" asked in turn 7 of a session where prior turns established that the user is asking about Dragonfly compatibility. The honest version of this stack treats conversation context as part of the cache key at the semantic tier.

You give up some hit rate for correctness. A tier 2 lookup that includes context matches fewer cached queries because the context tokens drag the embedding around. But it stops the worst class of collision - the cross-session bleed, where one user's question gets answered with another user's prior context.
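
One sketch of how to fold context into the key without storing whole transcripts - hash the recent turns into the tier 1 key and prefix them onto the text that gets embedded for tier 2 (the function names and the three-turn window are assumptions, not the library's behavior):

```python
import hashlib

def session_cache_key(query: str, history: list[str], window: int = 3) -> str:
    # Tier 1 key: the same normalized query under different recent context gets a
    # different key, so answers never bleed across sessions or conversation turns.
    context = "\n".join(history[-window:])
    digest = hashlib.sha256(context.encode("utf-8")).hexdigest()[:16]
    return f"{digest}:{' '.join(query.lower().split())}"

def embedding_input_with_context(query: str, history: list[str], window: int = 3) -> str:
    # Tier 2: embed the context-prefixed query. The context tokens drag the
    # embedding around, which is exactly the hit-rate cost described above.
    return "\n".join(history[-window:] + [query])
```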

If your AI product handles anything sensitive, you want this. The performance hit is real but the alternative is a privacy incident waiting to happen.

What This Changed for Me

I went into Saturday morning with a clean story about silent corruption and confident wrong answers. I came out Sunday night with a different and more useful one. If you remember three things from this article, remember these:

1. The dangerous failure mode is the boring one. On a modern embedding model, semantic caches don't silently lie to your users. They silently under-perform by missing legitimate paraphrases that sit in the 0.80-0.92 borderline band. Your bill stays higher than it should and you never know why. That's a much less satisfying story than cache poisoning, but it's the one that's actually costing you money in production right now.

2. Threshold tuning is one knob, not the answer. No single threshold separates legitimate paraphrases from edge cases cleanly. The interesting work isn't picking the right threshold - it's deciding what to do about the borderline band. Lower the threshold and accept some false positives, partition by entity, rewrite queries before embedding, or wire up an LLM as judge for the borderline cases. Pick whichever pencils out for your traffic shape and your tolerance for wrong answers.

3. The metric you should care about isn't hit rate. Hit rate is a vanity number. The metric that actually drives decisions is the distribution of similarity scores on near-misses. If you've got a fat cluster sitting at 0.83 just below your 0.85 threshold, those are paraphrases you're paying full freight for. Find them, decide what to do about them, instrument the decision, repeat.

Why the Weekend Mattered

The reason I keep emphasizing "weekend" isn't that I'm bragging about how fast I built something. It's that the weekend itself is the point.

If I'd written this article without building the chatbot, I would have shipped my original theory. The 0.97 cosine similarity collision. The cache silently lying. The dramatic version. It's a clean story, it sounds plausible, it would have gotten upvotes. It also would have been wrong, and a few hundred engineers would have walked away with a confident wrong belief about how text-embedding-3-small handles entity disambiguation.

The thing that prevented that wasn't intelligence or experience or skepticism. It was building the thing. Once you have a working cache with live observability and a script that can measure pairs in two minutes, you stop theorizing and start measuring. The dashboard shows you what's actually happening. The numbers don't care about your hypothesis.

One important caveat applies to everything I just wrote: all of this is one model, one weekend, one chatbot's worth of data. Older or weaker embedding models almost certainly behave differently. If you're running text-embedding-ada-002 or one of the smaller sentence-transformers, the entity-swap collision I expected to find might actually exist in your stack. The whole point of the exercise is that you should run your own pairs, not that you should trust mine.

The chatbot is still up at chat.betterdb.com if you want to play with it - the side panel shows every cache hit/miss, similarity score, and savings per turn. The libraries underneath (@betterdb/agent-cache and @betterdb/semantic-cache, MIT on npm and PyPI) wrap the pattern with adapters for OpenAI, Anthropic, Bedrock, LangChain, LangGraph, LlamaIndex, and Vercel AI SDK. OTel and Prometheus wired in by default. Fork it, point it at your own docs, run the similarity script against your real query distribution. The dashboard will tell you what's actually happening in your borderline band, and that's the only data that matters for tuning.

The Bottom Line

Caching for RAG is not the silent-corruption horror story I went in expecting. On a well-calibrated modern embedding model, entity disambiguation works. The actual problem is more boring and more fixable: paraphrases land in a narrow borderline band, your threshold is probably set too high to catch them, and you're paying for tier 3 retrievals you don't need to.

The tiered cache structure addresses the cost side. LLM-as-judge in the borderline band addresses the paraphrase miss side. Instrumentation tells you which problem you actually have. Session state in the cache key keeps you out of the cross-session-bleed hall of shame.

But the meta-takeaway from the weekend is simpler than any of that: build the thing, instrument it, measure it, then write about it. The dramatic story is almost always wrong. The boring measured story is almost always more useful.

What failure modes have you actually measured in your own RAG caching? Anyone running LLM-as-judge in the borderline band yet - does the latency hit pencil out? What's the gap between your assumed and actual cache hit rate? Drop your war stories in the comments - I'd love to know what people are seeing on different models.
