Most semantic cache benchmarks are a vendor showing you the one dataset where they win, on a model they finetuned, against a competitor they configured badly. You read it, you nod, you learn nothing.
I built and maintain a semantic cache library (@betterdb/semantic-cache on npm, betterdb-semantic-cache on PyPI, MIT, Valkey-native). So I had two choices. Write that post about my own library, or run the comparison straight and publish it even where I only tie. I did the second one. Four public datasets, two peers (RedisVL and Upstash), one self-tuning loop, and a fair amount of being wrong before being right.
There was no honest cross-library comparison of semantic caches anywhere I could find. So I made one. This is the short version. Links to the full tables and methodology are at the bottom.
1. Quality is a tie. That is the result you want.
Fix the embedding model and every honest semantic cache is doing the same thing: embed the prompt, measure cosine distance against stored prompts, return a hit when the nearest distance falls below a threshold. So peak F1 converges. There is no secret sauce in the lookup.
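Here is a minimal sketch of that lookup in Python. It is not the code of any library named in this post: `ToySemanticCache`, `cosine_distance`, and `FAKE_VECTORS` are hypothetical names, the vectors are hand-written stand-ins for a real embedding model, and the linear scan stands in for a vector index.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated
    return 1.0 - dot / (norm_a * norm_b)


class ToySemanticCache:
    """Hypothetical in-memory cache; a real one would query a vector index."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float) -> None:
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def store(self, prompt: str, response: str) -> None:
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt: str) -> Optional[str]:
        if not self.entries:
            return None
        query = self.embed(prompt)
        # Linear scan for the nearest stored prompt.
        distance, response = min(
            ((cosine_distance(query, vec), resp) for vec, resp in self.entries),
            key=lambda pair: pair[0],
        )
        # Hit only when the nearest neighbour is at or under the threshold.
        return response if distance <= self.threshold else None


# Hand-written stand-in vectors, not output from any real embedding model.
FAKE_VECTORS = {
    "how do I reset my password": [1.0, 0.0, 0.1],
    "how can I reset my password": [0.9, 0.1, 0.1],
    "what is the refund policy": [0.0, 1.0, 0.0],
}

cache = ToySemanticCache(embed=FAKE_VECTORS.__getitem__, threshold=0.20)
cache.store("how do I reset my password", "Use the reset link.")
print(cache.lookup("how can I reset my password"))  # near neighbour: hit
print(cache.lookup("what is the refund policy"))    # orthogonal vector: miss
```

Everything that differs between libraries lives outside these few lines, which is why the F1 numbers converge once the embedding model is fixed.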
I landed within roughly one percentage point of F1 against both RedisVL and Upstash, at each library's own optimal threshold, across STSb, SICK, PAWS-Wiki, and a real chatbot-prompt dataset from the vCache paper (ICLR 2026). The largest gap against RedisVL anywhere was 0.004 F1. That is noise.
This is the part vendor benchmarks bury, so let me say it plainly. Cache quality is bounded by the embedding model, not the library. Parity is the ceiling. Reaching it is the price of admission, not the win. Anyone showing you a big F1 lead on a fixed model is either using a model they tuned or a competitor they broke.
So if the lookup is solved, the cache you pick comes down to everything around the lookup. That is the rest of this post.
2. The threshold you copied is probably wrong. (Yes, yours.)
This was the finding that surprised me most, and I only saw it because I ran against two different engines.
Against RedisVL, same engine and same runtime, the similarity distributions were identical and so were the optimal thresholds. Boring. Expected.
Against Upstash they were not. Both adapters used bge-small-en-v1.5 by name. Same weights. But my local ONNX runtime and Upstash's server-side runtime produced different score distributions. Mine spread across [0, 0.50]. Theirs compressed into [0, 0.26]. The optimal threshold followed: 0.20 for me, 0.10 for them.
Read that again. Same model name, different runtime, different correct threshold. Neither is wrong. Each is right for its own runtime.
So a similarity threshold is not a constant you look up. It is a property of your embedding runtime, your data, and your traffic. Copy the number from a blog post (this one included) or a vendor default, and you are very likely running at the wrong cutoff. That is the entire argument for tuning the threshold to your own deployment instead of guessing it.
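To make that concrete, here is a small sketch of sweeping thresholds over labeled pairs and picking the one with the best F1. The distances below are invented for illustration, not taken from the benchmark; the second list is simply the first one halved, to mimic a runtime that compresses the same ranking into a narrower range. `f1_at_threshold` and `best_threshold` are hypothetical helper names.

```python
from typing import List, Sequence, Tuple


def f1_at_threshold(
    distances: Sequence[float], labels: Sequence[bool], threshold: float
) -> float:
    """F1 of 'hit if distance <= threshold' against true-duplicate labels."""
    tp = fp = fn = 0
    for distance, is_duplicate in zip(distances, labels):
        hit = distance <= threshold
        if hit and is_duplicate:
            tp += 1
        elif hit and not is_duplicate:
            fp += 1
        elif not hit and is_duplicate:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(
    distances: Sequence[float], labels: Sequence[bool], candidates: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, f1) for the candidate with the highest F1."""
    scored = [(t, f1_at_threshold(distances, labels, t)) for t in candidates]
    return max(scored, key=lambda pair: pair[1])


# Invented example data: first four pairs are true duplicates.
labels: List[bool] = [True, True, True, True, False, False, False, False]
wide = [0.02, 0.08, 0.13, 0.18, 0.24, 0.31, 0.40, 0.48]
compressed = [d * 0.5 for d in wide]  # same ordering, narrower range

candidates = [round(i * 0.05, 2) for i in range(1, 11)]  # 0.05 .. 0.50

print(best_threshold(wide, labels, candidates))        # best cutoff is 0.2
print(best_threshold(compressed, labels, candidates))  # best cutoff is 0.1
```

The ranking of pairs is identical in both lists, so achievable quality is the same. Only the cutoff moves, which is exactly why a copied number can be wrong for your runtime.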
3. Self-tuning earns its keep. The tuning was the easy part.
So I built a loop that does the tuning for you. The cache logs similarity scores to Valkey, the Monitor reads the distribution and proposes a threshold change with reasoning, a human approves (or doesn't, your call), and the running cache picks up the new value within a second. No restart, no redeploy.
On datasets where the starting threshold was bad, it gained real F1. Up to +2.8% on STSb by loosening. +2.1% on SICK. +2.9% on the chatbot set by tightening. Same code path, no per-dataset config, adapting in both directions.
Here is the honest part. The naive version of this loop destroys performance, and I know because I shipped it first. My first attempt tightened the threshold five times in a row chasing a signal that was actually noise, and dropped F1 from 0.57 to 0.49. It saw "distant hits," said "tighten," and kept saying it because the signal never went away.
The engineering that matters is not the tuning. It is the four safety mechanisms that stop the loop from tuning itself off a cliff:
- Signal-quality guards so it does not act on noise dressed up as signal.
- Outcome tracking so if the last adjustment did not help, it stops instead of doubling down.
- Velocity dampening so consecutive same-direction steps shrink and eventually stop.
- A recall-cost check so a tighten that would drop too many real hits gets blocked outright.
With those in place, the loop either improves on a static threshold or matches it exactly. In my benchmarks it never made things worse. "Do no harm" turned out to be the hardest requirement in the whole thing, and it is the one nobody writes a blog post about.
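The four guards above can be sketched as a single gate that every proposed adjustment has to pass. This is an illustrative sketch, not the Monitor's actual implementation: `Proposal`, `GuardedTuner`, every field name, and every default value are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Proposal:
    direction: int          # -1 = tighten, +1 = loosen
    sample_size: int        # how many logged scores back the proposal
    signal_strength: float  # 0..1, hypothetical confidence in the signal
    est_recall_loss: float  # share of current true hits a tighten would drop


@dataclass
class GuardedTuner:
    threshold: float
    base_step: float = 0.02        # all defaults are illustrative
    min_samples: int = 200
    min_signal: float = 0.6
    max_recall_loss: float = 0.05
    dampening: float = 0.5
    min_step: float = 0.004
    _last_direction: int = field(default=0, init=False)
    _streak: int = field(default=0, init=False)
    _last_outcome: Optional[bool] = field(default=None, init=False)

    def record_outcome(self, improved: bool) -> None:
        """Report whether the most recent adjustment helped."""
        self._last_outcome = improved

    def apply(self, p: Proposal) -> Optional[float]:
        """Return the new threshold, or None if any guard blocks the change."""
        # 1. Signal-quality guard: ignore thin or weak evidence.
        if p.sample_size < self.min_samples or p.signal_strength < self.min_signal:
            return None
        # 2. Outcome tracking: do not repeat a direction that just failed.
        if self._last_outcome is False and p.direction == self._last_direction:
            return None
        # 3. Recall-cost check: block tightens that drop too many real hits.
        if p.direction < 0 and p.est_recall_loss > self.max_recall_loss:
            return None
        # 4. Velocity dampening: same-direction steps shrink, then stop.
        streak = self._streak if p.direction == self._last_direction else 0
        step = self.base_step * (self.dampening ** streak)
        if step < self.min_step:
            return None
        self.threshold = round(self.threshold + p.direction * step, 6)
        self._streak = streak + 1
        self._last_direction = p.direction
        self._last_outcome = None  # wait for feedback on this change
        return self.threshold


tighten = Proposal(direction=-1, sample_size=500, signal_strength=0.9, est_recall_loss=0.01)
tuner = GuardedTuner(threshold=0.20)
print([tuner.apply(tighten) for _ in range(5)])
# [0.18, 0.17, 0.165, None, None]
```

With these made-up defaults, the same "tighten" signal repeated five times moves the threshold three times and then stops, instead of walking it down indefinitely the way the naive loop did.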
4. Latency is mostly a deployment story, not an algorithm story.
Two latency numbers, and they mean different things, so I am going to keep them separate instead of mashing them into one big scary multiple.
Against RedisVL, on the same engine: about 6x faster on repeated queries (around 0.57 ms vs 3.46 ms p50). This is real and it is a library difference. I cache prompt embeddings keyed by hash; RedisVL recomputes the embedding on every call, even if you sent the same prompt three seconds ago. Any workload with prompt repetition (chatbots, agent retry loops, FAQ-shaped traffic) gets this for free.
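The idea is small enough to sketch. The version below is an assumption-laden illustration, not the library's code: `EmbeddingMemo` is a hypothetical name, SHA-256 is just one reasonable choice of hash, and the in-process LRU eviction is my stand-in for whatever storage and eviction a real deployment uses.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EmbeddingMemo:
    """Hypothetical wrapper that skips re-embedding prompts it has seen."""

    def __init__(self, embed: Callable[[str], List[float]], max_entries: int = 10_000) -> None:
        self._embed = embed
        self._max_entries = max_entries
        self._store: "OrderedDict[str, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, prompt: str) -> List[float]:
        # Key by a hash of the exact prompt text (assumption: SHA-256).
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        vector = self._embed(prompt)      # the expensive call
        self._store[key] = vector
        if len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # evict least recently used
        return vector


calls: List[str] = []


def fake_embed(prompt: str) -> List[float]:
    """Stand-in for a real model; records each call so we can count them."""
    calls.append(prompt)
    return [float(len(prompt))]


memo = EmbeddingMemo(fake_embed)
memo.get("how do I reset my password")
memo.get("how do I reset my password")
print(len(calls), memo.hits, memo.misses)  # 1 1 1
```

Note that this only helps on byte-identical prompts. A paraphrase still pays for a fresh embedding, then goes through the normal vector lookup.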
Against Upstash: 48x to 136x faster. That number looks incredible and it is also not a fair fight. It is local Valkey against a cloud REST API. I am racing a process on localhost against a network round trip to another region. That is a deployment difference, not an algorithmic one, and I am not going to pretend otherwise.
The useful version of the claim: run the cache next to your app on Valkey you operate, and you get sub-millisecond lookups that a managed cloud vector API structurally cannot match, because the network hop is not something the vendor can optimize away. If zero ops footprint matters more to you than latency, the managed round trip is a perfectly reasonable trade. Just know which one you are buying.
5. Adversarial paraphrases are a wall for everyone.
On PAWS-Wiki, every adapter I tested plateaus around 61% F1. Mine, RedisVL's, Upstash's. All of them.
"Flights from NY to FL" and "flights from FL to NY" have nearly identical embeddings. No threshold separates them. No amount of tuning recovers signal the embedding never captured. I even threw an LLM judge at it; at default settings the uncertainty band only catches a handful of pairs, so it does not move aggregate F1.
This is not a BetterDB limit or a competitor limit. It is a property of cosine-distance caching. If your workload looks like PAWS, you need a different architecture (cross-encoder rerank, structural parsing, domain-specific embeddings), not a better cache. I would rather tell you that than hide it behind friendly datasets.
What the week actually argues
One point, told five ways: once everyone is at the quality ceiling, the cache you pick should be decided by what it does around the lookup, not by a fractional F1 delta.
For what it is worth, here is where I built deliberately, since the lookup was never going to be the differentiator:
- OpenTelemetry spans and Prometheus metrics emitted from the library, no wiring.
- Dollars-saved per cache hit, computed from a bundled price table (1,900+ models via LiteLLM), not a marketing counter.
- The self-tuning loop above, free, through the Monitor.
- MIT, runs on Valkey or any RESP-compatible endpoint, seven framework adapters, five embedding providers, no backend or vendor lock-in.
One honest caveat: the semantic cache needs valkey-search (Valkey 8+), so on stock ElastiCache or MemoryDB you would reach for the exact-match cache instead.
Reproduce it
Every number is reproducible. The harness, dataset loaders, and raw output are open source (Python, TypeScript). If I got something wrong, the issues tab is open.
The three deep-dives, with the full per-dataset tables:
The library is MIT: @betterdb/semantic-cache (npm), betterdb-semantic-cache (PyPI).
What benchmarks do you actually trust, and which ones do you assume are rigged until proven otherwise? Curious where other people draw that line.