Scaling LLMs: Why Deterministic Hashing Isn't Enough

Suraj Panda — Sat, 04 Jul 2026 09:55:13 +0000

After all the hype around tokenmaxxing, we have finally realised something that was hiding in plain sight: every LLM request comes at a cost. This becomes even more of a challenge when enterprises start taking their AI PoCs to production and first encounter system design’s most fundamental problem: scale. Repeated questions can quickly turn into a massive cloud bill.
But here is what I noticed along the way: customers rarely ask the same question with the exact same wording. For example, in an e-commerce support chatbot, the most common questions would be "Where is my order?", "Can you track my package?", "Has my shipment been dispatched yet?", and "When will my order arrive?". They look like different prompts, but they all seek the same information.
Great, so why not just cache the most requested information? Well, a traditional cache only works if the request is identical. Change a few words and it's a cache miss, resulting in another expensive LLM call.

The solution I came up with, and ended up building, is a Go library for semantic LLM caching that combines deterministic lookups with vector similarity search.

Link to my repo: https://github.com/Suraj370/semantic-cache-for-llm
The design ended up as two lookup paths that run in sequence on every call. First, a deterministic path: normalize the request, hash the messages and parameters with xxhash, wrap that into a UUIDv5, and do an O(1) point lookup. If that misses, fall through to a semantic path — embed the flattened conversation, run an ANN search (HNSW under Redis or Qdrant, GraphQL-backed search on Weaviate, KNN on Pinecone) filtered by tenant, model, and provider, and accept anything above a similarity threshold.
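To make the deterministic path concrete, here is a minimal sketch of the normalize, hash, and UUIDv5 steps. It is not the library's code: the `Request` and `Message` types, the field separators, and the choice of namespace are assumptions for illustration, and FNV-1a from the standard library stands in for xxhash so the example has no external dependencies.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"encoding/hex"
	"fmt"
	"hash/fnv"
	"strings"
)

// Message and Request are hypothetical shapes used only for this sketch.
type Message struct {
	Role    string
	Content string
}

type Request struct {
	Provider    string
	Model       string
	Temperature float64
	Messages    []Message
}

// normalize collapses runs of whitespace so formatting noise does not change the key.
func normalize(s string) string {
	return strings.Join(strings.Fields(s), " ")
}

// requestHash hashes the parameters and messages. FNV-1a stands in for xxhash
// here so the example needs only the standard library.
func requestHash(r Request) uint64 {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s\x00%s\x00%g\x00", r.Provider, r.Model, r.Temperature)
	for _, m := range r.Messages {
		fmt.Fprintf(h, "%s\x00%s\x00", m.Role, normalize(m.Content))
	}
	return h.Sum64()
}

// dnsNamespace is the RFC 4122 DNS namespace UUID, used here as an arbitrary
// fixed namespace (an assumption, not the library's choice).
var dnsNamespace = [16]byte{
	0x6b, 0xa7, 0xb8, 0x10, 0x9d, 0xad, 0x11, 0xd1,
	0x80, 0xb4, 0x00, 0xc0, 0x4f, 0xd4, 0x30, 0xc8,
}

// uuidV5 builds a name-based (SHA-1) UUID: hash namespace+name, keep 16 bytes,
// then set the version and variant bits.
func uuidV5(ns [16]byte, name []byte) string {
	h := sha1.New()
	h.Write(ns[:])
	h.Write(name)
	sum := h.Sum(nil)
	var u [16]byte
	copy(u[:], sum[:16])
	u[6] = (u[6] & 0x0f) | 0x50 // version 5
	u[8] = (u[8] & 0x3f) | 0x80 // RFC 4122 variant
	s := hex.EncodeToString(u[:])
	return s[0:8] + "-" + s[8:12] + "-" + s[12:16] + "-" + s[16:20] + "-" + s[20:32]
}

// CacheKey wraps the 64-bit request hash into a UUIDv5 usable as a point-lookup ID.
func CacheKey(r Request) string {
	var b [8]byte
	binary.BigEndian.PutUint64(b[:], requestHash(r))
	return uuidV5(dnsNamespace, b[:])
}

func main() {
	r := Request{
		Provider:    "openai",
		Model:       "gpt-4o",
		Temperature: 0.2,
		Messages:    []Message{{Role: "user", Content: "Where is my order?"}},
	}
	fmt.Println(CacheKey(r))
}
```

Whitespace-only differences in a message map to the same key in this sketch, while any change to the model, provider, temperature, or wording produces a different one.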
The direct path is the boring, load-bearing part. Change the temperature, change the model, change one word in the prompt, and you land in a completely different bucket. Short of a hash collision, there are no false positives, because it's just a hash. The semantic path is where the actual value is: it's what turns "Where is my order?" and "What's my order status?" into the same cache entry.
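The semantic path can be sketched in a few lines as well. The version below is an assumption-heavy stand-in: it uses a brute-force scan instead of an ANN index, and the `Entry` type and `Lookup` signature are hypothetical. What it does show is the two rules described above: filter by tenant, model, and provider first, then accept the best match only if it clears the similarity threshold.

```go
package main

import (
	"fmt"
	"math"
)

// Entry is a hypothetical cached item: metadata used for filtering,
// the embedding of the original prompt, and the stored response.
type Entry struct {
	Tenant, Model, Provider string
	Vector                  []float32
	Response                string
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 if the lengths differ or either vector is all zeros.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// Lookup scans every entry (a real backend would use an ANN index),
// skips entries whose metadata does not match, and returns the most
// similar entry whose score is at least the threshold.
func Lookup(entries []Entry, tenant, model, provider string, query []float32, threshold float64) (Entry, float64, bool) {
	best := -1
	bestScore := threshold
	for i, e := range entries {
		if e.Tenant != tenant || e.Model != model || e.Provider != provider {
			continue
		}
		if s := cosine(query, e.Vector); s >= bestScore {
			best, bestScore = i, s
		}
	}
	if best < 0 {
		return Entry{}, 0, false
	}
	return entries[best], bestScore, true
}

func main() {
	entries := []Entry{
		{Tenant: "acme", Model: "gpt-4o", Provider: "openai", Vector: []float32{1, 0, 0}, Response: "Your order shipped."},
	}
	e, score, ok := Lookup(entries, "acme", "gpt-4o", "openai", []float32{0.9, 0.1, 0}, 0.8)
	fmt.Println(ok, score, e.Response)
}
```

The metadata filter runs before any similarity comparison, so a near-identical prompt cached for another tenant or model is never considered.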
I also wanted this to be backend-agnostic from day one, so VectorStore is an interface, and Redis, Weaviate, Qdrant, and Pinecone all implement it. Same with embeddings — Embedder is one method and a dimension, so anything OpenAI-compatible (Azure, Ollama, Together) works without touching the cache logic.
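As a rough illustration of that separation, the sketch below defines two small interfaces and a cache that depends only on them. The names `VectorStore` and `Embedder` come from the library, but every method signature here is my guess for the sake of the example, and `memoryStore` and `byteEmbedder` are toy implementations, not real backends.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Embedder is "one method and a dimension". The exact signatures are assumed.
type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
	Dimension() int
}

// VectorStore is a hypothetical minimal method set; a real one would also search.
type VectorStore interface {
	Upsert(ctx context.Context, id string, vec []float32, payload string) error
	Get(ctx context.Context, id string) (payload string, found bool, err error)
}

type record struct {
	vec     []float32
	payload string
}

// memoryStore is a toy in-memory backend that satisfies VectorStore.
type memoryStore struct {
	mu    sync.RWMutex
	dim   int
	items map[string]record
}

func newMemoryStore(dim int) *memoryStore {
	return &memoryStore{dim: dim, items: make(map[string]record)}
}

func (s *memoryStore) Upsert(_ context.Context, id string, vec []float32, payload string) error {
	if len(vec) != s.dim {
		return errors.New("vector dimension mismatch")
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[id] = record{vec: vec, payload: payload}
	return nil
}

func (s *memoryStore) Get(_ context.Context, id string) (string, bool, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	r, ok := s.items[id]
	return r.payload, ok, nil
}

// byteEmbedder is a fake embedder that folds bytes into a fixed-size vector.
type byteEmbedder struct{ dim int }

func (e byteEmbedder) Embed(_ context.Context, text string) ([]float32, error) {
	vec := make([]float32, e.dim)
	for i := 0; i < len(text); i++ {
		vec[i%e.dim] += float32(text[i])
	}
	return vec, nil
}

func (e byteEmbedder) Dimension() int { return e.dim }

// Compile-time checks that the toy types satisfy the interfaces.
var (
	_ VectorStore = (*memoryStore)(nil)
	_ Embedder    = byteEmbedder{}
)

// Cache only knows about the interfaces, never a concrete backend.
type Cache struct {
	store VectorStore
	emb   Embedder
}

func (c *Cache) Put(ctx context.Context, id, prompt, response string) error {
	vec, err := c.emb.Embed(ctx, prompt)
	if err != nil {
		return err
	}
	return c.store.Upsert(ctx, id, vec, response)
}

// demo wires the toy implementations together and reads the entry back.
func demo() (string, bool, error) {
	ctx := context.Background()
	emb := byteEmbedder{dim: 8}
	c := &Cache{store: newMemoryStore(emb.Dimension()), emb: emb}
	if err := c.Put(ctx, "key-1", "Where is my order?", "It ships tomorrow."); err != nil {
		return "", false, err
	}
	return c.store.Get(ctx, "key-1")
}

func main() {
	fmt.Println(demo())
}
```

Swapping in another backend or embedding provider then means writing a new type that satisfies the interface, with no change to `Cache`.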

Three things fought me the whole way.
Cache key composition. It sounds trivial until you actually sit down and decide what belongs in the hash. Model name? Provider? Temperature? System prompt? I went back and forth on whether to include the system prompt in the hash at all — some teams rotate boilerplate instructions constantly and would never get a hit if I included it, others rely on the system prompt to change behavior entirely and would get wrong hits if I excluded it. I ended up making it configurable (ExcludeSystemPrompt) instead of picking a side, which felt like a cop-out at first but turned out to be the right call — different teams genuinely want different behavior here.
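A minimal sketch of what that switch might look like, assuming a `Config` struct and a `Key` function that are hypothetical here (only the `ExcludeSystemPrompt` name comes from the library), and using FNV-1a from the standard library in place of xxhash:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Message is a hypothetical chat message shape for this sketch.
type Message struct {
	Role    string
	Content string
}

// Config carries the one option discussed above.
type Config struct {
	ExcludeSystemPrompt bool
}

// Key hashes the model and messages, skipping system messages when
// ExcludeSystemPrompt is set. A zero byte separates fields so that
// "ab"+"c" and "a"+"bc" do not hash to the same input.
func Key(cfg Config, model string, msgs []Message) uint64 {
	h := fnv.New64a()
	h.Write([]byte(model))
	h.Write([]byte{0})
	for _, m := range msgs {
		if cfg.ExcludeSystemPrompt && m.Role == "system" {
			continue
		}
		h.Write([]byte(m.Role))
		h.Write([]byte{0})
		h.Write([]byte(m.Content))
		h.Write([]byte{0})
	}
	return h.Sum64()
}

func main() {
	v1 := []Message{{Role: "system", Content: "Be concise."}, {Role: "user", Content: "Where is my order?"}}
	v2 := []Message{{Role: "system", Content: "Be concise and polite."}, {Role: "user", Content: "Where is my order?"}}

	strict := Config{ExcludeSystemPrompt: false}
	loose := Config{ExcludeSystemPrompt: true}

	fmt.Println("strict keys equal:", Key(strict, "gpt-4o", v1) == Key(strict, "gpt-4o", v2))
	fmt.Println("loose keys equal: ", Key(loose, "gpt-4o", v1) == Key(loose, "gpt-4o", v2))
}
```

With the option off, a reworded system prompt lands in a new bucket. With it on, only the remaining messages and the model decide the key.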
Concurrency without leaking. The write-back is async by design — you get your cache miss immediately, call the real LLM, and the store happens in the background so you're not blocking the response on a vector DB write. That's simple to say and annoying to get right. I ended up with three separate background goroutines doing different jobs: async write workers, a reaper for per-request cache state (60-minute TTL), and a separate reaper for stream accumulators (5-minute idle TTL, because a client that starts streaming and disappears shouldn't leak memory forever). Getting WaitForPendingOps() and Close() to actually drain everything cleanly, instead of dropping writes on shutdown, took more debugging than I want to admit.
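The write-worker part of this can be sketched roughly as follows. This is a simplified illustration, not the library's implementation: the `AsyncWriter` type, its constructor, and the method signatures are assumptions (only the names `WaitForPendingOps` and `Close` appear in the library), and the reapers are left out.

```go
package main

import (
	"fmt"
	"sync"
)

type writeOp struct{ key, value string }

// AsyncWriter is a hypothetical background writer: callers enqueue and
// return, workers perform the slow store call.
type AsyncWriter struct {
	ops     chan writeOp
	store   func(key, value string)
	mu      sync.Mutex     // guards closed and orders Add against Wait
	closed  bool
	pending sync.WaitGroup // one count per queued write
	workers sync.WaitGroup // one count per worker goroutine
}

func NewAsyncWriter(workers, queue int, store func(key, value string)) *AsyncWriter {
	w := &AsyncWriter{ops: make(chan writeOp, queue), store: store}
	w.workers.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer w.workers.Done()
			for op := range w.ops { // exits once the channel is closed and drained
				w.store(op.key, op.value)
				w.pending.Done()
			}
		}()
	}
	return w
}

// Enqueue queues a write and reports false if the writer is already closed.
// It applies back-pressure (blocks) when the queue is full rather than dropping.
func (w *AsyncWriter) Enqueue(key, value string) bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.closed {
		return false
	}
	w.pending.Add(1)
	w.ops <- writeOp{key, value}
	return true
}

// WaitForPendingOps blocks until every write queued so far has been stored.
func (w *AsyncWriter) WaitForPendingOps() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending.Wait()
}

// Close stops accepting writes, lets workers drain the queue, and waits for them.
func (w *AsyncWriter) Close() {
	w.mu.Lock()
	if !w.closed {
		w.closed = true
		close(w.ops)
	}
	w.mu.Unlock()
	w.workers.Wait()
}

func main() {
	stored := make(chan string, 16)
	w := NewAsyncWriter(2, 4, func(key, _ string) { stored <- key })
	for i := 0; i < 10; i++ {
		w.Enqueue(fmt.Sprintf("key-%d", i), "response")
	}
	w.Close()
	fmt.Println("writes stored before shutdown returned:", len(stored))
}
```

The design choice worth noting is that `Close` closes the channel instead of cancelling the workers, so anything already queued is still written before shutdown returns.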
Streaming. Caching a single JSON blob is easy. Caching a stream of chunks and replaying them later so the client experience looks identical to a live stream — same pacing expectations, same chunk boundaries — is not. StoreStream and the chunk replay logic went through a few rewrites before it stopped feeling like a hack bolted onto the non-streaming path.
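To give a feel for the shape of the problem, here is a small sketch of accumulating chunks during a live stream and replaying them later with the same boundaries. The `StreamCache` type and its `Append`, `Finish`, and `Replay` methods are hypothetical names for this example and are not the library's `StoreStream` API. The fixed inter-chunk delay is also a simplification of whatever pacing a real replay would use.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// StreamCache keeps in-flight chunks per request and completed streams per cache key.
type StreamCache struct {
	mu       sync.Mutex
	inflight map[string][]string
	done     map[string][]string
}

func NewStreamCache() *StreamCache {
	return &StreamCache{inflight: map[string][]string{}, done: map[string][]string{}}
}

// Append records one chunk of a live stream, preserving its boundary.
func (c *StreamCache) Append(requestID, chunk string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.inflight[requestID] = append(c.inflight[requestID], chunk)
}

// Finish moves a completed stream from the in-flight map to the cache.
func (c *StreamCache) Finish(requestID, key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if chunks, ok := c.inflight[requestID]; ok {
		c.done[key] = chunks
		delete(c.inflight, requestID)
	}
}

// Replay emits the cached chunks in order on a channel, waiting delay between
// chunks. Closing stop abandons the replay; a nil stop never fires.
func (c *StreamCache) Replay(key string, delay time.Duration, stop <-chan struct{}) (<-chan string, bool) {
	c.mu.Lock()
	cached, ok := c.done[key]
	chunks := append([]string(nil), cached...) // copy so replay does not share state
	c.mu.Unlock()
	if !ok {
		return nil, false
	}
	out := make(chan string)
	go func() {
		defer close(out)
		for i, ch := range chunks {
			if i > 0 && delay > 0 {
				select {
				case <-time.After(delay):
				case <-stop:
					return
				}
			}
			select {
			case out <- ch:
			case <-stop:
				return
			}
		}
	}()
	return out, true
}

func main() {
	c := NewStreamCache()
	for _, ch := range []string{"Your order ", "shipped ", "yesterday."} {
		c.Append("req-1", ch)
	}
	c.Finish("req-1", "cache-key-1")

	if stream, ok := c.Replay("cache-key-1", 10*time.Millisecond, nil); ok {
		for ch := range stream {
			fmt.Printf("%q\n", ch)
		}
	}
}
```

An accumulator that never reaches `Finish` stays in `inflight` in this sketch, which is exactly the leak the idle reaper mentioned above exists to clean up.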
based on the query, improving semantic cache hit rates while minimizing incorrect matches.
Don't skip the direct path just because the semantic path is the interesting part. I almost did, early on — it felt like a formality compared to the ANN search. It's not. It's the thing that makes the semantic path safe to trust, because it guarantees you never get a false hit on parameters that actually matter (model, temperature, provider). The fuzzy matching is the feature people notice; the exact matching is the reason they can rely on it.
If you want to see the whole thing, the code and the observability setup (Prometheus metrics, a Grafana dashboard) are in the repo. I'm still finding edge cases in the threshold tuning — 0.8 cosine similarity is a reasonable default, but "reasonable" and "right for your traffic" are not the same thing, and I don't think there's a way around testing that yourself.
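One way to approach that testing is to replay labeled traffic against a few candidate thresholds and count what each one would accept. The sketch below assumes you have pairs of (similarity score, human judgment of whether the cached answer was actually appropriate). The `LabeledPair` and `Result` types are hypothetical, and the numbers in `main` are made-up placeholders, not measurements.

```go
package main

import "fmt"

// LabeledPair is one observed lookup: the similarity of the best cached
// candidate and whether a human judged it to have the same intent.
type LabeledPair struct {
	Score      float64
	SameIntent bool
}

// Result counts outcomes at one threshold.
type Result struct {
	Threshold  float64
	Hits       int // accepted and correct
	FalseHits  int // accepted but wrong: the dangerous case
	MissedHits int // rejected although the cached answer was fine
}

// Evaluate applies the same accept rule as the cache (score >= threshold).
func Evaluate(pairs []LabeledPair, threshold float64) Result {
	r := Result{Threshold: threshold}
	for _, p := range pairs {
		accepted := p.Score >= threshold
		switch {
		case accepted && p.SameIntent:
			r.Hits++
		case accepted && !p.SameIntent:
			r.FalseHits++
		case !accepted && p.SameIntent:
			r.MissedHits++
		}
	}
	return r
}

func main() {
	// Placeholder data for illustration only.
	pairs := []LabeledPair{
		{Score: 0.95, SameIntent: true},
		{Score: 0.85, SameIntent: false},
		{Score: 0.82, SameIntent: true},
		{Score: 0.60, SameIntent: false},
	}
	for _, t := range []float64{0.8, 0.85, 0.9} {
		fmt.Printf("%+v\n", Evaluate(pairs, t))
	}
}
```

Raising the threshold trades missed hits for fewer false hits, and where that trade lands depends entirely on your own traffic.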

DEV Community: Suraj Panda
