<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Suraj Panda</title>
    <description>The latest articles on DEV Community by Suraj Panda (@panda_suraj).</description>
    <link>https://dev.to/panda_suraj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4014839%2F65af8bec-3ab1-4965-9d69-cdf3153e0513.png</url>
      <title>DEV Community: Suraj Panda</title>
      <link>https://dev.to/panda_suraj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/panda_suraj"/>
    <language>en</language>
    <item>
      <title>Scaling LLMs: Why Deterministic Hashing Isn't Enough</title>
      <dc:creator>Suraj Panda</dc:creator>
      <pubDate>Sat, 04 Jul 2026 09:55:13 +0000</pubDate>
      <link>https://dev.to/panda_suraj/scaling-llms-why-deterministic-hashing-isnt-enough-3pk4</link>
      <guid>https://dev.to/panda_suraj/scaling-llms-why-deterministic-hashing-isnt-enough-3pk4</guid>
      <description>&lt;p&gt;After all the hype around tokenmaxxing, we have finally realised something that was hiding in plain sight: every LLM request comes at a cost. This becomes even more of a challenge when enterprises start taking their AI PoCs to production and first encounter system design’s most fundamental problem: scale. Repeated questions can quickly turn into a massive cloud bill.&lt;br&gt;
But in my learning journey, I have noticed this: customers rarely ask the same question with the exact same wording. For example, in an e-commerce support chatbot, the most common questions would be: "Where is my order?", "Can you track my package?", "Has my shipment been dispatched yet?", "When will my order arrive?". These look like different prompts, but they all seek the same information.&lt;br&gt;
Great, so why not just cache the most requested information? Well, a traditional cache only works if the request is identical. Change a few words and it's a cache miss, resulting in another expensive LLM call.&lt;/p&gt;

&lt;p&gt;The solution I came up with, and ended up building, is a Go library for semantic LLM caching that combines deterministic lookups with vector similarity search.&lt;/p&gt;

&lt;p&gt;Link to my repo: &lt;a href="https://github.com/Suraj370/semantic-cache-for-llm" rel="noopener noreferrer"&gt;https://github.com/Suraj370/semantic-cache-for-llm&lt;/a&gt;&lt;br&gt;
The design ended up as two lookup paths that run in sequence on every call. First, a deterministic path: normalize the request, hash the messages and parameters with xxhash, wrap that into a UUIDv5, and do an O(1) point lookup. If that misses, fall through to a semantic path — embed the flattened conversation, run an ANN search (HNSW under Redis or Qdrant, GraphQL-backed search on Weaviate, KNN on Pinecone) filtered by tenant, model, and provider, and accept anything above a similarity threshold.&lt;br&gt;
The direct path is the boring, load-bearing part. Change the temperature, change the model, change one word in the prompt, and you land in a completely different bucket. Barring a hash collision, that means no false positives, because it's just a hash. The semantic path is where the actual value is: it's what turns "Where is my order?" and "What's my order status?" into the same cache entry.&lt;br&gt;
I also wanted this to be backend-agnostic from day one, so VectorStore is an interface, and Redis, Weaviate, Qdrant, and Pinecone all implement it. Same with embeddings — Embedder is one method and a dimension, so anything OpenAI-compatible (Azure, Ollama, Together) works without touching the cache logic.&lt;/p&gt;
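&lt;p&gt;To make the lookup order concrete, here is a minimal sketch of the two paths described above, not the library's actual code. The interface shapes, the Entry type, the in-memory store, and the word-counting toyEmbedder are all assumptions made up for illustration. The exact key is a plain string standing in for the hashed key, and a brute-force scan stands in for a real ANN index.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Embedder, Entry and VectorStore are hypothetical shapes for this sketch.
type Embedder interface {
	Embed(text string) []float64
	Dimension() int
}

type Entry struct {
	Vector          []float64
	Model, Response string
}

type VectorStore interface {
	Get(key string) (Entry, bool)
	Nearest(vec []float64, model string) (Entry, float64)
}

// memStore is a brute-force, in-memory stand-in for a real ANN backend.
type memStore map[string]Entry

func (s memStore) Get(key string) (Entry, bool) {
	e, ok := s[key]
	return e, ok
}

// Nearest returns the best match for the model, or score -1 if none exists.
func (s memStore) Nearest(vec []float64, model string) (Entry, float64) {
	var best Entry
	bestScore := -1.0
	for _, e := range s {
		if e.Model != model {
			continue // metadata filter, standing in for tenant/model/provider
		}
		if score := cosine(vec, e.Vector); score > bestScore {
			best, bestScore = e, score
		}
	}
	return best, bestScore
}

// cosine assumes both vectors have the same dimension.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// toyEmbedder counts vocabulary words; it is not a real embedding model.
type toyEmbedder struct{ vocab []string }

func (t toyEmbedder) Dimension() int { return len(t.vocab) }
func (t toyEmbedder) Embed(text string) []float64 {
	vec := make([]float64, len(t.vocab))
	for _, w := range strings.Fields(strings.ToLower(text)) {
		w = strings.Trim(w, "?.,!")
		for i, v := range t.vocab {
			if w == v {
				vec[i]++
			}
		}
	}
	return vec
}

// Lookup tries the exact key first, then falls back to similarity search.
func Lookup(s VectorStore, emb Embedder, key, prompt, model string, threshold float64) (string, string) {
	if e, ok := s.Get(key); ok {
		return e.Response, "direct"
	}
	if e, score := s.Nearest(emb.Embed(prompt), model); score >= threshold {
		return e.Response, "semantic"
	}
	return "", "miss"
}

func main() {
	emb := toyEmbedder{vocab: []string{"order", "package", "refund"}}
	q := "Where is my order?"
	store := memStore{"m1|" + q: {Vector: emb.Embed(q), Model: "m1", Response: "It ships tomorrow."}}
	for _, p := range []string{q, "What's my order status?", "Can I get a refund?"} {
		resp, path := Lookup(store, emb, "m1|"+p, p, "m1", 0.8)
		fmt.Printf("%s: %q\n", path, resp)
	}
}
```

&lt;p&gt;With this toy vocabulary, the first prompt is served by the direct path, the reworded order question by the semantic path, and the refund question is a miss. Because Lookup only sees the two interfaces, swapping the in-memory store for a real backend should not change it.&lt;/p&gt;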

&lt;p&gt;Three things fought me the whole way.&lt;br&gt;
Cache key composition. It sounds trivial until you actually sit down and decide what belongs in the hash. Model name? Provider? Temperature? System prompt? I went back and forth on whether to include the system prompt in the hash at all — some teams rotate boilerplate instructions constantly and would never get a hit if I included it, others rely on the system prompt to change behavior entirely and would get wrong hits if I excluded it. I ended up making it configurable (ExcludeSystemPrompt) instead of picking a side, which felt like a cop-out at first but turned out to be the right call — different teams genuinely want different behavior here.&lt;br&gt;
Concurrency without leaking. The write-back is async by design — you get your cache miss immediately, call the real LLM, and the store happens in the background so you're not blocking the response on a vector DB write. That's simple to say and annoying to get right. I ended up with three separate background goroutines doing different jobs: async write workers, a reaper for per-request cache state (60-minute TTL), and a separate reaper for stream accumulators (5-minute idle TTL, because a client that starts streaming and disappears shouldn't leak memory forever). Getting WaitForPendingOps() and Close() to actually drain everything cleanly, instead of dropping writes on shutdown, took more debugging than I want to admit.&lt;br&gt;
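&lt;/p&gt;

&lt;p&gt;The shutdown problem can be sketched in miniature. This is a deliberately simplified illustration, not how the library does it: it starts one goroutine per write and tracks them with a sync.WaitGroup instead of running a worker pool, and it leaves out both reapers. The Cache type, StoreAsync, Len, and ErrClosed are hypothetical names. Close mirrors the idea of draining pending writes before shutdown.&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrClosed is returned for writes attempted after Close.
var ErrClosed = errors.New("cache closed")

// Cache is a hypothetical miniature, not the library's type.
type Cache struct {
	mu      sync.Mutex
	data    map[string]string
	closed  bool
	pending sync.WaitGroup
}

func NewCache() *Cache {
	c := new(Cache)
	c.data = make(map[string]string)
	return c
}

// StoreAsync returns immediately; the write lands in the background.
func (c *Cache) StoreAsync(key, value string) error {
	c.mu.Lock()
	if c.closed {
		c.mu.Unlock()
		return ErrClosed
	}
	c.pending.Add(1) // registered under the lock, so Close cannot miss it
	c.mu.Unlock()

	go func() {
		defer c.pending.Done()
		time.Sleep(10 * time.Millisecond) // pretend vector DB latency
		c.mu.Lock()
		c.data[key] = value
		c.mu.Unlock()
	}()
	return nil
}

// Close rejects new writes, then blocks until in-flight writes finish.
func (c *Cache) Close() {
	c.mu.Lock()
	c.closed = true
	c.mu.Unlock()
	c.pending.Wait()
}

// Len reports how many entries have been written so far.
func (c *Cache) Len() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return len(c.data)
}

func main() {
	c := NewCache()
	for _, k := range []string{"q1", "q2", "q3"} {
		_ = c.StoreAsync(k, "cached answer for "+k)
	}
	c.Close()                              // waits for all three writes
	fmt.Println(c.Len())                   // 3
	fmt.Println(c.StoreAsync("late", "x")) // cache closed
}
```

&lt;p&gt;The detail that matters is ordering: the closed flag and the Add call share one mutex, so a write is either counted before Close starts waiting or rejected outright. Nothing is dropped silently in between.&lt;/p&gt;

&lt;p&gt;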
Streaming. Caching a single JSON blob is easy. Caching a stream of chunks and replaying them later so the client experience looks identical to a live stream — same pacing expectations, same chunk boundaries — is not. StoreStream and the chunk replay logic went through a few rewrites before it stopped feeling like a hack bolted onto the non-streaming path.&lt;br&gt;
based on the query, improving semantic cache hit rates while minimizing incorrect matches.&lt;br&gt;
Don't skip the direct path just because the semantic path is the interesting part. I almost did, early on — it felt like a formality compared to the ANN search. It's not. It's the thing that makes the semantic path safe to trust, because it guarantees you never get a false hit on parameters that actually matter (model, temperature, provider). The fuzzy matching is the feature people notice; the exact matching is the reason they can rely on it.&lt;br&gt;
If you want to see the whole thing, the code and the observability setup (Prometheus metrics, a Grafana dashboard) are in the repo. I'm still finding edge cases in the threshold tuning — 0.8 cosine similarity is a reasonable default, but "reasonable" and "right for your traffic" are not the same thing, and I don't think there's a way around testing that yourself.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
