Cloudflare AI Gateway is excellent at what it does. But it has one fundamental limitation that quietly costs teams building on LLMs thousands of dollars every month. I reverse-engineered exactly why, and then built the solution entirely on Cloudflare's own platform. This is that story.
Cloudflare AI Gateway Is Good. But There Is a Gap.
If you're routing LLM traffic through Cloudflare AI Gateway, you're already ahead. You get caching, rate limiting, retries, analytics, provider fallback, and a universal endpoint for OpenAI, Anthropic, Gemini, and more, all from one proxy.
But there is a gap, and once you see it, you can't unsee it.
By default, Cloudflare AI Gateway caches on exact request matches.
Cloudflare does provide a cf-aig-cache-key header that lets callers override the cache key. That helps if you know exactly which fields to exclude and your app consistently excludes them. But it still requires your application to pre-solve the problem, to know in advance which fields are noise, which are identity-critical, and which are freshness-sensitive, and to encode that logic correctly on every request, in every client. There is no canonicalization engine, no semantic equivalence detection, no volatility classification, no burst coordination, and no response safety validation. The key customization is a sharp tool, not a complete solution.
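To make that concrete, here is a minimal sketch of what "pre-solving the problem" in the client looks like. The `cf-aig-cache-key` header name comes from Cloudflare's AI Gateway documentation; the `stableCacheKey` helper and the key format are illustrative assumptions, and every client must replicate this logic identically.

```typescript
// Illustrative sketch: derive a stable cache key from only the fields that
// determine the answer (model + messages), ignoring request noise.
// The cf-aig-cache-key header is Cloudflare's documented override; the
// key-derivation logic is our own and must be duplicated in every client.
function stableCacheKey(
  model: string,
  messages: { role: string; content: string }[],
): string {
  // Whitespace collapsed, field order fixed by construction.
  const core = messages.map((m) => `${m.role}:${m.content.trim().replace(/\s+/g, " ")}`);
  return `${model}|${core.join("|")}`;
}

// Usage against an AI Gateway endpoint (URL shape omitted, body unchanged):
// await fetch(gatewayUrl, {
//   method: "POST",
//   headers: {
//     "content-type": "application/json",
//     "cf-aig-cache-key": stableCacheKey(body.model, body.messages),
//   },
//   body: JSON.stringify(body),
// });
```

If any one client forgets to set the header, or derives it differently, those requests silently fall back to exact matching.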
The gap isn't in Cloudflare AI Gateway's design. A general-purpose proxy correctly avoids making assumptions about what your prompts mean. The gap is in the layer that sits in front of it: the layer that understands prompt semantics, strips noise deterministically, detects near-duplicate intent, coordinates burst traffic, and validates responses before storage.
That layer didn't exist as a deployable, production-safe service. So I built it on Cloudflare.
The Problem, Quantified
Let's be specific about what "exact matching misses semantically equivalent requests" costs.
Here's a real production LLM request shape that many support bots generate:
POST /v1/chat/completions
{
"model": "gpt-4o",
"requestId": "3f8a-9c2b-...", ← UUID. Different every time.
"traceId": "span-00f2c...", ← Different every time.
"timestamp": "2026-03-07T14:22:03Z", ← Different every second.
"sessionWrapper": {
"sessionId": "sess-xyz-789", ← Different per user session.
"clientVersion": "1.2.3"
},
"messages": [
{
"role": "user",
"content": "How do I reset my password?" ← Two spaces. Typo.
}
]
}
The semantically meaningful content of this request is:
"How do I reset my password?"
That's it. Everything else is noise.
But Cloudflare AI Gateway, and every other exact-match cache, sees a completely unique request every single time this gets sent. Because requestId is always different. Because timestamp is always different.
Result: every one of these requests hits your LLM provider. Each one costs tokens. Each one adds 500–900ms of latency that your users experience.
Now multiply this pattern across your entire production traffic. Most teams building mature LLM products find that 30–70% of their provider calls are paying for responses they've already generated.
The Insight: Most Cache Misses Aren't Real Misses
When I reverse-engineered real LLM traffic, I categorized every field that causes a cache miss even when the intent, and the ideal response, is identical:
| Field type | Examples | Effect on cache |
|---|---|---|
| Entropy | `requestId`, `traceId`, `timestamp`, `correlationId` | False miss: varies but doesn't change meaning |
| Session wrappers | `sessionId`, `clientVersion`, metadata envelopes | False miss: wrapper noise around identical content |
| Whitespace variation | Extra spaces, different line endings | False miss: same content, different bytes |
| Phrasing variation | "reset my password" vs "I want to reset my password" | False miss: same intent, different words |
| Identity markers | `userId`, `accountId` | True miss when answer is user-specific |
| Freshness language | "today", "latest", "current balance" | True miss: answer depends on current state |
The first four categories are false misses. The last two are true misses.
An exact-match cache treats all six the same. A smarter layer should treat them very differently.
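The false-miss categories in the table can be stripped deterministically. The sketch below shows the idea, using the noise field names from the article's example request; a real implementation would drive the field lists from per-tenant configuration and hash the result with SHA-256.

```typescript
// Sketch of deterministic canonicalization. Field names match the article's
// example request; the lists are illustrative, not exhaustive.
const ENTROPY_FIELDS = new Set(["requestId", "traceId", "timestamp", "correlationId"]);
const WRAPPER_FIELDS = new Set(["sessionWrapper", "sessionId", "clientVersion"]);

function canonicalize(body: Record<string, unknown>): string {
  const kept: Record<string, unknown> = {};
  // Sorted key order makes the output independent of client serialization order.
  for (const key of Object.keys(body).sort()) {
    if (ENTROPY_FIELDS.has(key) || WRAPPER_FIELDS.has(key)) continue; // false-miss noise
    kept[key] = body[key];
  }
  // Collapse whitespace variation so identical content hashes identically.
  const messages = kept["messages"] as { role: string; content: string }[] | undefined;
  if (messages) {
    kept["messages"] = messages.map((m) => ({
      role: m.role,
      content: m.content.trim().replace(/\s+/g, " "),
    }));
  }
  return JSON.stringify(kept); // in production, SHA-256 this string for the cache key
}
```

Note that this handles the first three false-miss categories; phrasing variation needs the semantic layer described below, and identity/freshness detection belongs to the policy engine.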
Introducing PromptKernel: The Intelligence Layer Above AI Gateway
I was building an internal support tool for a SaaS product, a knowledge-base assistant that answered customer questions using GPT-4o. Early on, the bill made sense: we were in beta, traffic was low, and every question felt genuinely new. But three months after launch, something was wrong. We had roughly 200 distinct questions that our users ever really asked. The product was mature. The patterns were obvious. Yet the OpenAI bill kept climbing, sitting at around $1,400 a month and showing no sign of plateauing. I pulled the logs and started counting. Out of every 100 requests, maybe 6 or 7 were actually unique questions we had never seen before. The other 93 were variations of questions we had answered dozens of times, same intent, different wording, different request IDs, different timestamps, different session wrappers. The cache was missing all of them. We were paying full inference cost, at full latency, for answers we already had.
That is when I started reverse-engineering the actual structure of our traffic. I took 500 consecutive cache misses, stripped every request down to just the user message, and started grouping them by meaning rather than by bytes. The pattern was immediate and alarming: 89% of those "unique" requests mapped to fewer than 30 distinct intents. The same 30 questions, asked in hundreds of different surface forms. What looked like a diverse stream of user queries was actually a very shallow pool of repeated intent wrapped in an ever-changing layer of noise, request IDs, trace IDs, session wrappers, timestamp metadata, whitespace variation, and slight phrasing differences. Every one of those noise fields was enough to produce a cache miss. None of them changed what the correct answer was. Once I had the pattern, the problem statement wrote itself: the cache was working exactly as designed, it just had no idea what the requests meant. So I built the layer that does. After deploying it, our monthly provider cost dropped from $1,400 to just under $100, roughly one-tenth, and response latency for cached traffic fell from an average of 800ms to under 20ms. That layer became PromptKernel.
PromptKernel is a Cloudflare Worker that runs as a reverse proxy in front of your LLM provider traffic. It sits in front of where Cloudflare AI Gateway would sit, or can sit alongside it.
Integration requires three new headers and an endpoint change. Everything else (your provider request body, your response format, your provider API key) stays the same.
# Before
POST https://api.openai.com/v1/chat/completions
# After
POST https://your-worker.workers.dev/v1/infer
x-scc-tenant-id: acme-corp
x-scc-app-id: support-bot
x-scc-env: production
# body stays exactly the same
Why We Built This on Cloudflare's Own Platform
This is the part of the story that we find most interesting to tell.
When we mapped out what the intelligence layer needed to do, we realized Cloudflare had already built every primitive we needed. We just needed to wire them together correctly.
Here's the mapping:
Cloudflare Worker: The Pipeline Engine
The entire 12-step request pipeline runs inside a Cloudflare Worker. That means it runs at the edge, geographically close to your users. Cache lookups, policy evaluation, and embedding generation all happen before the request ever leaves your region. Workers gave us V8 isolation, TypeScript support, zero cold-start time, and global deployment with no infrastructure management.
Cloudflare KV: Exact Cache Storage
After canonicalization strips noise and produces a deterministic hash, that hash becomes a Cloudflare KV key. KV is globally replicated, TTL-aware, and optimized for read-heavy workloads, exactly the right shape for a cache that gets written once and read many times. Exact hits return in ~8ms.
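A read-through sketch of that flow is below. The `get`/`put` methods with `expirationTtl` mirror the real Workers KV binding API; the Map-backed stub and the `exactLookup` helper are illustrative so the flow is runnable outside the Workers runtime.

```typescript
// Exact-cache sketch. KV-like interface mirrors the Workers KV binding;
// MemoryKV is a stand-in for local experimentation.
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

class MemoryKV implements KVLike {
  private store = new Map<string, string>();
  async get(key: string) { return this.store.get(key) ?? null; }
  async put(key: string, value: string) { this.store.set(key, value); }
}

// Read-through: serve the cached response on a hit; otherwise call the
// provider once and write the result back with a TTL.
async function exactLookup(
  kv: KVLike,
  canonicalHash: string,
  forward: () => Promise<string>,
): Promise<{ source: "kv" | "provider"; body: string }> {
  const hit = await kv.get(canonicalHash);
  if (hit !== null) return { source: "kv", body: hit };
  const body = await forward();
  await kv.put(canonicalHash, body, { expirationTtl: 3600 }); // 1h TTL, illustrative
  return { source: "provider", body };
}
```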
Cloudflare Vectorize: Semantic Similarity Search
This is where the capability gap gets bridged. On an exact cache miss, PromptKernel embeds the semantic projection of the request into a 128-dimensional vector and queries Cloudflare Vectorize. Vectorize returns the most similar past requests within a strict namespace boundary.
We're using Vectorize for something beyond the typical toy demo. The metadata filtering system, which filters by tenant, app, environment, request class, model group, normalization version, and embedding version simultaneously, makes semantic search production-safe. Without those filters, semantic cache hits would cross tenant boundaries. With them, the search is constrained to exactly the right namespace.
Semantic hits return in ~25ms.
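The query shape (a vector plus `topK` and a metadata filter) mirrors Vectorize's API. The in-memory index below is a stand-in so the namespace-isolation behavior is runnable here; the metadata field names match the article's filter dimensions.

```typescript
// Semantic-lookup sketch. MemoryIndex stands in for Cloudflare Vectorize;
// the filter-then-rank behavior is what makes semantic search tenant-safe.
type Meta = Record<string, string>;
interface Entry { id: string; values: number[]; metadata: Meta }

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class MemoryIndex {
  private entries: Entry[] = [];
  insert(e: Entry) { this.entries.push(e); }
  // Filter on metadata first (tenant/app/env/...), then rank by similarity,
  // so a perfect vector match outside the namespace can never surface.
  query(vector: number[], opts: { topK: number; filter: Meta }) {
    return this.entries
      .filter((e) => Object.entries(opts.filter).every(([k, v]) => e.metadata[k] === v))
      .map((e) => ({ id: e.id, score: cosine(vector, e.values) }))
      .sort((x, y) => y.score - x.score)
      .slice(0, opts.topK);
  }
}

const SIMILARITY_THRESHOLD = 0.95; // illustrative; the article's example hit scored 0.97
```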
Cloudflare Durable Objects: Distributed Coordination
This is the most sophisticated use of Cloudflare's platform in the stack, and it solves a problem that no other caching layer addresses: the cache stampede.
When 50 users send the same question simultaneously, they all miss the cache at the same moment. Without coordination, all 50 hit the provider. PromptKernel uses one Durable Object instance per hot cache key to elect a leader, hold followers in a short bounded wait, and let followers re-read KV after the leader completes. 50 provider calls become 1–2.
Durable Objects are the only Cloudflare primitive that provides the distributed, consistent, per-key state needed for this. A KV-based approach would have race conditions. An in-memory Worker approach wouldn't coordinate across instances. Durable Objects are exactly the right tool.
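In production this state lives inside the per-key Durable Object; the plain class below is a sketch of the election decision the DO applies to each concurrent request. Names and shape are illustrative.

```typescript
// Leader-election sketch. In PromptKernel this state is held by one Durable
// Object per hot cache key; here a plain class shows the decision logic.
type Role = "leader" | "follower";

class StampedeCoordinator {
  private leader: string | null = null;
  private followers = 0;

  // First request to claim the key becomes the leader; the rest become
  // followers that re-read KV after the leader completes.
  claim(requestId: string): Role {
    if (this.leader === null) { this.leader = requestId; return "leader"; }
    this.followers++;
    return "follower";
  }

  // Leader finished (KV written): release the key for future bursts.
  complete(requestId: string): void {
    if (this.leader === requestId) { this.leader = null; this.followers = 0; }
  }
}
```

Followers then wait a bounded interval (the 50ms described later in this post) before re-reading KV, falling back to independent forwarding if the leader stalls.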
The Combined Stack
Cloudflare Worker ← Pipeline execution, edge-local
Cloudflare KV ← Exact cache (deterministic, global, TTL)
Cloudflare Vectorize ← Semantic index (vector search + metadata filtering)
Cloudflare Durable Objects ← Distributed coordination (per-key leader election)
Cloudflare R2 (planned) ← Long-term audit logs and eval datasets
What we built is an argument that this combination of Cloudflare primitives, Workers + KV + Vectorize + Durable Objects, is the right foundation for production AI infrastructure. Not just for caching. For any AI system that needs intelligence at the edge, semantic retrieval, and distributed coordination.
The Full Pipeline: How It Works Internally
Every POST /v1/infer runs through a 12-step pipeline covering canonicalization, policy classification, exact and semantic cache lookup, burst coordination, provider forwarding, response validation, and cache writes, among other steps.
The Safety Systems That Make It Production-Grade
Caching LLM responses sounds simple until you think about what can go wrong. We built three safety systems specifically to prevent the ways LLM caching fails in production.
Safety System 1: The Policy Engine
Not all traffic should be semantically cached. A support FAQ question and a live portfolio query are both LLM requests, but serving a cached portfolio answer to the wrong user would be a serious bug.
PromptKernel classifies every request before caching it:
| Class | Default cache posture |
|---|---|
| `static_qa` | Exact + semantic: highest reuse potential |
| `support_chat` | Exact + semantic: strong reuse candidate |
| `document_chat` | Exact + guarded semantic |
| `internal_copilot` | Exact + guarded semantic |
| `agent_tool_call` | Exact only or bypass: tools = live external state |
| `realtime_conversation` | Guarded or bypass: session order matters |
| `evaluation_test` | Exact only: repeatability matters more than reuse |
Any request that contains a freshness signal ("today", "latest", "current"), a tool declaration, or user-specific account state gets downgraded to a stricter safety level automatically. The volatility detection system emits named markers that explain every downgrade decision.
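A minimal sketch of that detection, assuming illustrative marker names and trigger lists (the real system's classifications are richer):

```typescript
// Volatility-detection sketch: emit named markers explaining each downgrade.
// Marker names and trigger lists are illustrative, not PromptKernel's actual set.
const FRESHNESS_TERMS = ["today", "latest", "current"];
const IDENTITY_FIELDS = ["userId", "accountId"];

function classifyVolatility(
  body: { messages?: { content: string }[] } & Record<string, unknown>,
) {
  const markers: string[] = [];
  const text = (body.messages ?? []).map((m) => m.content.toLowerCase()).join(" ");
  if (FRESHNESS_TERMS.some((t) => text.includes(t))) markers.push("freshness_language");
  if (IDENTITY_FIELDS.some((f) => f in body)) markers.push("identity_marker");
  if ("tools" in body) markers.push("tool_declaration"); // tools imply live external state
  // Any marker downgrades the request to a stricter (exact-only or bypass) posture.
  return { markers, semanticCacheAllowed: markers.length === 0 };
}
```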
Safety System 2: Response Validation
A 2xx status code from the provider does not mean the response is safe to cache. We've seen production LLM responses that are:
- Refusals ("I'm sorry, I can't help with that"), which should not be cached as the answer to a product question
- Truncated responses (finish_reason = length), which are incomplete and wrong to replay
- Hallucinated JSON, which fails schema validation silently
- Transient state, which references something true right now but not next week
PromptKernel validates every response before it enters the cache. If validation fails, the response is returned to the user but not written to KV or Vectorize. The rejection reason is logged.
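A sketch of that gate, covering the four failure modes above. The `finish_reason` field follows OpenAI's response shape; the refusal patterns and response interface here are illustrative, and the checks PromptKernel actually runs are richer.

```typescript
// Cacheability-validation sketch. ProviderResponse is a simplified stand-in
// for the provider's real response shape; refusal patterns are illustrative.
interface ProviderResponse {
  finish_reason: string;
  content: string;
  expectJson?: boolean; // caller expects structured JSON output
}

const REFUSAL_PATTERNS = [/^i'm sorry, i can't/i, /^i cannot help with/i];

function isCacheable(res: ProviderResponse): { ok: boolean; reason?: string } {
  // Truncated output is incomplete and wrong to replay.
  if (res.finish_reason === "length") return { ok: false, reason: "truncated" };
  // A refusal is not the answer to the question; never store it as one.
  if (REFUSAL_PATTERNS.some((p) => p.test(res.content.trim()))) {
    return { ok: false, reason: "refusal" };
  }
  // Expected-JSON responses must actually parse before they are replayable.
  if (res.expectJson) {
    try { JSON.parse(res.content); } catch { return { ok: false, reason: "invalid_json" }; }
  }
  return { ok: true };
}
```

On failure the response is still returned to the user; it simply never enters KV or Vectorize, and the reason is logged.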
Safety System 3: The Rollout Model
Rollout is its own safety system. "Safe to cache" and "safe to serve from cache in production today" are different questions.
PromptKernel separates them with a 7-stage rollout model:
Stage 0: Dark Mode
Full pipeline runs. Metrics emit. Cache is never served.
You learn your traffic patterns before risking anything.
Stage 1: Exact Cache Only
Deterministic exact hits served. Semantic disabled.
Maximum correctness, real production value.
Stage 2: Shadow Semantic
Semantic lookup runs after exact misses.
Results are logged but never served.
You review what would have been semantic hits before committing.
Stage 3: Tenant Opt-in Semantic
Semantic serving enabled only for allowlisted tenants.
Controlled blast radius. One tenant at a time.
Stage 4: Request-Class Rollout
static_qa → support_chat → document_chat
Not all traffic at once.
Stage 5: Broader Production
After threshold tuning and false-positive review.
Stage 6: Default-On Approved Profiles
Only for explicitly proven safe policy profiles.
You can ship PromptKernel to production today at Stage 0, watch real traffic for a week, and flip to Stage 1 without a deployment. Rolling back from any stage to dark mode is a config change.
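The stage semantics above reduce to a small serve/shadow decision. The function below is an illustrative sketch of that gate (stage numbers follow the article; the function shape and parameter names are assumptions):

```typescript
// Rollout-gate sketch: one config value decides whether a cache hit is served
// or only shadow-logged. Stage numbers match the 7-stage model above.
type HitType = "exact" | "semantic";

function mayServe(stage: number, hit: HitType, tenantAllowlisted: boolean): boolean {
  if (stage === 0) return false;              // dark mode: metrics only, never serve
  if (hit === "exact") return stage >= 1;     // exact hits served from stage 1 onward
  if (stage <= 2) return false;               // semantic is shadow-only through stage 2
  if (stage === 3) return tenantAllowlisted;  // opt-in tenants only
  return true;                                // stages 4+: class-by-class, then broad rollout
}
```

Because the stage is configuration, rolling back from any stage to dark mode is flipping one value, with no deployment.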
Tenant Isolation: The Property That Must Never Break
This is a non-negotiable rule in PromptKernel: no cross-tenant cache sharing by default.
Every cache namespace is partitioned by:
tenant × app × env × policy_profile × model_group × norm_version × embed_version
Two companies can ask the exact same question. Their cached responses stay completely isolated:
acme-corp: "What is the refund policy?"
→ namespace: acme:support:prod:...
→ hits acme-corp's Vectorize entries only
beta-inc: "What is the refund policy?"
→ namespace: beta:support:prod:...
→ cannot receive acme-corp's cached response
→ even if the vector similarity is 1.0
Embedding version changes trigger automatic namespace rotation. If your embedding model changes, old vectors don't silently contaminate new queries. This is enforced in the Vectorize metadata filters, not in application logic.
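The partition scheme is simple to state in code. The join format below is an illustrative assumption; what matters is that all seven dimensions participate, so no two tenants, and no two embedding versions, can ever share a key.

```typescript
// Namespace-key sketch using the seven partition dimensions listed above.
// The ":"-joined format is illustrative; PromptKernel's real encoding may differ.
interface NamespaceParts {
  tenant: string;
  app: string;
  env: string;
  policyProfile: string;
  modelGroup: string;
  normVersion: string;
  embedVersion: string;
}

function namespaceKey(p: NamespaceParts): string {
  return [
    p.tenant, p.app, p.env, p.policyProfile,
    p.modelGroup, p.normVersion, p.embedVersion,
  ].join(":");
}
```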
A Real Example: What This Looks Like in Production
A SaaS company runs a customer support chatbot. 8,000 messages arrive on an average day across 200 distinct questions. Here's what a typical day looks like:
9:00 AM: First user asks:
"how do i reset my password??"
requestId: "uuid-aaa", traceId: "span-bbb", timestamp: "09:00:01"
→ noise stripped (requestId, traceId, timestamp)
→ KV lookup: MISS (first time ever seen)
→ Vectorize: MISS (no vectors yet)
→ Forward to OpenAI: 820ms, 1 API call
→ Write KV + Vectorize
9:15 AM: Second user (different phrasing):
"How do I reset my password?"
requestId: "uuid-ccc", traceId: "span-ddd"
→ Canonical hash: DIFFERENT (different capitalization = different hash)
→ KV lookup: MISS
→ Vectorize: similarity = 0.97 ✓
→ Return from Vectorize: 22ms, $0
→ x-scc-cache-result: semantic_hit
→ x-scc-cache-source: vectorize
10:30 AM: Third user (exact same phrasing as first):
"how do i reset my password??"
→ Canonical hash: SAME as first user
→ KV lookup: HIT
→ Return from KV: 7ms, $0
→ x-scc-cache-result: exact_hit
→ x-scc-cache-source: kv
11:00 AM: User with account-specific question:
"Why was my account suspended?"
userId: "user_789", accountStatus: "suspended"
→ Volatility: userId = identity_affecting
accountStatus = semantic_blocking
→ semanticCacheAllowed = false (correct, this answer is user-specific)
→ Forward to provider
→ Not cached
→ x-scc-bypass-reason: identity_marker + semantic_blocked
End of day results:
- 8,000 total messages
- ~200 unique provider calls
- ~7,800 served from cache
- 97.5% cache hit rate
- Median latency: 12ms vs 750ms baseline
- Provider cost reduced by ~97%
The Burst Traffic Problem: And Why Durable Objects Solve It
There's a production failure mode that most LLM caching discussions skip entirely: the cache stampede.
Imagine a popular question hits 50 concurrent users at the same moment. The cache is cold; this question just became popular. All 50 users miss the cache simultaneously. All 50 go to the provider. You pay for 50 API calls when 1 would have been enough.
Without Cloudflare Durable Objects, this problem is hard to solve correctly. An in-memory coordinator only works within one Worker instance. A KV-based coordinator has race conditions. You need a consistent, per-key state that survives across Worker instances, and Durable Objects are exactly that.
50 concurrent requests, same cache key, all miss:
Without coordination: With PromptKernel + Durable Objects:
────────────────────── ──────────────────────────────────────
req 1 → OpenAI ($) req 1 → LEADER → OpenAI → KV write
req 2 → OpenAI ($) req 2 → FOLLOWER → wait → KV read ✓
req 3 → OpenAI ($) req 3 → FOLLOWER → wait → KV read ✓
... ...
req 50 → OpenAI ($) req 49 → FOLLOWER → wait → KV read ✓
req 50 → OVERLOAD → OpenAI (cap exceeded)
50 API calls 1–2 API calls
The Durable Object for each key tracks: leader request ID, follower count, start timestamp, and completion state. Followers wait a bounded 50ms. If the leader doesn't complete in time, followers fall back to independent forwarding. There is no indefinite blocking.
What We Proved About Cloudflare's AI Platform
Building and verifying PromptKernel against real Cloudflare infrastructure proved something we think is important for the Cloudflare ecosystem:
Workers + KV + Vectorize + Durable Objects is a complete stack for production AI infrastructure.
Not a demo stack. Not a prototype stack. A production stack with:
- Global edge execution for intelligence at the request level (Workers)
- Fast, globally-replicated, TTL-aware exact cache storage (KV)
- Production-grade semantic search with metadata filtering (Vectorize)
- Distributed, consistent coordination that eliminates stampedes (Durable Objects)
Most Vectorize documentation shows it being used for semantic search in isolation. Most Durable Object documentation shows it being used for simple counters or room state. PromptKernel shows what happens when you combine all four in a real AI workload, and what you can build when you do.
What This Means for Teams Already Using Cloudflare AI Gateway
If you're using Cloudflare AI Gateway today, PromptKernel is not a replacement; it's the layer that makes your AI Gateway caching dramatically more effective.
AI Gateway handles: provider abstraction, rate limiting, retries, observability, and account-level analytics.
PromptKernel handles: request meaning, canonicalization, semantic equivalence, burst coordination, response safety, rollout control.
They're complementary. PromptKernel sits in front and passes true misses through to AI Gateway (or directly to the provider). AI Gateway continues to do what it does well. PromptKernel adds the intelligence layer that turns most "misses" into hits before AI Gateway ever sees them.
GITHUB REPO: https://github.com/tanzil7890/prompt_kernel

