Putting Prism's front door on every continent

#ai #api #edge #latency

Until today, every request to api.ssimplifi.com — whether it came from Bangalore, London, or San Francisco — had to physically reach an EC2 instance in Mumbai before anything happened. For our Indian customers that's invisible: under 20 milliseconds of network time before Prism does its first byte of useful work. For a developer in San Francisco, it was a different story: roughly 250 milliseconds each way before we'd even classified the request, before we'd checked the cache, before we'd called the AI provider. We could be the fastest proxy in the world internally and the customer in California would still see half a second they shouldn't have to.

v1.6 fixes the half-second. The way it fixes it is the part that took some thinking.

The simple version

api.ssimplifi.com is now fronted by a Cloudflare Worker. Cloudflare runs that worker in 300+ cities around the world. A request from San Francisco hits the Cloudflare datacenter in San Francisco; a request from London hits London; a request from Bangalore hits Bangalore. The worker does three things at that local datacenter:

1. Parses the Authorization header, hashes the API key, and checks whether it's valid. If it's malformed or unknown, it returns a 401 right there and never round-trips to Mumbai. That's about 250 milliseconds saved for every bad request, every opportunistic scan, every developer who pasted the wrong key into their code.
2. Looks up the exact cache. Prism caches every identical request for an hour by default. If your app is asking the same question it asked five minutes ago, the worker pulls the cached answer and returns it directly, without touching Mumbai's application servers.
3. Forwards everything else to Mumbai. Streaming responses, cache misses, semantic cache (the "rephrased same question" matcher), policy and budget enforcement, the v1.5 router hardening, billing: all of it still runs in Mumbai unchanged. The worker is a front door, not a re-implementation. We don't trust an edge re-implementation of pricing or policy enforcement; those are the parts where a bug costs money.
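
The three steps above can be sketched as a single decision function. This is a minimal illustration, not Prism's actual worker code: the names decideAtEdge, EdgeDeps, and EdgeOutcome are hypothetical, the "Bearer <key>" header format is an assumption, and the hashing and cache lookups are injected so the control flow stays visible.

```typescript
// Hypothetical sketch of the edge decision flow; every name here is illustrative.
type EdgeOutcome =
  | { edgeCache: "auth-reject"; status: 401 }
  | { edgeCache: "hit"; status: 200; body: string }
  | { edgeCache: "passthrough" };

interface EdgeDeps {
  hashKey(apiKey: string): string; // e.g. a SHA-256 hex digest of the key
  isKnownKeyHash(hash: string): boolean; // lookup against the valid key hashes
  cacheGet(fingerprint: string): string | undefined; // exact-cache lookup
}

function decideAtEdge(
  authorization: string | null,
  fingerprint: string,
  deps: EdgeDeps
): EdgeOutcome {
  // Step 1: parse the header (assumed to be "Bearer <key>") and validate the key.
  // Malformed or unknown keys are rejected locally, with no origin round-trip.
  const match =
    authorization === null ? null : /^Bearer\s+(\S+)$/.exec(authorization);
  if (match === null || !deps.isKnownKeyHash(deps.hashKey(match[1]))) {
    return { edgeCache: "auth-reject", status: 401 };
  }

  // Step 2: exact-cache lookup by request fingerprint.
  const cached = deps.cacheGet(fingerprint);
  if (cached !== undefined) {
    return { edgeCache: "hit", status: 200, body: cached };
  }

  // Step 3: everything else is forwarded to the origin unchanged.
  return { edgeCache: "passthrough" };
}
```

The edgeCache field uses the same three values that the X-Prism-Edge-Cache header reports, so each branch corresponds to one header value.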

Every response now carries two new headers so you can see what happened:

X-Prism-Edge-Cache: hit | passthrough | auth-reject
X-Prism-Edge-Region: SIN | SFO | FRA | BOM | ...

The IATA airport code in X-Prism-Edge-Region is the Cloudflare PoP that handled your request. Useful for debugging "is my client hitting the edge I expect?"
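
For that kind of debugging, a small client-side helper could pull both headers out of a response. This is a sketch under assumptions: readEdgeInfo and the HeaderSource interface are hypothetical names, and HeaderSource only mirrors the get() method of the standard Fetch Headers object so the helper works with any compatible client.

```typescript
// Minimal shape shared by the Fetch API's Headers object and simple test fakes.
interface HeaderSource {
  get(name: string): string | null;
}

type EdgeCacheStatus = "hit" | "passthrough" | "auth-reject";

interface EdgeInfo {
  cache: EdgeCacheStatus | null; // null if the header is absent or unrecognized
  region: string | null; // IATA code of the PoP, e.g. "SFO"
}

// Hypothetical helper: extracts the two edge headers from a response.
function readEdgeInfo(headers: HeaderSource): EdgeInfo {
  const raw = headers.get("X-Prism-Edge-Cache");
  const cache =
    raw === "hit" || raw === "passthrough" || raw === "auth-reject" ? raw : null;
  return { cache, region: headers.get("X-Prism-Edge-Region") };
}
```

With a Fetch-style client, the call would look like readEdgeInfo(response.headers).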

Honest reporting on latency

Here's what I'd love to tell you: cache hits now return in 50 milliseconds from anywhere in the world. That's the dream version of v1.6.

That's not what shipped today. What shipped is a meaningful improvement that doesn't quite reach the dream, and I'd rather tell you the truth and follow up than have you measure it and find I oversold.

The auth-rejection win is real and clean. A malformed API key now gets a 401 from the customer's nearest Cloudflare PoP: anywhere from a few milliseconds in their region to maybe 50ms cross-region. Compared to the previous round-trip of roughly 600ms to Mumbai and back, that's a reduction of more than 90% in latency for bad-key requests. More importantly, those requests never burn Mumbai compute, which means our actual customers' good requests get less queue time.

The cache-hit win is partial. When the edge finds a cached entry, it returns it directly. But the cache itself — the Upstash Redis instance that stores those entries — lives in Mumbai, single-region. So even the edge has to round-trip to Mumbai to read the cache. From Singapore that's about 70 milliseconds each way; from San Francisco it's 250 milliseconds each way. So edge cache hits today land at roughly 300–500 milliseconds end-to-end, depending on geography. That's about 200ms faster than the previous Mumbai-origin path, but it's not the 50ms I'd promised in the design doc.

The fix is a small follow-up release. Cloudflare Workers KV is a key-value store that's globally replicated — when you write to it, the value propagates to every Cloudflare datacenter within seconds, and reads are local. Putting the exact-cache entries in Workers KV (with Upstash as the source of truth + cron-replicated) is roughly two days of work. That's v1.6.5. After it ships, cache hits will actually be ~30–50 milliseconds from anywhere — the dream version of this release.

I'm telling you all of this because I'd rather lose a little narrative now than have a customer benchmark v1.6 and feel misled. The architecture is right; the optimization just isn't done.

Some choices that mattered

A few decisions during the build that I'd make the same way again, plus one I had to undo.

Workers Custom Domain instead of Workers Routes. The standard pattern for putting a Worker in front of a hostname is to attach a "route" and turn on the orange cloud in Cloudflare DNS. We tried that first. It broke spectacularly: api.ssimplifi.com is two levels deep under the apex ssimplifi.com, and Cloudflare's free Universal SSL doesn't cover two-level subdomains. The TLS handshake failed for every request the moment we flipped the orange cloud, and we had a brief period where customers couldn't reach the API at all. We rolled back within minutes, switched to Workers Custom Domain (which auto-provisions a per-hostname certificate regardless of nesting depth), and that's been smooth ever since. No paid tier required. If you're putting a Worker in front of a deep subdomain on the free Cloudflare plan, Custom Domain is the right answer — Routes is a footgun for that case.

Byte-identical fingerprints between Python and TypeScript. This was the single piece of v1.6 that absolutely had to be right. The cache fingerprint is a SHA-256 over a JSON-serialized representation of the request, and Python's json.dumps differs from JavaScript's JSON.stringify in three small but lethal ways: Python's default separators put a space after every comma and colon, while JSON.stringify emits none; Python's default ensure_ascii=True escapes non-ASCII characters as \uXXXX, while JSON.stringify leaves them as-is; and Python writes the float 1.0 as 1.0, while JavaScript, which has no separate integer type, writes it as 1. If any one of these had been off in the TypeScript port, every single cache lookup would miss and the entire feature would silently do nothing useful. So we wrote a Python-compatible JSON serializer in TypeScript, with a pyFloat() sentinel for integer-valued floats, and pinned the behavior with fixture tests against Python-generated hashes. Four fixtures pass byte-equality, the most paranoid possible verification, and the test runs on every change.
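
To make the three differences concrete, here is a minimal sketch of what a Python-compatible serializer might look like. It is not the production code: pyDumps, pyString, formatPyFloat, and the PyFloat class are hypothetical names (only the pyFloat() sentinel is named in this post, and its real implementation may differ). The sketch assumes json.dumps is called with its defaults, keeps keys in JavaScript property order, and does not handle NaN, Infinity, or floats that Python would print in exponent notation. The SHA-256 step is left out.

```typescript
// Sentinel marking a number that Python would hold as a float (e.g. 1.0).
class PyFloat {
  constructor(readonly value: number) {}
}
const pyFloat = (value: number): PyFloat => new PyFloat(value);

type PyJson =
  | null
  | boolean
  | number
  | string
  | PyFloat
  | PyJson[]
  | { [key: string]: PyJson };

// Mimics ensure_ascii=True: start from JSON.stringify's escaping, then
// escape every remaining UTF-16 code unit outside printable ASCII as \uXXXX.
function pyString(s: string): string {
  return JSON.stringify(s).replace(/[^\x20-\x7e]/g, (ch) => {
    const hex = ch.charCodeAt(0).toString(16);
    return "\\u" + ("0000" + hex).slice(-4);
  });
}

// Python prints integer-valued floats with a trailing ".0" (below 1e16);
// larger magnitudes use exponent notation, which this sketch does not cover.
function formatPyFloat(x: number): string {
  if (Number.isInteger(x) && Math.abs(x) < 1e16) return x.toFixed(1);
  return String(x);
}

function pyDumps(v: PyJson): string {
  if (v === null) return "null";
  if (typeof v === "boolean") return v ? "true" : "false";
  if (typeof v === "string") return pyString(v);
  if (v instanceof PyFloat) return formatPyFloat(v.value);
  if (typeof v === "number") return String(v); // plain numbers are treated like Python ints
  // Python's default separators: ", " between items and ": " after keys.
  if (Array.isArray(v)) return "[" + v.map(pyDumps).join(", ") + "]";
  const obj = v;
  const members = Object.keys(obj).map((k) => pyString(k) + ": " + pyDumps(obj[k]));
  return "{" + members.join(", ") + "}";
}
```

A caller would wrap any field that is a float on the Python side, for example pyDumps({ temperature: pyFloat(1) }), and feed the resulting string to the same SHA-256 routine on both sides.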

Don't put policy enforcement at the edge. It's tempting. Customer adds a deny-list rule, edge enforces it, no Mumbai round-trip. The trouble is: v1.4 policies live in Supabase, and we'd need either a slow Supabase fetch on every request (defeats the point) or a complex propagation scheme that turns "deny opus" into a global edge-cache invalidation. Easier and safer: policy stays in Mumbai. The worst case is a request that would have been denied by policy makes it as far as Mumbai before getting its 403. That's an extra ~150ms of wasted compute on the deny path — not free, but absolutely worth not having a cache-coherence bug that decides to silently allow blocked traffic for an hour.

Stats writes via ctx.waitUntil. When the edge serves a cache hit, the worker bumps the same Redis stats hash that Mumbai writes to. We do this via Cloudflare's ctx.waitUntil() mechanism, which keeps the worker alive until the write finishes without holding up the response to the customer. The customer never waits on the stats write; the dashboard counter still ticks up. Small thing, but the kind of small thing that means edge hits have the same latency as if we weren't tracking them.
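
The pattern is small enough to sketch. The following is an illustration rather than the real handler: serveEdgeHit and bumpStats are hypothetical names, and WaitUntilContext is a local stand-in that declares only the waitUntil() method of the Workers execution context.

```typescript
// Local stand-in for the Workers execution context; only waitUntil is needed here.
interface WaitUntilContext {
  waitUntil(promise: Promise<unknown>): void;
}

// Hypothetical handler fragment for an edge cache hit.
function serveEdgeHit(
  cachedBody: string,
  ctx: WaitUntilContext,
  bumpStats: () => Promise<void> // assumed to increment the shared stats hash
): string {
  // Hand the in-flight write to the runtime so it is allowed to finish,
  // but do not await it: the cached body is returned immediately.
  // A failed stats write is swallowed so it never affects the response.
  ctx.waitUntil(bumpStats().catch(() => undefined));
  return cachedBody;
}
```

The key point is the absence of an await on the stats write: the return does not depend on how long the Redis call takes.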

What this changes

If you're an Indian customer reading this from Bangalore: you'll see almost no difference, because Mumbai-to-Mumbai-via-edge is the same hop count as direct. We're not your bottleneck. Your bottleneck is the AI providers, and the v1.5 work made that one much more survivable.

If you're an international customer: today you'll see your bad-key rejection requests come back faster, and you'll see cache hits land roughly 200ms quicker. Both real wins. Neither is the headline number. The headline number lands in v1.6.5, and I'd encourage you to wait that out before doing a proper benchmark.

If you're an SRE thinking about adopting Prism: the architectural answer is now what you'd want. Front door close to your users; control plane in one region; cache that wants to be globally replicated but isn't yet. The shape is right. The optimizer needs another pass.

The reason this work exists at all is that "we make AI cheaper" and "we make AI observable" and "we make AI survivable" — the previous three pillars — are all about features that customers value. They wouldn't have been enough if the customer in San Francisco couldn't reach the front door without waiting half a second first. v1.6 unblocks that conversation. v1.6.5 wins it.

DEV Community
