The 50ms promise I made in v1.6

#ai #api #edge #latency

A week ago I shipped Prism's edge layer and wrote a blog post explaining what was good about it and what was wrong with it. The good: bad keys now get rejected from your nearest Cloudflare PoP, no Mumbai round-trip. The wrong: cache hits were still landing in 300–500 milliseconds because the cache itself — Upstash Redis — lives in Mumbai and is single-region.

I'd promised 50 milliseconds in the design doc. I shipped 300–500. I told you that, and I said v1.6.5 would close the gap. Today is v1.6.5.

The thing I shipped

Cloudflare Workers KV is a key-value store that's globally replicated. You write to it once via the Cloudflare API, and within seconds the value is on every Cloudflare datacenter in the world. Reads are local to whichever PoP your request lands at — typically 30–80 milliseconds anywhere.

v1.6.5 is two changes:

The Mumbai backend, when it stores a new cache entry to Upstash, also fires a parallel write to Workers KV. Both writes are fire-and-forget; the customer's response doesn't wait on either.
The Cloudflare Worker, when it gets a cacheable request, checks Workers KV first. If KV has the entry, it serves the response from there — no Mumbai trip. If KV misses, it falls through to Upstash exactly like v1.6 did, and writes-through to KV in the background so the next request anywhere in the world serves from KV.

Upstash stays the source of truth. KV is a read-through cache, eventually consistent on a roughly 60-second timescale. The fallback path means correctness is never at stake — if KV is stale or empty for any reason, the worker reads Upstash and the customer sees a v1.6-era latency profile on that one request. By the time they make the second one, KV has caught up.

The numbers

I ran a three-request smoke test from Singapore (a real Cloudflare PoP, not a curl from my Mumbai laptop pretending) yesterday. Same request three times in a row, same API key, same model. New X-Prism-Edge-Cache-Layer response header tells you which layer served the hit.

Request 1 (cache miss):                    6.52s   passthrough to Mumbai
Request 2 (Upstash hit, KV not yet warm):  0.48s   x-prism-edge-cache-layer: redis
Request 3 (KV hit after write-through):    0.18s   x-prism-edge-cache-layer: kv

Request 3 is the steady-state number for an international customer. 184 milliseconds. That includes the time to parse the request, run the classifier, build the cache fingerprint, do the KV read, and serialize the response. The actual KV read is about 40 milliseconds; the rest is worker overhead.

For comparison: the v1.6 number for the same request from Singapore was the Request 2 line — 484 milliseconds. The v1.5 number, before the edge layer existed at all, was about 700 milliseconds. So v1.6.5 is about 3.8× faster than v1.5 and 2.6× faster than v1.6.

From San Francisco the numbers should be more dramatic, because the v1.6 Mumbai round-trip from there is closer to a full second; KV from a US PoP is still ~50 milliseconds. I haven't been able to run a curl from a real US machine yet, but the geographic math suggests the speedup ratio is closer to 5×. I'll measure properly when there's enough customer traffic to read it off the dashboard instead of synthetic smoke tests.

The thing I'd budgeted two days for

I estimated v1.6.5 at two days. It took roughly four hours, including writing this post. Three things made it cheap.

The fingerprint port from v1.6 was already done. Putting the cache lookup at the edge in v1.6 meant porting Python's exact-cache fingerprint to TypeScript and pinning it byte-equal with fixture tests. That's the hardest part of any "have Mumbai and the edge agree on cache keys" project. v1.6.5 reused all of it — the worker calls the same buildFingerprint function it had been using to look up Upstash, just now to look up KV first.

Workers KV's API is exactly the shape we needed. A PUT with a ?expiration_ttl= query parameter mirrors Redis SETEX semantics one-to-one. A GET is just a GET. There's no special protocol, no client library, no schema. The dual-write code in the Mumbai backend is one httpx.AsyncClient PUT call, scheduled as an asyncio.create_task so the customer response doesn't wait on Cloudflare's API.

Eventual consistency was already the right model. I'd assumed I'd need to build something clever to handle the case where the worker reads KV before Mumbai's dual-write has propagated. The clever thing turned out to be: read KV, miss, read Upstash, hit, write-through to KV in ctx.waitUntil. That handles the propagation lag, the case where KV has expired but Upstash hasn't, the case where the dual-write from Mumbai failed entirely, and the case where the entry was written before v1.6.5 existed. One mechanism, four bugs avoided.

The over-estimate isn't a bug exactly — when I scoped v1.6.5 I was thinking about correctness pitfalls (consistency, partial writes, invalidation propagation, cost overruns) and budgeted time to fix them. Most of them turned out not to be problems. That's a happy outcome but it's also the kind of estimate where the right thing in hindsight would have been to ship it the same day as v1.6 and skip the "ten paragraphs of honest reporting on what's gated" middle act.

What this means for v1.7

The Prism reliability story now has all four pillars in place:

v1.4 Policy + Governance — customers can deny models and cap budgets per project, with an audit log.
v1.5 Router Hardening — providers that are down get demoted automatically via a rolling-window health gate; outage windows get speculative parallel routing to avoid head-of-line blocking.
v1.6 Edge Routing — auth and cache lookup happen at the customer's nearest PoP; bad keys never reach Mumbai.
v1.6.5 Workers KV replication — cache hits serve in ~30-80ms anywhere, not just for Indian customers.

The remaining gaps are about product breadth, not infrastructure. Semantic cache at the edge is one (would need Workers AI for the embedding). A second region for Mumbai cold-paths is another (Caddy + a second EC2 is the obvious shape). Both are weeks not hours.

For the customer in San Francisco who started this whole arc — the one whose 250ms RTT to Mumbai I keep using as the framing for these edge posts — v1.6.5 is the first release where I'd actually look them in the eye and say "use Prism, it's not a worse choice than calling OpenAI directly." That's the bar I was trying to hit. I think we hit it.

If you want to verify: hit any /v1/chat/completions endpoint at api.ssimplifi.com twice with the same payload, then a third time after a few seconds. Look at X-Prism-Edge-Cache-Layer on the third response. If it says kv and the total time is under 200ms from anywhere outside Mumbai, the promise from v1.6 is paid back.

DEV Community

The 50ms promise I made in v1.6

The thing I shipped

The numbers

The thing I'd budgeted two days for

What this means for v1.7

Top comments (0)