Achieving sub-100ms response times at the edge with Cloudflare Workers: lessons from analytics in production

#cloudflare #webdev #performance #analytics

Running web analytics ingestion at the edge sounds simple. POST event, write to database, return 200. The reality has more nuance.

This is a technical writeup of how Zenovay (my analytics SaaS) handles 50M+ monthly events with sub-100ms p99 response times globally on Cloudflare Workers.

The naive version

export default {
  async fetch(request, env) {
    const event = await request.json()
    await env.DB.prepare(
      'INSERT INTO events (...) VALUES (...)'
    ).bind(...).run()
    return new Response('ok')
  }
}

This works. It also blocks the response on a database write. P99 latency was 280ms.

Step 1: Decouple the write

Workers have ctx.waitUntil(), which keeps the Worker running until a promise settles, so you can do work after returning the response.

export default {
  async fetch(request, env, ctx) {
    const event = await request.json()
    ctx.waitUntil(persistEvent(event, env))
    return new Response('ok', { status: 202 })
  }
}

Response time dropped to ~40ms. But now you have a reliability question: what if the worker dies before the write completes?
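The snippet above calls persistEvent without showing it. A minimal sketch of what such a helper could look like follows. The column names (site_id, name, ts) and the event fields are assumptions for illustration, not the real Zenovay schema. The point it illustrates is that once the 202 has gone out, a failed write is invisible to the client, so the helper has to at least log it.

```javascript
// Hypothetical helper: the events table columns (site_id, name, ts) and the
// event fields are assumptions for illustration, not the real schema.
async function persistEvent(event, env) {
  try {
    await env.DB.prepare(
      'INSERT INTO events (site_id, name, ts) VALUES (?, ?, ?)'
    ).bind(event.siteId, event.name, event.ts ?? Date.now()).run()
    return true
  } catch (err) {
    // The 202 has already been sent, so the client never sees this failure.
    console.error('event write failed', err)
    return false
  }
}
```

Logging does not make the write durable. It only makes the loss visible, which is the gap the next step addresses.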

Step 2: Queue the write

Cloudflare Queues turned out to be the right abstraction. Workers push events to a queue, and a consumer Worker handles the writes in batches.

P99 is now 38ms. Throughput went up because batched inserts are way more efficient than single-row inserts.
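A rough sketch of the producer and consumer pair described above, not the exact production code. EVENTS_QUEUE is a hypothetical queue binding name, and the table columns are assumed for illustration. The consumer uses D1's batch() to send all inserts for a queue batch in one call.

```javascript
// Sketch only: EVENTS_QUEUE is a hypothetical queue binding, and the
// events columns (site_id, name, ts) are assumed for illustration.
const producer = {
  async fetch(request, env) {
    const event = await request.json()
    // Enqueue instead of writing to the database on the request path.
    await env.EVENTS_QUEUE.send(event)
    return new Response('ok', { status: 202 })
  },
}

const consumer = {
  // Invoked by Cloudflare Queues with a batch of messages.
  async queue(batch, env) {
    const insert = env.DB.prepare(
      'INSERT INTO events (site_id, name, ts) VALUES (?, ?, ?)'
    )
    const statements = batch.messages.map((msg) =>
      insert.bind(msg.body.siteId, msg.body.name, msg.body.ts)
    )
    try {
      // One call to D1 for the whole batch.
      await env.DB.batch(statements)
      batch.ackAll()
    } catch (err) {
      console.error('batch insert failed', err)
      // Ask the queue to redeliver every message in this batch.
      batch.retryAll()
    }
  },
}

// In a real deployment each object would be the default export of its own Worker.
```

Awaiting send() before responding means the 202 is only returned once the queue has accepted the event. Retrying the whole batch can redeliver messages, so the insert should be safe to repeat (for example by deduplicating on an event ID, which this sketch does not do).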

Step 3: Edge caching for reads

The dashboard side has different constraints. Most dashboard queries can tolerate 5 to 10 seconds of staleness.

Workers KV is too slow for this (read latency ~50ms). Caching the response with the Cache API is way faster.

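const cache = caches.default
// Key the cache on the request URL (the Cache API only stores GET requests)
const cacheKey = new Request(request.url, request)

let response = await cache.match(cacheKey)
if (!response) {
  const originResponse = await fetchFromOrigin(request, env)
  // Copy the response so its headers are mutable before setting Cache-Control
  response = new Response(originResponse.body, originResponse)
  response.headers.set('Cache-Control', 's-maxage=10')
  // Store a clone in the background; the original body goes to the client
  ctx.waitUntil(cache.put(cacheKey, response.clone()))
}
return response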

Dashboard query P99 went from 180ms to 22ms on cache hits, with a 95% cache hit rate.

Step 4: Geographic routing

One thing that surprised me: even with edge workers, your D1 or origin database location matters. A worker in Singapore writing to a D1 instance in Frankfurt adds ~180ms latency.

For Zenovay I run regional write workers that funnel to the closest D1 replica, then sync asynchronously to the primary. This is more complex, but it cut p99 in APAC from 240ms to 75ms.
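One way the "closest database" choice could be made is from the continent code Cloudflare attaches to each request as request.cf.continent. The sketch below is an assumption about the shape of such routing, not the actual Zenovay setup: the binding names DB_APAC, DB_EU and DB_NA are hypothetical, each is treated as a separate regional database, and the async sync to the primary is not shown.

```javascript
// Hypothetical mapping from Cloudflare continent codes to D1 bindings.
// DB_APAC, DB_EU and DB_NA are assumed binding names.
const REGION_BINDINGS = {
  AS: 'DB_APAC',
  OC: 'DB_APAC',
  EU: 'DB_EU',
  AF: 'DB_EU',
  NA: 'DB_NA',
  SA: 'DB_NA',
}

function pickRegionalDb(request, env) {
  // request.cf is populated by Cloudflare; it may be absent in local tests.
  const continent = request.cf?.continent
  // Fall back to one fixed region when the continent is unknown.
  const binding = REGION_BINDINGS[continent] ?? 'DB_EU'
  return env[binding]
}
```

A write handler would then call pickRegionalDb(request, env).prepare(...) instead of env.DB.prepare(...).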

Numbers in production

P50 response: 18ms
P95 response: 62ms
P99 response: 89ms
Cache hit rate (dashboard): 95.4%
Cost per million events: ~$0.42
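To put the cost figure in monthly terms, here is the arithmetic using only the numbers in this post. 50M is treated as a lower bound on monthly volume, so the result is a floor of about $21 per month, not an exact bill.

```javascript
// Figures from this post: 50M events per month (a lower bound) at ~$0.42 per million.
const eventsPerMonth = 50_000_000
const costPerMillionUsd = 0.42
const monthlyCostUsd = (eventsPerMonth / 1_000_000) * costPerMillionUsd
console.log(`~$${monthlyCostUsd.toFixed(2)} per month`)
```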

If you want to see this in action: zenovay.com. The 3D globe on the homepage is showing live events flowing through this exact pipeline.

Valerio

DEV Community