
EventDock

Originally published at eventdock.app

I Built a Webhook Relay on Cloudflare Workers — Here's Every Bug That Almost Killed It

I built EventDock, a webhook reliability layer that sits between webhook providers (Stripe, GitHub, etc.) and your application. The idea is simple: accept the webhook instantly, store it durably, and deliver it to your endpoint with retries, logging, and a dead letter queue.

I chose Cloudflare Workers as the platform. Edge compute seemed like the perfect fit — webhook providers have short timeouts (Stripe gives you ~20 seconds), so you want to ACK as fast as possible. A Worker can respond in under 50ms from the nearest edge node. No cold starts, no servers to manage, global by default.

The architecture works beautifully on paper. Getting it to work reliably in production required finding and fixing bugs that were invisible in development and staging. Here are the four that almost killed the project.

The Architecture

Before diving into the bugs, here's the flow:

Provider (Stripe, GitHub, etc.)
  → CF Worker (ingest) — validates, stores to D1, enqueues
    → CF Queue — at-least-once delivery semantics
      → CF Worker (delivery) — fetches payload, delivers to customer endpoint
        → Customer's app

The supporting cast:

  • D1 (SQLite at the edge) — stores event metadata, delivery state, and retry counts
  • KV — idempotency keys and deduplication
  • R2 — payload storage for large webhook bodies
  • Cron triggers — a recovery mechanism that finds stuck events and requeues them

The key design decision: the ingest worker does minimal work. Accept the webhook, write to D1 and the queue, return 200. Everything else happens asynchronously. This keeps the p99 response time under 100ms, which matters when Stripe is waiting for your response.

Bug #1: The Unwaited Retry

This one was subtle and devastating. In the queue consumer (the delivery worker), when a delivery attempt failed and needed to be retried, the code looked like this:

// The queue consumer processes batches of messages
async queue(batch: MessageBatch, env: Env) {
  for (const msg of batch.messages) {
    const event = msg.body;
    const result = await deliverWebhook(event, env);

    if (!result.ok) {
      // Schedule retry with exponential backoff
      handleRetry(event, env);  // ← THE BUG
    }

    msg.ack();
  }
}

handleRetry() is an async function. It writes to D1 to update the retry count and schedule the next attempt. But it wasn't being awaited.

In Node.js, this would be a fire-and-forget — the retry would probably still complete in the background. In Cloudflare Workers, it's a death sentence. Workers are request-scoped. Once you call msg.ack() and the queue batch handler returns, the runtime can (and will) terminate the execution context. The unresolved promise from handleRetry() just... disappears.

The fix was one word:

// Before (broken)
handleRetry(event, env);
msg.ack();

// After (fixed)
await handleRetry(event, env);
msg.ack();

Why it was hard to find: Events appeared to be processing. The first delivery attempt worked fine. It was only events that needed retries that silently vanished. And in development, you're usually testing against endpoints that succeed. The failure mode only appeared under real production conditions — intermittent endpoint failures, timeouts, 500s.

The monitoring showed events being consumed from the queue (good!) but some never reaching a final delivered/failed state. They just sat in "delivering" forever. Classic fire-and-forget symptom, but only obvious in retrospect.
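For work that genuinely should outlive the ack, the Workers runtime provides `ctx.waitUntil()` (the queue handler receives an `ExecutionContext` as its third argument). Here's a minimal sketch of that alternative; the interfaces are simplified stand-ins for the real runtime types, not Cloudflare's actual definitions:

```typescript
// Simplified stand-ins for the Workers runtime types (the real ones come
// from @cloudflare/workers-types; these exist so the sketch is self-contained).
interface Msg<T> { body: T; ack(): void }
interface Ctx { waitUntil(p: Promise<unknown>): void }

// Deliver each message; on failure, register the retry bookkeeping with the
// runtime via waitUntil so it survives the handler returning.
async function processBatch<T>(
  messages: Msg<T>[],
  ctx: Ctx,
  deliver: (event: T) => Promise<{ ok: boolean }>,
  handleRetry: (event: T) => Promise<void>,
): Promise<void> {
  for (const msg of messages) {
    const result = await deliver(msg.body);
    if (!result.ok) {
      // Not awaited, but the runtime keeps the context alive until it settles
      ctx.waitUntil(handleRetry(msg.body));
    }
    msg.ack();
  }
}
```

A plain `await handleRetry(...)` (the fix above) is still the simpler choice when ordering matters; `waitUntil` is for when you want to ack quickly and let bookkeeping finish in the background.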

Bug #2: Queue Send Without Try-Catch

The ingest handler had a straightforward flow: validate the webhook, store the event in D1, then send it to the queue for delivery.

// Ingest handler (simplified)
async function handleWebhook(request: Request, env: Env) {
  const payload = await request.json();
  const event = createEvent(payload);

  // Step 1: Save to D1 (durable state)
  await env.DB.prepare(
    'INSERT INTO events (id, endpoint_id, payload, status) VALUES (?, ?, ?, ?)'
  ).bind(event.id, event.endpointId, event.payload, 'pending').run();

  // Step 2: Enqueue for delivery
  await env.QUEUE.send({
    eventId: event.id,
    endpointId: event.endpointId
  });
  // ↑ No try-catch. If this throws, the event is saved but never queued.

  return new Response('OK', { status: 200 });
}

The problem: QUEUE.send() can fail. Cloudflare Queues are generally reliable, but they're a distributed system — transient errors happen. When the queue send failed, the error propagated up and the handler returned a 500 to the webhook provider. But the D1 write had already committed.

So now you have an event in the database with status "pending" that will never be picked up by the queue consumer. The customer sees it in their dashboard as "received" but it never gets delivered. It's a zombie event.

// After: catch queue failures, mark for cron recovery
await env.DB.prepare(
  'INSERT INTO events (id, endpoint_id, payload, status) VALUES (?, ?, ?, ?)'
).bind(event.id, event.endpointId, event.payload, 'pending').run();

try {
  await env.QUEUE.send({
    eventId: event.id,
    endpointId: event.endpointId
  });
} catch (e) {
  // Event is in D1 — the cron recovery job will find it
  // and requeue it. Log the error but return 200 to the provider.
  console.error('Queue send failed, event will be recovered by cron', event.id, e);
}

// Always return 200 — the event is durably stored regardless
return new Response('OK', { status: 200 });

The insight: the D1 write IS the source of truth. If the event is in the database, the system should eventually deliver it. The queue is an optimization for fast delivery, not the guarantee. The guarantee comes from the cron recovery job that scans for stuck events.

Which brings us to bug #3.

Bug #3: The Safety Net That Never Ran

From the very beginning, I built a cron-triggered recovery system. Every few minutes, a Worker runs, queries D1 for events stuck in "pending" or "delivering" for too long, and requeues them. This was supposed to be the safety net for exactly the kind of failure in Bug #2.

The code was solid. Tested in development. Ready to go.

One problem: the cron trigger in wrangler.toml was commented out.

# wrangler.toml (production env)

[env.production]
name = "eventdock-worker-prod"
# ... other config ...

# [env.production.triggers]
# crons = ["*/5 * * * *"]

I had commented it out during early development (probably debugging something) and never uncommented it. The deployment pipeline didn't warn about missing triggers. There's no "expected crons" health check. The recovery system silently didn't exist for months.

During those months, any event that hit Bug #2's failure mode (or any other edge case where the queue send didn't fire) was just... gone. Saved in D1 but never delivered and never recovered.

The fix was uncommenting two lines. The lesson was harder: your safety nets need their own monitoring. If the recovery cron hasn't run in the last 10 minutes, that itself should be an alert.

[env.production.triggers]
crons = ["*/5 * * * *"]

A note on Wrangler v3 syntax: cron triggers for specific environments need the [env.production.triggers] block with a crons = [...] array. Not [triggers] at the top level (that's the default env), and not cron = (singular). This particular config syntax isn't well-documented, and getting it wrong means your crons silently don't deploy.

Bug #4: The Ghost Consumer

This one was the most disorienting to debug. Events were being enqueued correctly (I could see the queue depth increasing and then dropping back to zero), but some events were never delivered. The delivery worker's logs showed nothing — no attempts, no errors, nothing. Events just vanished from the queue without a trace.

The cause: my wrangler.toml had a queue consumer binding in the default (top-level) environment, not just in [env.production].

# wrangler.toml (the problem)

# Default env — leftover from initial setup
[[queues.consumers]]
queue = "eventdock-deliveries"
max_batch_size = 10

# Production env — the "real" consumer
[env.production]
[[env.production.queues.consumers]]
queue = "eventdock-deliveries"
max_batch_size = 10

When I deployed with wrangler deploy --env production, it deployed the production worker. But the default environment's consumer config meant there was a previous worker deployment (from running wrangler deploy without --env during early development) that was ALSO registered as a consumer on the same queue.

Cloudflare Queues distributes messages across all registered consumers. So roughly half the events went to the production worker (which delivered them correctly) and half went to a stale, unmonitored worker that consumed them and did nothing useful.

The fix was removing the default environment's consumer config and deleting the stale worker deployment. But finding it required going to the Cloudflare dashboard, looking at the queue's consumer list, and realizing there were two workers consuming from the same queue.

The debugging process took hours because every metric looked "kind of right." Queue depth went up, queue depth went down, delivery rate was non-zero. It was only when I compared "events enqueued" to "events delivered" over a 24-hour window that the ~50% loss rate became obvious.

Edge Compute Gotchas (Things I Wish I'd Known)

Beyond the specific bugs, here are the platform-level lessons from running a production system on Cloudflare Workers:

D1 is SQLite, and that matters. It's excellent for reads and great for the kind of workload a webhook relay generates (mostly inserts and point lookups). But SQLite has write contention characteristics you need to understand. Concurrent writes to the same database serialize. Under high throughput, your D1 write latency can spike. Batch your writes where possible.
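Batching in D1 means the `batch()` API: several prepared statements sent in one round trip. A hedged sketch of what that looks like for the ingest path, reusing the table and columns from the earlier examples (the interfaces are simplified stand-ins for the D1 binding, not the real types):

```typescript
// Simplified stand-ins for the D1 binding: prepare() builds a statement,
// bind() fills its placeholders, batch() runs many statements in one round trip.
interface Stmt { bind(...args: unknown[]): Stmt }
interface DB { prepare(sql: string): Stmt; batch(stmts: Stmt[]): Promise<unknown[]> }

interface EventRow { id: string; endpointId: string; payload: string }

// Insert N events with one database round trip instead of N.
async function insertEvents(db: DB, events: EventRow[]): Promise<void> {
  const stmt = db.prepare(
    'INSERT INTO events (id, endpoint_id, payload, status) VALUES (?, ?, ?, ?)'
  );
  await db.batch(
    events.map((e) => stmt.bind(e.id, e.endpointId, e.payload, 'pending'))
  );
}
```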

CF Queues are at-least-once, not exactly-once. This is documented, but you need to internalize what it means: your queue consumer MUST be idempotent. The same message can be delivered multiple times. If your delivery handler isn't idempotent, you'll send duplicate webhooks to your customers. We use KV-based deduplication keyed on event ID.
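The dedup check is a guard at the top of the delivery handler. A sketch under stated assumptions: the key format, the 24-hour TTL, and the `KV` interface below are illustrative stand-ins, not EventDock's actual implementation:

```typescript
// Simplified stand-in for the Workers KV binding.
interface KV {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

// Returns true if this event ID was already processed. Note: get-then-put
// is not atomic across colos, so this narrows the duplicate window rather
// than closing it — customer endpoints should still tolerate duplicates.
async function alreadyDelivered(kv: KV, eventId: string): Promise<boolean> {
  const key = `delivered:${eventId}`;
  if (await kv.get(key) !== null) return true;
  await kv.put(key, '1', { expirationTtl: 24 * 60 * 60 }); // 24h window (assumption)
  return false;
}
```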

No setTimeout, no background work. Workers are request-scoped. When the request handler (or queue batch handler) returns, your execution context can be terminated. Any work you haven't awaited is at risk. This is fundamentally different from a long-running Node.js server where fire-and-forget async calls will "probably" complete. On Workers, they probably won't.

Wrangler config is your infrastructure-as-code. Unlike traditional IaC tools (Terraform, Pulumi), Wrangler doesn't have a plan/apply cycle. wrangler deploy just does it. There's no diff, no confirmation, and misconfigured environments are silent failures. Treat wrangler.toml with the same rigor you'd treat a Terraform file — code review every change.

What I'd Do Differently

If I were starting over, I'd invert the delivery architecture. Instead of queue-based delivery as the primary path with cron recovery as a safety net, I'd make the cron the primary mechanism.

Here's why: the cron recovery pattern is inherently reliable. It scans D1 for undelivered events, attempts delivery, and updates state. It doesn't depend on queue health, consumer registration, or message acking semantics. It's a simple polling loop backed by a durable database.
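That polling loop can be sketched as a scan-and-requeue function called from the scheduled handler. The SQL, the `updated_at` column, and the staleness threshold are illustrative assumptions (my actual schema may differ), and the interfaces are simplified stand-ins for the D1 and Queue bindings:

```typescript
// Simplified stand-ins for the D1 and Queue bindings.
interface Stmt { bind(...args: unknown[]): Stmt; all<T>(): Promise<{ results: T[] }> }
interface DB { prepare(sql: string): Stmt }
interface Queue { send(body: unknown): Promise<void> }

interface StuckEvent { id: string; endpoint_id: string }

// Find events stuck in a non-terminal state for too long and requeue them.
async function recoverStuckEvents(
  db: DB,
  queue: Queue,
  olderThanMs: number,
): Promise<number> {
  const cutoff = Date.now() - olderThanMs;
  const { results } = await db
    .prepare(
      "SELECT id, endpoint_id FROM events " +
      "WHERE status IN ('pending', 'delivering') AND updated_at < ?"
    )
    .bind(cutoff)
    .all<StuckEvent>();

  for (const row of results) {
    await queue.send({ eventId: row.id, endpointId: row.endpoint_id });
  }
  return results.length;
}
```

Because the queue consumer is idempotent (see the dedup note above), requeueing an event that was actually mid-delivery is harmless.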

The queue would become a performance optimization — a fast path that delivers events within seconds instead of waiting for the next cron tick. But the system would be correct and complete with ONLY the cron. The queue just makes it faster.

This is the same insight behind patterns like the transactional outbox pattern: write to the database first, process asynchronously second. The database is the source of truth, and the async mechanism is an optimization.

Where It Stands Now

After fixing all four bugs, the system works well:

  • Ingest latency: sub-100ms p99 (usually under 50ms)
  • Delivery: 7 retries with exponential backoff over 2+ hours
  • Recovery: cron scans every 5 minutes for stuck events
  • Dead letter queue: events that exhaust all retries go to DLQ for manual inspection
  • Idempotency: KV-based dedup prevents duplicate deliveries
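For a feel of the retry arithmetic: a doubling schedule starting from a one-minute base gives the "7 retries over 2+ hours" shape. The base and attempt count here are illustrative, not necessarily EventDock's exact values:

```typescript
// Illustrative exponential backoff: the delay before attempt n (0-indexed)
// doubles from a one-minute base: 60s, 120s, 240s, ... 3840s.
function retryDelaySeconds(attempt: number, baseSeconds = 60): number {
  return baseSeconds * 2 ** attempt;
}

const totalSeconds = Array.from({ length: 7 }, (_, n) => retryDelaySeconds(n))
  .reduce((a, b) => a + b, 0);
// totalSeconds is 7620 (~2.1 hours across 7 attempts)
```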

The common thread through all four bugs: distributed systems fail in ways that look like success. Events appeared to be processing. Queue depth looked healthy. The dashboard showed events being received. But under the surface, promises weren't being awaited, queues were silently failing, safety nets weren't running, and ghost workers were eating messages.

The only way to find these bugs was to compare inputs to outputs at every stage: events received vs. events enqueued vs. events delivered. Any discrepancy means something is silently dropping data. If you're building on Cloudflare Workers (or any edge compute platform), build that end-to-end observability from day one. You'll need it.


EventDock is a webhook reliability layer for teams that can't afford to lose events. If you've hit these kinds of problems and don't want to build the infrastructure yourself, check out eventdock.app.
