My Durable Objects alarm loop burned CPU for 3 days before I noticed — here's what the docs miss

#cloudflare #serverless #webdev #tutorial

A 30-line Slack notifier was eating 60% of my CPU-time budget across 4 client Workers. The culprit: a runaway alarm loop I assumed was impossible.

Here's the part Cloudflare's docs don't spell out clearly: when your alarm() handler throws repeatedly, the platform retries with exponential backoff, but the backoff has a ceiling. In my testing it capped at around 30-minute intervals after roughly 5 consecutive failures, and from then on the alarm kept firing at that interval for as long as the handler stayed broken. I never saw it give up or drop the alarm. A handler that fails like this keeps your DO alive and burning CPU-ms until you intervene or delete the object entirely.
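To make the shape of that retry curve concrete, here is a minimal sketch of a capped exponential backoff. The 30-minute ceiling is the value I observed; the 60-second base delay is a made-up illustrative number, not a documented Cloudflare constant, and `retryDelayMs` is a hypothetical helper, not a platform API.

```typescript
// Illustrative only: BASE_DELAY_MS is an assumed value chosen so the curve
// reaches the ceiling after about 5 failures. It is NOT Cloudflare's real base delay.
const BASE_DELAY_MS = 60_000;

// Ceiling observed in testing (roughly 30 minutes).
const CEILING_MS = 30 * 60_000;

// Delay before the next retry, given how many times the handler
// has failed in a row (1 = first failure).
function retryDelayMs(consecutiveFailures: number): number {
  const uncapped = BASE_DELAY_MS * Math.pow(2, consecutiveFailures - 1);
  // Once the uncapped delay passes the ceiling, every later retry
  // uses the ceiling. Nothing in this model ever stops retrying.
  return Math.min(uncapped, CEILING_MS);
}
```

Under these assumed numbers the delay doubles from 1 minute up to 16 minutes on the fifth failure, then stays pinned at 30 minutes on every failure after that.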

The fix that actually changed my production behavior was catching exceptions inside alarm() and rescheduling manually instead of letting the platform control retry timing:

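async alarm(): Promise<void> {
  try {
    await this.runBatchInsert();
  } catch (err) {
    // Swallow the error so the platform's automatic retry is not triggered.
    console.error("alarm handler failed:", err);
    // Retry on our own schedule instead: a fixed 60 seconds from now.
    const next = Date.now() + 60_000;
    await this.ctx.storage.setAlarm(next);
  }
}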

If you rethrow, you get exponential backoff you can't configure. If you catch and reschedule yourself, you get a fixed 60-second window — or whatever fits your external API's rate limits. For anything touching D1 or a third-party endpoint, I want that control.
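If you want the retry delay to be a parameter rather than a hard-coded 60 seconds, the catch-and-reschedule logic can be pulled into a small helper. This is a sketch, not the code from my production Workers: `AlarmStorage`, `nextRetryTime`, and `runWithManualRetry` are hypothetical names, and the interface only models the one storage method the helper needs so it can be exercised without the Workers runtime.

```typescript
// Minimal structural type for the slice of Durable Object storage used here.
// The real ctx.storage object satisfies it through its setAlarm method.
interface AlarmStorage {
  setAlarm(scheduledTime: number): Promise<void>;
}

// Hypothetical helper: next retry time for a fixed delay.
function nextRetryTime(now: number, retryDelayMs: number): number {
  return now + retryDelayMs;
}

// Runs `work`. On failure, logs the error and schedules a retry after
// `retryDelayMs` instead of rethrowing, so platform backoff never applies.
// Resolves to true when the work succeeded, false when a retry was scheduled.
async function runWithManualRetry(
  storage: AlarmStorage,
  work: () => Promise<void>,
  retryDelayMs: number,
  now: () => number = Date.now,
): Promise<boolean> {
  try {
    await work();
    return true;
  } catch (err) {
    console.error("alarm handler failed:", err);
    await storage.setAlarm(nextRetryTime(now(), retryDelayMs));
    return false;
  }
}
```

Inside the handler this would be called as `await runWithManualRetry(this.ctx.storage, () => this.runBatchInsert(), 60_000)`. Note that a fixed delay still retries forever if the work never succeeds; it only changes who controls the timing.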

The other thing worth knowing: scheduling an alarm counts as storage state. Even if you delete every application key from a DO's storage, the alarm timestamp keeps the object alive. I've used this deliberately to build zero-key heartbeat DOs — lightweight, no application data, just a recurring alarm keeping the object warm.
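A zero-key heartbeat can be sketched as a small class that only ever touches the alarm, never `put()` or `get()`. The `Heartbeat` class and `HeartbeatStorage` interface are hypothetical names for illustration; the three methods mirror `getAlarm`, `setAlarm`, and `deleteAlarm` on Durable Object storage, which the real `ctx.storage` should satisfy structurally.

```typescript
// The alarm-related slice of Durable Object storage.
interface HeartbeatStorage {
  getAlarm(): Promise<number | null>;
  setAlarm(scheduledTime: number): Promise<void>;
  deleteAlarm(): Promise<void>;
}

// Hypothetical heartbeat: holds no application keys, only a recurring alarm.
class Heartbeat {
  private readonly storage: HeartbeatStorage;
  private readonly intervalMs: number;

  constructor(storage: HeartbeatStorage, intervalMs: number) {
    this.storage = storage;
    this.intervalMs = intervalMs;
  }

  // Arm the heartbeat only if no alarm is already pending.
  async start(now: number = Date.now()): Promise<void> {
    if ((await this.storage.getAlarm()) === null) {
      await this.storage.setAlarm(now + this.intervalMs);
    }
  }

  // Call from alarm(): re-arm for the next interval.
  async beat(now: number = Date.now()): Promise<void> {
    await this.storage.setAlarm(now + this.intervalMs);
  }

  // Remove the alarm so nothing keeps the object scheduled.
  async stop(): Promise<void> {
    await this.storage.deleteAlarm();
  }
}
```

The flip side of the design is `stop()`: because the alarm itself is state, clearing your application keys is not enough to retire one of these objects. The alarm has to be deleted explicitly.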

What I didn't cover here: the generation-counter pattern for discarding stale alarm invocations, a watchdog Worker (Cron Trigger + KV scan) that catches missed alarms for billing jobs, and an alarm tombstoning approach for non-idempotent work like invoice writes where exactly-once semantics actually matter.

I wrote up the full breakdown — including all three recovery patterns with production code — over on dailymanuallab.com.

Full post →