The cron job that failed silently for three weeks (and the dead-man's-switch I built after)

#devops #monitoring #showdev #webdev

A backup script of mine stopped running on a Tuesday. I found out three weeks later, also on a Tuesday, when I actually needed one of those backups and the most recent file was 21 days old. The cron entry was still there. The server was up. The script just... wasn't running, and nothing told me.

That's the thing about scheduled jobs: when they work, they're invisible, and when they stop, they're also invisible. A failed web request throws a 500 someone notices. A cron job that quietly dies makes no noise at all until the day you need what it was supposed to produce.

The fix for this is an old idea called a dead man's switch. Instead of asking "did this job error?", you flip it around: the job pings a URL every time it finishes, and something else watches for the ping. If the ping doesn't arrive on schedule, that something else alerts you. The absence of a heartbeat is the signal. I got tired of half-wiring this with Healthchecks-style snippets across boxes, so I built a small hosted version of it: CronCanary.
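To make the "absence of a heartbeat" idea concrete, here is a minimal sketch of the watcher side as a timestamp comparison. It is not CronCanary's actual code: the function name `is_overdue`, its parameters, and the grace period are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch of the watcher side of a dead man's switch.
# is_overdue and its parameters are hypothetical names, not a real API.

# is_overdue LAST_PING PERIOD GRACE NOW
# All arguments are integer seconds; LAST_PING and NOW are Unix timestamps.
# Succeeds (exit 0) when more than PERIOD + GRACE seconds have passed
# since the last ping, i.e. the expected heartbeat never arrived.
is_overdue() {
  local last_ping="$1" period="$2" grace="$3" now="$4"
  (( now - last_ping > period + grace ))
}

# Example: a daily job (86400 s) with a 10-minute grace period (600 s).
last_ping=1700000000
now=$(( last_ping + 90000 ))   # 25 hours after the last ping
if is_overdue "$last_ping" 86400 600 "$now"; then
  echo "ALERT: no heartbeat for $(( now - last_ping )) seconds"
else
  echo "ok"
fi
```

The point of the design is that the job never has to report a failure. It only has to report success, and silence does the rest.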

How it works

It's about as boring as monitoring should be:

You create a check and pick how often the job should run (every hour, daily at 06:00, whatever).
You add one line to the end of your job: curl https://croncanary.dev/ping/<your-token>.
If a ping doesn't show up inside the window you set, CronCanary emails you. That's it.
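Step 2 can be sketched like this. One assumption on my part: you usually want the ping sent only when the job exits successfully, so that a failing job counts as a missed heartbeat. The helper name `run_and_ping` and the backup path are placeholders, and the curl flags are a suggestion rather than something CronCanary requires.

```shell
#!/usr/bin/env bash
# Sketch: run a job and send the heartbeat only if the job succeeded.
# run_and_ping is a hypothetical helper, not part of CronCanary.

# run_and_ping PING_URL COMMAND [ARGS...]
run_and_ping() {
  local ping_url="$1"
  shift
  if "$@"; then
    # -f: treat HTTP errors as failures, -sS: quiet but still print errors,
    # -m 10: give up after 10 seconds, --retry 3: retry transient failures.
    curl -fsS -m 10 --retry 3 -o /dev/null "$ping_url"
  else
    local status=$?
    echo "job failed with exit code $status; skipping ping" >&2
    return "$status"
  fi
}

# Usage from a script that cron invokes (the backup path is a placeholder):
#   run_and_ping "https://croncanary.dev/ping/<your-token>" /usr/local/bin/backup.sh
#
# The same idea as a single crontab line, daily at 06:00:
#   0 6 * * * /usr/local/bin/backup.sh && curl -fsS -m 10 --retry 3 -o /dev/null "https://croncanary.dev/ping/<your-token>"
```

If the job fails, no ping is sent and the check goes overdue. If the job succeeds but the ping itself cannot be delivered, the helper returns curl's exit code, so cron's own mail or logs still have something to show.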

No agent to install, no daemon to keep alive, no library to import. If your job can make one HTTP request when it finishes, it can be monitored — which means it works for a bash cron line, a GitHub Action, a Kubernetes CronJob, a Lambda, or that one Python script on a Raspberry Pi in your closet.

Why a Cloudflare Worker

The whole backend is a single Cloudflare Worker with D1 for state and a cron trigger that sweeps for overdue checks. That choice does a few useful things:

The ping endpoint lives on the edge, so recording a heartbeat is a cheap, low-latency write from anywhere in the world.
There's no origin box of mine that can go down and silently stop watching your jobs — which would be a darkly funny way for a monitoring tool to fail.
It's cheap enough to run that a genuinely useful free tier is sustainable rather than a loss-leader trick.

I dogfood it, too: CronCanary monitors its own scheduled sweep, and that check's status is public. Over the last 90 days it has been at 100% uptime, and you can watch it live here: https://croncanary.dev/status?id=05725befcd7e48e79fb281c15b1ae49afa8b6cfc (that page also has a public status badge you can embed, the same one I use).

What it deliberately isn't

It's not full APM. It doesn't trace your code or profile anything. It answers exactly one question — did this thing run when it was supposed to? — and tries to answer it well.
It won't tell you why a job died, only that it didn't check in. That's usually enough to go look; sometimes it isn't.
It's a small tool that does one job. I'd rather it be obviously correct about that one job than mediocre at ten.

There's a paid tier on the pricing page for more checks and shorter intervals, but the free tier is real and enough for a handful of personal cron jobs.

Try it: https://croncanary.dev

Genuine question for anyone who runs scheduled jobs in production: what actually pages you today when a cron job silently stops? A heartbeat monitor, a log alert, a dashboard you remember to check, or — like me for three weeks — nothing? I'm trying to figure out whether the "nothing" case is as common as I suspect.

DEV Community
