A user emailed last year to ask why their weekly export was two months stale.
I looked at the cron job. It was still in the crontab, and the logs said it was completing successfully. Except the most recent logs were from two months ago. The job had silently stopped completing back then, and nobody had noticed because there were no alerts and no errors anywhere anyone was looking - the job just... wasn't there anymore.
The cause was a deploy that changed an environment variable the job depended on. The job would start, hit the config error, and exit with a non-zero code. Cron would note this in syslog. Nobody was watching syslog.
This is the most common category of "invisible" production failure. The job exists. You can see it in crontab. It just isn't running. And nothing is watching whether it ran.
Why ping monitoring doesn't work for cron jobs
If your service has an HTTP endpoint, you can ping it. Cron jobs don't have HTTP endpoints. You could add one - a /status route that returns the last run time - but now you're building monitoring infrastructure into every job, and you still have to remember to check that endpoint on a schedule that aligns with when the job runs.
The bigger problem: a ping monitor checks whether something is reachable. For cron jobs, the question is whether something happened. Those are completely different things.
The host your weekly backup job runs on could be "reachable" in whatever sense you can ping it, and that tells you nothing about whether the backup actually ran last Sunday at 3 AM.
Heartbeat monitoring: the inverted model
Heartbeat monitoring inverts the check. Instead of the monitor asking "is the service up?", the job tells the monitor "I finished successfully."
The pattern:
- Create a heartbeat monitor with a period (how often the job runs) and a grace period (how long after the expected time to wait before alerting)
- At the end of your cron job - after all the work is done, on the success path - send an HTTP ping to the heartbeat URL
- If the monitor doesn't receive a ping within period + grace, fire an alert
That's it. The failure modes it catches:
- Job didn't run at all (cron config broken, cron daemon down, environment issue)
- Job ran but exited early with an error before reaching the success ping
- Job ran and succeeded, but finished later than period + grace allows (catches jobs that are silently degrading over time)
- Job was removed or disabled by accident
The beauty of it is that the monitoring lives outside the job. You don't have to instrument every job heavily - you add one line at the end of the success path.
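The alerting rule from the pattern above (no ping within period + grace) can be sketched in a few lines of Go. This is a minimal sketch of what a heartbeat monitor might do on its side, not GrabDiff's actual implementation; the `Monitor` type and its field and method names are assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Monitor is a hypothetical heartbeat monitor record.
type Monitor struct {
	Period   time.Duration // how often the job is expected to run
	Grace    time.Duration // extra slack before alerting
	LastPing time.Time     // when the last success ping arrived
}

// Deadline is the latest moment the next ping may arrive without alerting.
func (m Monitor) Deadline() time.Time {
	return m.LastPing.Add(m.Period + m.Grace)
}

// Overdue reports whether an alert should fire at time now.
func (m Monitor) Overdue(now time.Time) bool {
	return now.After(m.Deadline())
}

func main() {
	// A daily 3 AM job with a 2-hour grace period.
	m := Monitor{
		Period:   24 * time.Hour,
		Grace:    2 * time.Hour,
		LastPing: time.Date(2024, 1, 1, 3, 0, 0, 0, time.UTC),
	}
	fmt.Println(m.Overdue(time.Date(2024, 1, 2, 4, 0, 0, 0, time.UTC))) // false: still inside the window
	fmt.Println(m.Overdue(time.Date(2024, 1, 2, 6, 0, 0, 0, time.UTC))) // true: the deadline was 05:00
}
```

Note that the monitor never needs to know why the ping is missing. Every failure mode in the list collapses into the same observable fact: the deadline passed without a ping.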
The implementation
At the simplest level, heartbeat monitoring is just:
# at the end of your script, after all work is done
curl -fsS --retry 3 "https://grabdiff.com/ping/your-unique-slug" > /dev/null
GrabDiff gives you a unique unguessable URL for each heartbeat monitor. Set the period to match your cron schedule, set a grace period (I use 15 minutes for jobs that run hourly, a few hours for daily jobs), and you're done. If the ping doesn't arrive within that window, you get an email.
For application code instead of shell scripts, it's the same idea:
import (
	"context"
	"fmt"
	"log/slog"
	"net/http"
	"os"
)

func runExportJob(ctx context.Context) error {
	// do the work (exportData stands in for the job's real logic)
	if err := exportData(ctx); err != nil {
		return fmt.Errorf("export failed: %w", err) // no ping sent
	}
	// only ping on success
	resp, err := http.Get(os.Getenv("HEARTBEAT_URL"))
	if err != nil {
		slog.Warn("heartbeat ping failed", "err", err)
		// don't fail the job over a monitoring ping failure
		return nil
	}
	resp.Body.Close() // release the connection
	return nil
}
A few things worth noting:
Ping only on the success path. The monitor needs to distinguish "ran and succeeded" from "ran and failed." If you ping on both success and failure, you lose that signal.
Don't fail the job if the ping fails. Your backup job shouldn't fail because the monitoring endpoint was momentarily unreachable. Log it, but keep it separate from the job's exit code.
Include the job output in your error handling somewhere. The heartbeat tells you the job didn't complete - it doesn't tell you why. Make sure your job logs to somewhere you can check when an alert fires.
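If you want the Go version to be about as forgiving as the curl one-liner with --retry 3, the ping can be pulled into a small helper. This is a sketch under assumptions: the name `pingHeartbeat`, the 10-second timeout, and the linear backoff are my own choices for illustration, not anything a heartbeat service requires. The helper returns a bool rather than an error on purpose, so the caller can't accidentally fail the job over it.

```go
package main

import (
	"fmt"
	"log/slog"
	"net/http"
	"os"
	"time"
)

// pingHeartbeat sends the success ping with a timeout and a few retries.
// It reports whether a ping got through, and deliberately returns no error
// so the caller cannot fail the job over a monitoring problem.
func pingHeartbeat(url string, attempts int) bool {
	if url == "" {
		slog.Warn("heartbeat URL is not set; skipping ping")
		return false
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		slog.Warn("heartbeat URL is invalid", "err", err)
		return false
	}
	client := &http.Client{Timeout: 10 * time.Second}
	for i := 1; i <= attempts; i++ {
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode >= 200 && resp.StatusCode < 300 {
				return true
			}
			err = fmt.Errorf("unexpected status %s", resp.Status)
		}
		slog.Warn("heartbeat ping failed", "attempt", i, "err", err)
		if i < attempts {
			time.Sleep(time.Duration(i) * time.Second) // simple linear backoff
		}
	}
	return false
}

func main() {
	// ... the job's real work goes here; return early on failure ...

	// HEARTBEAT_URL is the environment variable used in the example above.
	if !pingHeartbeat(os.Getenv("HEARTBEAT_URL"), 3) {
		slog.Warn("no heartbeat sent; the job itself still succeeded")
	}
}
```

The timeout matters more than it looks: without one, a hung monitoring endpoint could keep an otherwise finished job alive indefinitely.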
Choosing period and grace
The period should match your cron schedule exactly. If your job runs 0 3 * * * (3 AM daily), your period is 24 hours.
Grace period is trickier. You want it long enough to not alert on jobs that take slightly longer than usual, but short enough to catch actual failures before they become a problem.
My rules of thumb:
- Minute-level jobs: grace = 5 minutes
- Hourly jobs: grace = 15 minutes
- Daily jobs: grace = 2–4 hours (depending on how bad a missed run is)
- Weekly jobs: grace = 12 hours
For anything with a business impact - billing runs, data exports, email digests - I keep the grace short and accept the occasional false positive from a slow run. A false positive is annoying. A missed billing run is worse.
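If you manage more than a handful of monitors, the rules of thumb above can be written down as a default so nobody has to remember them. A rough sketch: `graceFor` is a hypothetical helper, and the cut-offs between tiers (for example, treating a job that runs every six hours like an hourly one) are my assumption, since the list only names four schedules.

```go
package main

import (
	"fmt"
	"time"
)

// graceFor returns a default grace period for a job that runs every period,
// following the rules of thumb above. Tier boundaries are an assumption.
func graceFor(period time.Duration) time.Duration {
	switch {
	case period < time.Hour: // minute-level jobs
		return 5 * time.Minute
	case period < 24*time.Hour: // hourly jobs
		return 15 * time.Minute
	case period < 7*24*time.Hour: // daily jobs
		return 2 * time.Hour // lower end of the 2-4 hour range
	default: // weekly or rarer
		return 12 * time.Hour
	}
}

func main() {
	periods := []time.Duration{time.Minute, time.Hour, 24 * time.Hour, 7 * 24 * time.Hour}
	for _, p := range periods {
		fmt.Printf("period %v -> grace %v\n", p, graceFor(p))
	}
}
```

Treat the result as a starting point and tighten it by hand for the jobs with business impact.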
The jobs worth monitoring
Every shop has a slightly different list, but the categories that almost always have critical unmonitored jobs:
Data exports and reports. Whatever you're generating for customers or stakeholders on a schedule. When these stop, you find out a week later.
Billing and subscription processing. Failed renewal attempts, expired trial follow-ups, invoice generation. Silent failures here have direct revenue impact.
Email digests and notifications. Users set expectations based on these. When they stop arriving, your support queue fills up.
Database backups. The one you really, really don't want to discover has been failing when you actually need a restore.
Search index updates. If your search depends on a nightly rebuild job and the job stops, search quietly degrades until someone notices results are stale.
Cache warming and pre-computation. These often run before peak traffic. If they don't run, you don't notice until peak traffic hits and things are slow.
Go through your crontab right now. For each job, ask: "How would I know if this stopped running?" If the answer is "a user would tell me," that job needs a heartbeat.
Start/fail endpoints for richer monitoring
Some heartbeat systems (including GrabDiff) support optional start and fail endpoints in addition to the success ping.
- Start endpoint: ping when the job begins. Lets you track job duration and alert if a job runs too long.
- Fail endpoint: explicit failure ping for when you want to differentiate "didn't run" from "ran and explicitly failed."
The start endpoint is the one I actually use regularly. Combined with the success ping, you get duration tracking. If a job that normally takes 3 minutes suddenly takes 45 minutes, that's worth knowing about even if it technically "succeeded."
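#!/bin/bash
# Signal that the job has started (enables duration tracking)
curl -fsS "https://grabdiff.com/ping/your-slug/start" > /dev/null

# do_work is a placeholder for your job's actual command.
# Testing it directly avoids checking $? from the wrong command.
if ! do_work; then
    curl -fsS "https://grabdiff.com/ping/your-slug/fail" > /dev/null
    exit 1
fi

# Success ping, only reached if the work succeeded
curl -fsS "https://grabdiff.com/ping/your-slug" > /dev/null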
This is probably more than you need for most jobs. One ping at the end of the success path is the right place to start.
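For application code, the same start/fail/success sequence might look like the sketch below. The `runWithPings` wrapper and `errDemo` are hypothetical names, and the `/start` and `/fail` suffixes simply mirror the URLs in the script above. The ping function is passed in, so the sketch prints the URLs instead of sending them; in a real job you would pass a function that does the HTTP request and logs failures.

```go
package main

import (
	"errors"
	"fmt"
)

// errDemo is a stand-in failure used only to exercise the fail path.
var errDemo = errors.New("demo failure")

// runWithPings wraps job with start, fail, and success pings.
// ping is injected so the caller decides how (and whether) to send them.
func runWithPings(baseURL string, ping func(url string), job func() error) error {
	ping(baseURL + "/start")
	if err := job(); err != nil {
		ping(baseURL + "/fail")
		return err
	}
	ping(baseURL) // success ping, only on the success path
	return nil
}

func main() {
	base := "https://grabdiff.com/ping/your-slug" // placeholder URL from the script above
	// Print instead of sending, so the sketch runs without a network.
	logPing := func(url string) { fmt.Println("ping:", url) }

	_ = runWithPings(base, logPing, func() error { return nil })
	err := runWithPings(base, logPing, func() error { return errDemo })
	fmt.Println("second run:", err)
}
```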
The monitoring gap between "it exists" and "it ran"
The broader pattern here: there's a gap between "the system is configured to do a thing" and "the thing actually happened." Ping monitors cover the first half. Heartbeat monitoring covers the second.
Your cron job exists. Your renewal process is configured. Your backup job is in the schedule. The question is whether it's running.
For any job where the answer to "how would I know if this stopped?" is "I wouldn't," add a heartbeat. It's a one-line change to your script and a two-minute setup in your monitoring tool. The jobs I've seen cause the most damage are almost always ones where someone set them up, confirmed they ran once, and then never thought about them again until something downstream broke.
The two-month-stale export was embarrassing. The fix took about 90 seconds.
I wrote this because heartbeat monitoring is one of those things that nobody tells you about until after you've had the incident that makes you wish you'd known. It's not in most intro-to-devops content, and it probably should be.
If you've had a cron job go silent on you - or if you're running jobs right now that you realize have no heartbeat after reading this - drop a comment. I'm also curious whether anyone has a good pattern for monitoring jobs that are supposed to not run (maintenance windows, feature flags that disable background work). That one's trickier and I haven't landed on a clean solution.