deadping

5 Ways Your Cron Jobs Are Failing Silently (and How to Catch Them)

Last month I was on call when a customer reported missing data in their dashboard. Server was fine. Uptime monitor showed green. APM had zero errors. Logs looked completely normal.

Turns out our nightly Stripe sync, a cron job that pulls transaction data into our analytics DB, had been dead for 11 days. The crontab entry was still there. The script was still there. A permissions error after a deploy killed it, and cron just... didn't tell anyone.

I've seen this exact scenario play out at three different companies now. Cron jobs break in ways that nothing else catches. Here are the five patterns I keep running into.

1. Job gets nuked during a deploy

Sneakiest one on this list. You push a deploy. The container rebuilds. Somewhere in the process, the crontab gets wiped or overwritten with a stale version.

Yesterday the job existed. Today it doesn't. And nothing fires an alert because there's no crash, no error, no log line, just absence.

I've seen this happen from:

  • Docker rebuilds that don't preserve the crontab
  • Ansible runs overwriting /etc/crontab with a template someone forgot to update
  • Typing crontab -r instead of crontab -e (one letter, all your jobs gone)
  • A Helm chart upgrade that accidentally dropped a CronJob manifest

There's no error to catch here. The job just stops existing.
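The only defense I know of is checking for presence instead of errors. A rough sketch, run from somewhere outside the deploy pipeline (the check_jobs helper and the job names are mine, not a standard tool; in real use you'd pipe its output into whatever alerts you):

```shell
#!/bin/sh
# Hypothetical drift check: warn if expected jobs are missing from a
# crontab listing. Swap in your own job names.
check_jobs() {
  listing=$1; shift
  for job in "$@"; do
    case $listing in
      *"$job"*) ;;                            # entry still present
      *) echo "MISSING cron entry: $job" ;;   # entry gone -> alert on this
    esac
  done
}

# In real use you'd feed it the live crontab:
#   check_jobs "$(crontab -l 2>/dev/null)" backup.sh stripe-sync.sh
```

The point is that this check lives outside the thing being deployed, so a rebuild that wipes the crontab can't also wipe the check.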

2. Job runs but takes forever

Your backup script runs in 3 minutes. Has for months. Then the database doubles in size and nobody adjusts the script. Now it takes 45 minutes. Then 3 hours. Then it starts overlapping with the next scheduled run.

Meanwhile, the server is getting hammered, other jobs are delayed, and your monitoring says everything is fine because technically, nothing errored.

Usual suspects:

  • Table growth hitting a query that doesn't have the right index
  • Network calls to external APIs starting to time out
  • Lock contention when two instances of the same job collide
  • Disk I/O getting saturated by competing processes

Most monitoring checks "did it error?" but nobody checks "did it finish in a reasonable amount of time?"
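One cheap guard, assuming GNU coreutils timeout is available on the box: cap the runtime so an overrun at least becomes an explicit failure instead of a slow pile-up.

```
# kill the job if it runs past 30 minutes -- an overrun becomes a visible, non-zero exit
0 2 * * * timeout 30m /usr/local/bin/backup.sh
```

On its own this still fails silently (see #3), but it stops runs from stacking up and gives you a non-zero exit code to hook an alert onto.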

3. Non-zero exit code goes to /dev/null

Cron runs your script. Script hits an error. Exits with code 1. Cron tries to email the output, which on a default setup lands in the local mail spool at /var/mail/root, if an MTA is installed at all.

When was the last time you checked /var/mail/root?

# This fails silently on basically every server I've ever worked on
0 2 * * * /usr/local/bin/backup.sh

For cron email delivery to actually work, you need: a configured MTA on the box, a valid MAILTO in the crontab, and that email going to an inbox someone actually reads. In practice, almost nobody has this set up properly.
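If you do want cron's email path to work, the crontab side is at least simple (the address here is a placeholder; you still need a working MTA on the host):

```
MAILTO="ops@example.com"
0 2 * * * /usr/local/bin/backup.sh
```

The crontab is the easy third of the problem. The MTA and the actually-read inbox are the parts that rot.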

4. Job gets skipped and nobody knows

You've got a job running every 5 minutes. Normally takes 30 seconds. But one run takes 8 minutes because Postgres decided it was vacuum time. The next invocation fires while the first is still going.

What happens next depends on your setup:

  • No protection: Both run at once, maybe corrupting data or double-processing records
  • Lock file (flock): Second invocation sees the lock, exits quietly. No record of the skip anywhere.
  • K8s concurrencyPolicy: Forbid: Skips the run. Logs a Kubernetes event. That event expires in an hour. Gone.

The skip is working as designed. But "as designed" still means your data pipeline missed a window and nobody noticed.
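If you're on plain cron with flock, you can at least make the skip leave a trace. A sketch, assuming util-linux flock (the lock path, script path, and exit code 99 are arbitrary choices):

```
# -n: don't wait for the lock; -E 99: exit 99 when the lock is held,
# so a skipped run is distinguishable from the job itself failing
*/5 * * * * flock -n -E 99 /var/lock/sync.lock /usr/local/bin/sync.sh; [ $? -eq 99 ] && logger -t cron-sync "overlap: run skipped"
```

Now every skip at least shows up in syslog, which is something you can grep or alert on.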

5. crond itself dies

Rarest failure on this list, and the worst. The cron daemon gets OOM-killed, or a system update disables it, or (in K8s) the kube-controller-manager pod goes unhealthy. Every single scheduled job stops running at once.

Nothing alerts you. Because the thing responsible for running your alerting job is the thing that broke.

I've only seen this happen twice, but both times it took days to notice.
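The only real mitigation lives outside the box. A minimal sketch, assuming an external monitoring endpoint (the URL is a placeholder): a heartbeat job that pings every minute, so silence from the entire host becomes a signal in itself.

```
# if this ping ever stops, suspect the scheduler (or the whole host)
* * * * * curl -fsS --retry 3 https://your-monitor/ping/heartbeat >/dev/null
```

One dirt-cheap job per scheduler is enough to catch the "everything stopped at once" case.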

The fix that actually works

All five of these failures have the same root problem: you're watching for something to go wrong instead of watching for something to go right.

The dead man's switch pattern flips this. Instead of monitoring for errors, you monitor for the absence of success:

  1. Your job finishes successfully and pings an HTTP endpoint
  2. A monitoring service knows when to expect that ping
  3. If the ping doesn't show up on time, you get alerted

That's it. And it catches everything:

  • Job deleted → no ping → alert
  • Job stuck → ping is late → alert
  • Job errored → no success ping (because of &&) → alert
  • Job skipped → no ping → alert
  • Scheduler dead → no pings from anything → alert

In practice it's one extra line:

# before
0 2 * * * /usr/local/bin/backup.sh

# after
0 2 * * * /usr/local/bin/backup.sh && curl -fsS --retry 3 https://your-monitor/ping/abc123

The && means the curl only runs on success. -f turns HTTP errors into a non-zero exit, -sS silences the progress noise but still prints real errors, and --retry 3 handles transient network blips.

What I use and what I've tried

A few tools that do this:

  • Healthchecks.io: Open source, 20 free checks. Been around for years and it works. The UI is pretty dated, but functionally it does what you need.
  • Cronitor: More polished, lots of integrations. Pricing is $2/monitor + $5/user though, which gets expensive if you have 30+ jobs and a few team members.
  • DeadPing: Full disclosure, this is something I'm building. I wanted something between Healthchecks' bare-bones UI and Cronitor's enterprise pricing. Clean dashboard, Slack/Discord/email, $14/mo for 50 monitors. Still in development; the waitlist is open if you want to try it.
  • Roll your own: Totally doable for a handful of jobs:
from datetime import datetime, timedelta
from flask import Flask

app = Flask(__name__)
checks = {}  # check_id -> time of last successful ping (in-memory, lost on restart)

@app.route('/ping/<check_id>')
def ping(check_id):
    # jobs hit this on success: ... && curl http://host/ping/<check_id>
    checks[check_id] = datetime.utcnow()
    return 'OK'

def check_overdue(check_id, expected_interval_minutes):
    # overdue if we've never heard from the job, or the last ping is too old
    last_ping = checks.get(check_id)
    if last_ping is None:
        return True
    return datetime.utcnow() - last_ping > timedelta(minutes=expected_interval_minutes)

# note: you still need something that calls check_overdue on a schedule
# and actually sends the alert -- that's the part that grows into a product

DIY works fine for 2-3 jobs. Past that, you're now maintaining a monitoring service that itself needs monitoring, and that's a rabbit hole.

Where to start

You don't have to instrument everything at once. Just pick the jobs where silent failure actually hurts:

  1. Database backups: if this is broken when you need it, you're in real trouble
  2. Billing/payment runs: missed invoices hit revenue directly
  3. Data sync / ETL: stale data means bad decisions
  4. Cert renewals: expired cert = site down
  5. Disk cleanup: full disk is a slow-motion disaster

Slap a dead man's switch ping on those five. Takes maybe 20 minutes total. Worth it.


What's the worst silent cron failure you've run into? Always curious to hear other people's war stories.
