Scheduled tasks are the silent workhorses of modern applications. They create backups, send reports, synchronize data, and keep the system running smoothly, right up until they stop. When a cron job fails silently, you might not notice for days or even weeks.
I've experienced this myself many times. The most recent incident involved a GPS data sync between an external API and a client's instance silently failing for several days. The task was configured and the server was running, but data had stopped flowing for an embarrassingly simple reason: the secret key had changed :(. That experience reminded both me and the client that monitoring isn't optional; it's essential.
## The Hidden Cost of Silent Failures
Failed scheduled tasks rarely announce themselves. Unlike a crashed web server that immediately frustrates users, a broken cron job just... stops. The damage accumulates quietly:
- Data loss: Backups that never ran mean recovery becomes impossible
- Stale information: Reports based on outdated data lead to bad decisions
- Cascade failures: One missed sync can break downstream processes
- Compliance issues: Regulatory requirements often mandate specific automated processes
## Building an Effective Incident Response Plan

### 1. Detection: Know When Something Breaks
The first challenge is knowing a failure occurred. Traditional approaches have significant gaps:
Log monitoring catches errors only if the job runs and logs something. A job that never starts produces no logs to monitor.
Heartbeat monitoring flips this approach. Instead of watching for failures, you watch for success signals. If your job doesn't check in within its expected window, something's wrong.
```bash
# At the end of your cron job, send a heartbeat
curl -s https://cronmonitor.app/ping/abc123
```
This simple pattern catches both execution failures and jobs that never started.
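The same pattern fits directly in the crontab by chaining the heartbeat after the job, so the ping only fires on success (a sketch reusing the placeholder URL from above):

```shell
# crontab entry: run the backup at 02:00, ping the monitor only if it exits 0
0 2 * * * /usr/local/bin/do-backup.sh && curl -fsS -m 10 https://cronmonitor.app/ping/abc123
```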
### 2. Classification: Assess the Impact
Not all failures require the same response. Create a severity matrix:
| Severity | Criteria | Response Time |
|---|---|---|
| Critical | Data loss risk, customer impact | Immediate |
| High | Business process affected | Within 1 hour |
| Medium | Degraded functionality | Within 4 hours |
| Low | Minor inconvenience | Next business day |
Your backup jobs? Critical. That weekly analytics report? Probably medium.
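If your alerting is script-driven, the matrix can live in code as a simple lookup. A minimal sketch mirroring the table above:

```shell
#!/bin/bash
# Map a severity level to its target response time (mirrors the matrix above).
response_time() {
    case "$1" in
        critical) echo "immediate" ;;
        high)     echo "within 1 hour" ;;
        medium)   echo "within 4 hours" ;;
        low)      echo "next business day" ;;
        *)        echo "unknown severity: $1" >&2; return 1 ;;
    esac
}
```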
### 3. Notification: Alert the Right People
Effective alerting means reaching the right person through the right channel:
- Email works for low-priority, non-urgent issues
- Slack/Discord suits team-wide visibility and collaborative debugging
- Telegram offers a good balance of immediacy and unobtrusiveness
Avoid alert fatigue by:
- Grouping related failures
- Setting appropriate thresholds before alerting
- Including actionable information in every alert
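One way to implement thresholds is a consecutive-failure counter that resets on success, so a single transient hiccup stays quiet. A sketch, where `notify()` stands in for your real notifier (mail, Slack webhook, etc.):

```shell
#!/bin/bash
# Alert only after THRESHOLD consecutive failures; a success resets the count.
THRESHOLD=3
FAILCOUNT=0

notify() { echo "ALERT: $1"; }   # placeholder for a real notification channel

record_run() {                   # call with the job's exit status
    if [ "$1" -eq 0 ]; then
        FAILCOUNT=0
    else
        FAILCOUNT=$((FAILCOUNT + 1))
        if [ "$FAILCOUNT" -ge "$THRESHOLD" ]; then
            notify "job failed $FAILCOUNT times in a row"
        fi
    fi
}
```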
### 4. Response: Have a Playbook Ready
When alerts fire at 3 AM, you don't want to be figuring out what to do. Document your runbooks:
```markdown
## Backup Job Failure Runbook

### Immediate Actions
1. Check if the job is currently running: `ps aux | grep backup`
2. Review recent logs: `tail -100 /var/log/backup.log`
3. Verify disk space: `df -h`
4. Check database connectivity

### Common Causes
- Disk full → Clear old files, expand storage
- Database locked → Check for long-running queries
- Network timeout → Verify connectivity to remote storage

### Escalation
If unresolved after 30 minutes, contact: [DBA on-call]
```
### 5. Recovery: Get Back to Normal
Once you've identified the problem:
- Fix the immediate issue - Get the job running again
- Verify the fix - Manually trigger the job, confirm success
- Check for data gaps - Did you miss processing any data?
- Backfill if needed - Run catch-up jobs for missed periods
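The backfill step can be a small loop that re-runs a daily job once per missed date. A sketch assuming a hypothetical per-day command that takes a `YYYY-MM-DD` argument; the date arithmetic uses GNU `date`:

```shell
#!/bin/bash
# Re-run a per-day job for every date in [start, end).
# Usage: backfill 2026-01-05 2026-01-08 /usr/local/bin/process-day.sh
backfill() {
    local d="$1" end="$2"
    shift 2
    while [ "$d" != "$end" ]; do
        "$@" "$d"                       # run the job for this date
        d=$(date -d "$d + 1 day" +%F)   # GNU date; adjust for BSD/macOS
    done
}
```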
### 6. Post-Mortem: Learn and Improve
Every incident is a learning opportunity. Document:
- What happened and when
- How it was detected
- Root cause analysis
- What fixed it
- How to prevent recurrence
## Practical Implementation Tips

### Start Simple
You don't need enterprise tooling to monitor cron jobs effectively. Even a basic approach helps:
```bash
#!/bin/bash
# backup.sh
set -e  # Exit on any error

/usr/local/bin/do-backup.sh

# Only reached if the backup succeeded
curl -fsS -m 10 --retry 5 https://cronmonitor.app/ping/abc123
```
### Add Context to Your Alerts
An alert saying "Job failed" is less useful than:
```
ALERT: Daily backup failed
Server: prod-db-1
Last success: 2026-01-05 02:00 UTC
Expected: Every day at 02:00 UTC
Logs: /var/log/backup/2026-01-06.log
```
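A message like this is easy to assemble in the job wrapper itself. A sketch where the caller supplies the field values (all of them placeholders here):

```shell
#!/bin/bash
# Build a contextual alert message from job metadata.
build_alert() {
    local job="$1" server="$2" last_success="$3" schedule="$4" logfile="$5"
    printf 'ALERT: %s failed\nServer: %s\nLast success: %s\nExpected: %s\nLogs: %s\n' \
        "$job" "$server" "$last_success" "$schedule" "$logfile"
}
```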
### Test Your Monitoring
Periodically verify your monitoring actually works:
- Intentionally break a non-critical job
- Confirm the alert fires
- Confirm it reaches the right people
- Confirm the runbook is accurate
## Conclusion
Silent failures are preventable. With proper monitoring, clear escalation paths, and documented runbooks, you transform "we discovered it weeks later" into "we fixed it in minutes."
The investment in incident response pays off not when everything works, but when something inevitably breaks. And in distributed systems with dozens of scheduled tasks, something always breaks eventually.