Your Cron Jobs Are Failing Silently — Here's How to Catch Them

#monitoring #deadmanswitch #devops

The problem nobody talks about

Your monitoring catches errors. It catches high latency, 500s, disk full, OOM kills. But what about the things that simply don't happen?

A cron job that should run every hour... just stops. No error. No log. Nothing.
A nightly ETL that should finish by 4am... never starts.
A data sync that usually happens every ~15 minutes... goes silent.

You find out days later when someone asks "why is this data stale?"
This is the dead man's switch problem. The term comes from train controls: a switch the operator must actively hold down, which triggers a safety response (typically the brakes) the moment it is released.
The same concept applies to software: if an expected signal stops arriving, something is wrong.

Why existing tools don't solve this

Log-based alerts fire only on lines that actually get written. Error-rate alerts need errors to count. If a process never runs, it produces neither, so there is nothing to alert on.

You could write a custom check for each job: "query the DB for the last run timestamp, compare it to now, alert if stale." But that means a new check per job, each with its own threshold logic, and none of them handles irregular schedules.
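To make the cost concrete, here is a minimal sketch of that hand-rolled approach. The job names, thresholds, and the `last_runs` dict (standing in for whatever your database query returns) are all hypothetical, not taken from a real system.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-job thresholds: every new job needs its own entry here.
MAX_AGE = {
    "hourly_report": timedelta(hours=1, minutes=5),
    "nightly_etl": timedelta(hours=25),
}


def is_stale(last_run: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the most recent run is older than max_age."""
    return now - last_run > max_age


def stale_jobs(last_runs: dict[str, datetime], now: datetime) -> list[str]:
    """Names of jobs whose last run is too old, or that never ran at all."""
    stale = []
    for job, max_age in MAX_AGE.items():
        last_run = last_runs.get(job)  # None means no run was ever recorded
        if last_run is None or is_stale(last_run, now, max_age):
            stale.append(job)
    return stale


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Stand-in for the result of a "last run timestamp" query.
    last_runs = {"hourly_report": now - timedelta(minutes=20)}
    print(stale_jobs(last_runs, now))
```

Note that the fixed `MAX_AGE` values only make sense for jobs on a regular schedule, which is exactly the limitation described above.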

How Vigil works

I built Vigil to solve this generically. It sits alongside your Prometheus and Loki and asks one question:
"Did the expected signal arrive on time?"

Frequency mode

For signals on a known schedule. Tell Vigil: "this metric should update every 3600 seconds, with a 300-second (5-minute) grace period." If the signal is late, Vigil moves the switch from UP → GRACE, and then from GRACE → DOWN if it still has not arrived when the grace period ends.
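The frequency-mode logic can be sketched in a few lines. This is an illustration of the idea, not Vigil's actual implementation: the function name, the parameter names, and the exact boundary handling (grace starting right after the interval elapses) are assumptions.

```python
def switch_state(seconds_since_last_signal: float,
                 interval: float = 3600.0,
                 grace: float = 300.0) -> str:
    """Classify a switch from the age of its most recent signal.

    Assumed semantics: UP while within the expected interval, GRACE for
    up to `grace` seconds past it, DOWN after that.
    """
    if seconds_since_last_signal <= interval:
        return "UP"
    if seconds_since_last_signal <= interval + grace:
        return "GRACE"
    return "DOWN"


if __name__ == "__main__":
    for age in (1800, 3700, 4000):
        print(age, switch_state(age))
```

With the defaults above, a signal that is 3700 seconds old is late but still inside the grace window, while one that is 4000 seconds old has passed the 3900-second cutoff.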

Irregularity mode

For signals that aren't on a fixed schedule. Vigil collects samples, computes the median interval between them, and alerts when the current gap exceeds that median times a configurable multiplier. You set one multiplier instead of hand-tuning a threshold in seconds for each job.
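Here is a rough sketch of the median-interval idea, assuming signal arrival times are available as Unix timestamps. The function name, the default multiplier of 3, and the minimum sample count are illustrative choices, not Vigil's real defaults.

```python
from statistics import median


def is_overdue(timestamps: list[float], now: float,
               multiplier: float = 3.0, min_samples: int = 3) -> bool:
    """Return True when the gap since the last signal is unusually long.

    The "usual" gap is the median interval between past signals; the
    signal counts as overdue once the current gap exceeds
    multiplier * median.
    """
    if len(timestamps) < min_samples:
        return False  # not enough history to judge (assumed behaviour)
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    typical_gap = median(intervals)
    return now - ordered[-1] > multiplier * typical_gap


if __name__ == "__main__":
    # A sync that arrives roughly every 15 minutes (900 seconds).
    arrivals = [0, 900, 1750, 2700, 3580]
    print(is_overdue(arrivals, now=4580))  # 1000 s since the last signal
    print(is_overdue(arrivals, now=6580))  # 3000 s since the last signal
```

The median is a reasonable fit here because one unusually long or short gap in the history barely moves it, unlike a mean.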

Alerting

When a switch changes state, Vigil can notify you directly via Slack, Discord, PagerDuty, Telegram, or any webhook. It also exposes Prometheus metrics (dms_switch_status) so you can use your existing Grafana alerting.