DEV Community

Shubhankar Mohan

Your Cron Jobs Are Failing Silently — Here's How to Catch Them

The problem nobody talks about

Your monitoring catches errors. It catches high latency, 500s, disk full, OOM kills. But what about the things that simply don't happen?

  • A cron job that should run every hour... just stops. No error. No log. Nothing.
  • A nightly ETL that should finish by 4am... never starts.
  • A data sync that usually happens every ~15 minutes... goes silent.

You find out days later when someone asks "why is this data stale?"
This is the dead man's switch problem. The term comes from train controls: a switch the operator must actively hold down, which triggers a safety response (an alarm or the brakes) if released.
The same concept applies to software: if an expected signal stops arriving, something is wrong.

Why existing tools don't solve this

Log-based alerts fire only on patterns that actually appear in the logs. Error-rate alerts need errors to count. If a process doesn't run at all, there's nothing to alert on.

You could write a custom check for each job — "query the DB for last run timestamp, compare to now, alert if stale." But that's a new check per job, each with its own threshold logic, and none of them handle irregular schedules.
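To make the cost concrete, here is a minimal sketch of one such hand-rolled check. The function name `is_stale` and the one-hour threshold are illustrative assumptions; in practice `last_run` would come from your own database query, and you would repeat this with a different threshold for every job.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_run: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the most recent run is older than max_age."""
    return now - last_run > max_age


# Hypothetical values; a real check would read last_run from the job's DB table.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_run = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)

if is_stale(last_run, now, max_age=timedelta(hours=1)):
    print("hourly job looks stale, send an alert")
```

A fixed `max_age` like this works for a strict schedule but has no good value for a job that runs at uneven intervals.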

How Vigil works

I built Vigil to solve this generically. It sits alongside your Prometheus and Loki and asks one question:
"Did the expected signal arrive on time?"

Frequency mode

For signals on a known schedule. Tell Vigil: "this metric should update every 3600 seconds, with a 5-minute grace period." If the signal is late, Vigil moves the switch from UP to GRACE, and then to DOWN once the grace period runs out.
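The state logic can be sketched roughly as follows. This is an illustration of the idea, not Vigil's actual code: the function name `switch_state` and the exact boundary handling (inclusive comparisons) are assumptions.

```python
def switch_state(seconds_since_signal: float, interval: float, grace: float) -> str:
    """Classify a switch by how long ago the last signal arrived.

    Assumed rule: UP while within the interval, GRACE while within
    interval + grace, DOWN after that.
    """
    if seconds_since_signal <= interval:
        return "UP"
    if seconds_since_signal <= interval + grace:
        return "GRACE"
    return "DOWN"


# Hourly signal (3600 s) with a 5-minute (300 s) grace period.
for elapsed in (1800, 3700, 4000):
    print(elapsed, switch_state(elapsed, interval=3600, grace=300))
```

The grace period exists so that a job that runs a few minutes late does not page anyone.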

Irregularity mode

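For signals that aren't on a fixed schedule. Vigil collects samples, computes the median interval between them, and alerts when the current gap exceeds that median times a configurable multiplier. Because the baseline is learned from the data, you don't have to hand-pick a threshold for each signal.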
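A minimal sketch of the median-interval idea, under stated assumptions: the function name `is_irregular`, the default multiplier of 3.0, and the minimum of three samples are all illustrative choices, not values taken from Vigil.

```python
from statistics import median


def is_irregular(timestamps: list[float], now: float, multiplier: float = 3.0) -> bool:
    """Return True when the gap since the last signal exceeds
    multiplier * (median interval between past signals)."""
    if len(timestamps) < 3:
        # Too few samples to estimate a typical interval (assumed cutoff).
        return False
    ordered = sorted(timestamps)
    intervals = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    typical = median(intervals)
    return now - ordered[-1] > multiplier * typical


# Signals roughly every 15 minutes (timestamps in seconds).
seen = [0, 900, 1800, 2750, 3600]
print(is_irregular(seen, now=4600))  # gap of 1000 s
print(is_irregular(seen, now=6600))  # gap of 3000 s
```

The median is a reasonable choice here because a single unusually long or short interval in the history does not shift the baseline much.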

Alerting

When a switch changes state, Vigil can notify you directly via Slack, Discord, PagerDuty, Telegram, or any webhook. It also exposes Prometheus metrics (dms_switch_status) so you can use your existing Grafana alerting.

Getting started

docker pull shubhankarmohan/vigil:latest
docker run -d -p 8080:8080 -v vigil-data:/data shubhankarmohan/vigil:latest

Open http://localhost:8080, create a switch pointing at your Prometheus query or Loki log selector, and you're monitoring for silence.

What's next

The project is open source (MIT). I'd love feedback on:

  • What silent failures have bitten you?
  • Any detection modes you'd want beyond frequency/irregularity?

GitHub: https://github.com/shubhankar-mohan/Vigil

Vigil catches what your monitoring misses: silence.
