The problem nobody talks about
Your monitoring catches errors. It catches high latency, 500s, disk full, OOM kills. But what about the things that simply don't happen?
- A cron job that should run every hour... just stops. No error. No log. Nothing.
- A nightly ETL that should finish by 4am... never starts.
- A data sync that usually happens every ~15 minutes... goes silent.
You find out days later when someone asks "why is this data stale?"
This is the dead man's switch problem. The term comes from train controls: the operator must actively hold a switch down, and releasing it triggers an alarm.
The same concept applies to software: if an expected signal stops arriving, something is wrong.
Why existing tools don't solve this
Log-based alerts trigger on patterns they see. Error-rate alerts need errors to count. If a process doesn't run at all, there's nothing to alert on.
You could write a custom check for each job: "query the DB for last run timestamp, compare to now, alert if stale." But that's a new check per job, each with its own threshold logic, and a fixed staleness threshold doesn't fit jobs that run on irregular schedules.
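To make the per-job cost concrete, here is a minimal sketch of one such hand-rolled check. The function name, the timestamps, and the one-hour threshold are all hypothetical; in a real check, last_run would come from a query against your own database.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_run: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the last recorded run is older than max_age."""
    return now - last_run > max_age


# Hypothetical values standing in for "SELECT max(finished_at) FROM job_runs".
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last_run = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)

# The job was expected hourly but last ran 2.5 hours ago.
print(is_stale(last_run, now, timedelta(hours=1)))  # prints True
```

Every job needs its own copy of this with its own max_age, which is the duplication described above.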
How Vigil works
I built Vigil to solve this generically. It sits alongside your Prometheus and Loki and asks one question:
"Did the expected signal arrive on time?"
Frequency mode
For signals on a known schedule. Tell Vigil: "this metric should update every 3600 seconds (one hour), with a 5-minute grace period." If the signal is late, Vigil moves the switch from UP to GRACE, and then to DOWN if the signal still hasn't arrived when the grace period ends.
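The state logic can be sketched roughly as follows. This is an illustration of the idea, not Vigil's actual code: the SwitchState enum, the frequency_state function, and the exact boundary handling (whether "exactly on the deadline" counts as late) are assumptions.

```python
from enum import Enum


class SwitchState(Enum):
    UP = "UP"
    GRACE = "GRACE"
    DOWN = "DOWN"


def frequency_state(seconds_since_signal: float,
                    interval_s: float,
                    grace_s: float) -> SwitchState:
    """Classify a switch by how long ago its last signal arrived."""
    if seconds_since_signal <= interval_s:
        return SwitchState.UP       # signal arrived within the expected interval
    if seconds_since_signal <= interval_s + grace_s:
        return SwitchState.GRACE    # late, but still inside the grace period
    return SwitchState.DOWN         # late beyond the grace period


# Hourly signal with a 5-minute grace period, as in the example above.
for elapsed in (1800, 3700, 4000):
    print(elapsed, frequency_state(elapsed, 3600, 300).value)
```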
Irregularity mode
For signals that aren't on a fixed schedule. Vigil collects samples, computes the median interval between them, and alerts when the gap since the last signal exceeds that median by a configurable multiplier. No per-job threshold tuning needed.
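A rough sketch of the median-interval idea, under stated assumptions: the function name, the default multiplier of 3.0, and the minimum of three samples are hypothetical choices for illustration, not values taken from Vigil.

```python
from statistics import median
from typing import Sequence


def gap_is_anomalous(arrivals: Sequence[float],
                     now: float,
                     multiplier: float = 3.0) -> bool:
    """Flag silence when the current gap exceeds multiplier x the median interval.

    arrivals: timestamps (in seconds) of past signals, in ascending order.
    """
    if len(arrivals) < 3:
        return False  # too few samples to estimate a typical interval
    intervals = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    typical = median(intervals)
    return (now - arrivals[-1]) > multiplier * typical


# A sync that lands roughly every 15 minutes (900 s), with some jitter.
arrivals = [0, 900, 1750, 2700, 3600]
print(gap_is_anomalous(arrivals, now=4500))  # gap of 900 s: prints False
print(gap_is_anomalous(arrivals, now=6400))  # gap of 2800 s: prints True
```

The median is a reasonable choice here because one unusually long or short interval in the history does not shift the baseline much.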
Alerting
When a switch changes state, Vigil can notify you directly via Slack, Discord, PagerDuty, Telegram, or any webhook. It also exposes Prometheus metrics (dms_switch_status) so you can use your existing Grafana alerting.
Getting started
docker pull shubhankarmohan/vigil:latest
docker run -d -p 8080:8080 -v vigil-data:/data shubhankarmohan/vigil:latest
Open http://localhost:8080, create a switch pointing at your Prometheus query or Loki log selector, and you're monitoring for silence.
What's next
The project is open source (MIT). I'd love feedback on:
- What silent failures have bitten you?
- Any detection modes you'd want beyond frequency/irregularity?
GitHub: https://github.com/shubhankar-mohan/Vigil
Vigil catches what your monitoring misses: silence.