Unattended Daily Health Checks That Catch the Silent Failures

#monitoring #devops #automation #python

We don't just talk about automation -- we run on it. This is the system that probes our own
live products three times a day, blocks on real failures, and sends a report it has
fact-checked against live state first. We build what we use.

The problem

Things break quietly. A payment page silently stops loading the checkout. A scheduled job dies
and nobody notices for weeks. An automated report keeps sending -- but with stale, hardcoded
numbers from a template nobody updated. By the time a human catches it, the damage is days old.
Manual spot-checks do not scale and do not run at 3am.

The danger is not loud failure. It is confident, silent failure: a system that keeps reporting
"fine" while it quietly breaks.

The workflow

[ Scheduled checks ] -> [ Pass / fail gate ] -> [ Self-review ] -> [ Report to phone ]
  3x daily               live HTTP + regress     strip stale/wrong   pass / fail

1. A full check, three times a day

A combined health check runs across multiple live products. Operational checks -- live HTTP
reachability, a payment regression assertion, a render check, and a deployment-secret-leak scan -- are
pass/fail and block. Content-lint findings are collected as a non-blocking backlog, so noise
never masks a real outage.
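
A minimal sketch of that split between blocking checks and a non-blocking backlog, under stated assumptions: the names Check, run_checks, and http_ok are hypothetical, and the URL in the usage note is a placeholder. It illustrates the gating idea and is not the runner we actually ship.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Tuple

CheckResult = Tuple[bool, str]  # (passed, human-readable detail)


@dataclass
class Check:
    name: str
    run: Callable[[], CheckResult]
    blocking: bool = True  # False = finding goes to the backlog only


def http_ok(url: str, timeout: float = 10.0) -> CheckResult:
    """Live reachability probe: passes only on a non-error HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:  # 4xx / 5xx responses
        return False, f"HTTP {exc.code}"
    except OSError as exc:  # DNS failure, refused connection, timeout
        return False, str(exc)


def run_checks(checks: List[Check]) -> Tuple[List[str], List[str]]:
    """Return (blocking failures, non-blocking backlog)."""
    failures: List[str] = []
    backlog: List[str] = []
    for check in checks:
        try:
            ok, detail = check.run()
        except Exception as exc:
            # A check that crashes is reported as a failure, never as a pass.
            ok, detail = False, f"check raised {type(exc).__name__}: {exc}"
        if ok:
            continue
        line = f"{check.name}: {detail}"
        (failures if check.blocking else backlog).append(line)
    return failures, backlog
```

A scheduled entry point would build the list, for example Check("homepage", lambda: http_ok("https://shop.example/")), call run_checks, and exit non-zero whenever the first list is non-empty. Lint findings land in the second list and never change the exit code.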

2. A regression guard on the thing that makes money

One check specifically asserts that the checkout page still includes the payment-gateway domains
it needs. This is a direct guard against a real incident we had, where a security header silently
broke the live checkout. Now that exact failure is caught automatically on every run instead of
being discovered by a customer.
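
One way such a guard could look, assuming the security header in question was a Content-Security-Policy (the incident description above does not name it). The function names and the pay.example domains are hypothetical, and the fallback to default-src is simplified compared with what browsers do.

```python
from typing import Dict, Iterable, List


def parse_csp(header: str) -> Dict[str, List[str]]:
    """Split a CSP header into {directive: [source, ...]}."""
    policy: Dict[str, List[str]] = {}
    for part in header.split(";"):
        tokens = part.split()
        if tokens:
            # Duplicate directives are ignored, so the first one wins.
            policy.setdefault(tokens[0].lower(), tokens[1:])
    return policy


def _host(source: str) -> str:
    """Strip scheme and path from a source expression."""
    return source.split("://", 1)[-1].split("/", 1)[0].lower()


def _covers(source: str, domain: str) -> bool:
    host = _host(source)
    if host == "*":
        return True
    if host.startswith("*."):
        # "*.pay.example" covers subdomains but not "pay.example" itself.
        return domain.endswith(host[1:])
    return host == domain


def missing_sources(header: str, directive: str,
                    required: Iterable[str]) -> List[str]:
    """Return the required domains that the policy would block."""
    policy = parse_csp(header)
    allowed = policy.get(directive)
    if allowed is None:
        allowed = policy.get("default-src")  # simplified fallback
    if allowed is None:
        return []  # no applicable directive, so nothing is restricted
    return [d for d in required
            if not any(_covers(src, d) for src in allowed)]
```

The check fails the run when missing_sources returns a non-empty list for the checkout response's header, which is the condition that would otherwise only surface when a customer cannot pay.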

3. Hunting silent failures

A dedicated sweep found 19 dead scheduled tasks, a reporting job that always reported revenue
as zero because of a module-alias mismatch, and a promo job crashing silently on a byte-order
mark (BOM). All were removed or fixed. The shared lesson -- a bare except hides exactly these failures -- is baked
into the checks rather than learned again next quarter.
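
To make the bare-except lesson concrete, here is a small sketch. It assumes a hypothetical promo job that reads a JSON file; the file format and function names are illustrative, since the original job is not shown here.

```python
import json
import logging
from pathlib import Path

log = logging.getLogger("jobs")


def load_promos_silently(path):
    """Anti-pattern: any error, including a BOM, looks like 'no promos'."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except:  # noqa: E722 (left bare on purpose to show the problem)
        return []


def load_promos(path):
    """Tolerate a BOM, and let real errors surface where a check can see them."""
    # utf-8-sig decodes plain UTF-8 and drops a leading BOM if one is present.
    text = Path(path).read_text(encoding="utf-8-sig")
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        log.exception("promo file %s is not valid JSON", path)
        raise  # the job fails loudly instead of reporting an empty result
```

With a BOM-prefixed file, the first function returns an empty list and the job appears healthy. The second function parses the same file, and on genuinely malformed input it logs and re-raises so the scheduler records a failure.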

4. Reports that self-review before sending

A real recurring problem was the auto-report going out with stale template data and leftover
placeholders. We added a final self-check pass that runs against a live measured snapshot and
strips any figure that contradicts reality, then localizes the text and removes placeholders --
all before the message is sent. A leftover hardcoded send-script that had been re-firing old
data was caught and removed in the process.
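
The self-check can be illustrated with a deterministic stand-in. This is not the LLM review pass listed in the stack, only a sketch of the rule it enforces. It assumes report figures appear on lines shaped like "key: number", that placeholders look like {{name}}, TODO, or TBD, and that the live snapshot is a plain dict. All of those are assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple

PLACEHOLDER = re.compile(r"\{\{.*?\}\}|\bTODO\b|\bTBD\b")
FIGURE = re.compile(r"^(?P<key>[a-z_]+):\s*(?P<value>-?\d+(?:\.\d+)?)\s*$")


def review_report(lines: List[str], snapshot: Dict[str, float],
                  tolerance: float = 0.0) -> Tuple[List[str], List[str]]:
    """Return (lines safe to send, lines stripped by the review)."""
    kept: List[str] = []
    dropped: List[str] = []
    for line in lines:
        if PLACEHOLDER.search(line):
            dropped.append(line)  # leftover template placeholder
            continue
        match = FIGURE.match(line)
        if match:
            live = snapshot.get(match["key"])
            value = float(match["value"])
            # Assumption: a figure we cannot verify is treated like a wrong one.
            if live is None or abs(value - live) > tolerance:
                dropped.append(line)
                continue
        kept.append(line)
    return kept, dropped
```

For example, with a snapshot of {"orders": 42, "revenue": 1250.5}, the line "revenue: 0" is stripped while "orders: 42" is kept. The stripped lines are worth logging, because a figure that keeps getting removed usually points at a stale template or a broken upstream job.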

The result

Live products probed 3x daily, unattended, with hard failures gating instead of slipping through.
A payment-breaking regression class is now caught automatically on every run.
A batch of silent failures -- 19 dead tasks, a permanently-zero revenue report, a silently-crashing job -- found and fixed in a single health sweep.
Outgoing reports are accuracy-checked against live data before delivery, so the decision-maker is not fed confident-but-wrong numbers.

Stack

A scheduled health-check runner (3x daily) - live HTTP probes - a payment-config regression
assertion - a deployment secret-leak scanner - an LLM self-review pass against a live snapshot -
chat delivery - a task scheduler with a server-side cron mirror.

The takeaway

Automation without monitoring is a liability -- it fails silently and confidently. The two ideas
here are the ones teams skip: a regression guard on the thing that actually makes you money
(checkout), and a report that fact-checks itself against live state before it reaches a
decision-maker. That is the difference between "we have automation" and "we can trust our
automation."

We build automation systems like this for businesses drowning in repetitive busywork --
content, reporting, customer replies, lead follow-up. If a daily task is eating your team's
hours, it is usually a one-time build away from running itself.

DEV Community