My home server sends me a stream of scheduled reports over Telegram every morning — system health, data pulls, monitoring summaries, a wake-up alarm. Around forty-seven cron jobs, delivered through two separate messaging agents, deliberately independent so that one agent failing can't silence the mornings.
Today the mornings went silent anyway. Both agents. Nothing at 3 AM, nothing at 7, nothing at 8:30. My first assumption was the agents — they're the newest, most complicated part of the stack, and when two things fail at once you suspect the thing they share. I was right about the reasoning and wrong about the layer.
Neither agent was broken
Both gateway services were green — one had been running uninterrupted for a day, the other for longer. But the job telemetry told the real story: 695 failed job executions since 01:03, and every single one carried the same last line:
Temporary failure in name resolution
The machine had lost the internet. Not Wi-Fi flapping, not a DNS misconfiguration: even direct queries to 1.1.1.1 got nothing, and the LAN link never dropped. The router's upstream was simply dead, from 01:03 to 15:43. Fourteen hours and forty minutes, spanning every scheduled send of the day.
So the redundancy I'd built was real — two agents, two codepaths, two processes — and also cosmetic. Both paths terminate at the same wall socket. I had redundancy at the application layer and a single point of failure one layer down, which is the kind of thing that's obvious the moment it bites and invisible every day before that.
Redundancy that shares a dependency is a single point of failure with extra steps. Worth auditing for: my two "independent" delivery agents shared the WAN, the DNS resolver, and the power strip.
The network came back. Nothing else did.
Here's the part that actually earned this post.
At 15:43 the internet returned. The every-minute and every-15-minute jobs recovered on their own — their next tick came, succeeded, done. By evening they were all green without anyone touching anything.
The daily jobs did not come back. The morning report, the data pulls, the summaries — each of them runs once a day, each had fired exactly once into a dead network, and each would not try again until tomorrow. Cron kept perfect time through the whole outage. It just doesn't keep score.
That's the property I'd never had to think about before: cron remembers nothing. A missed tick doesn't queue, doesn't retry, doesn't log a debt. It evaporates. Cron is a metronome, not a ledger, and every recovery guide you'll read is about getting the daemon running again, not about the runs that vanished while it was already running fine.
So recovery was manual: work out which of the day's jobs had failed inside the window, rerun seven of them by hand — in dependency order, with their exact original command strings so the monitoring would attribute the reruns to the right jobs — and skip the 3 AM alarm, because an alarm at 7 PM isn't late, it's wrong.
Doing that once is fine. Planning to do it every time the ISP hiccups is choosing to be the retry mechanism yourself.
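The triage step (which jobs failed inside the window and never recovered) is mechanical enough to sketch. This assumes the job telemetry is a file of JSON lines carrying `script`, `ts_start` (epoch seconds) and `exit` fields, which are the field names used later in this post. The function name and the sample data are made up for illustration; it is not a record of what was actually run that evening.

```python
import json
from typing import Iterable, List


def unrecovered_failures(lines: Iterable[str],
                         window_start: float,
                         window_end: float) -> List[str]:
    """Scripts that failed inside the window and have not succeeded since it began."""
    failed, recovered = set(), set()
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        if event["exit"] == 0:
            if event["ts_start"] >= window_start:
                recovered.add(event["script"])
        elif window_start <= event["ts_start"] <= window_end:
            failed.add(event["script"])
    return sorted(failed - recovered)


# Made-up sample: the interval job recovers on a later tick, the daily job does not.
sample = [
    '{"script": "heartbeat.sh", "ts_start": 100, "exit": 1}',
    '{"script": "morning_report.py", "ts_start": 200, "exit": 1}',
    '{"script": "heartbeat.sh", "ts_start": 900, "exit": 0}',
]
print(unrecovered_failures(sample, window_start=60, window_end=800))
# ['morning_report.py']
```

Interval jobs drop out of the result on their own because their first post-outage tick counts as a recovery; what remains is the list of daily jobs still owed a run.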
The fix: give cron a memory
The fix is a small reconciler that runs every 15 minutes and closes the loop cron leaves open. The idea is stolen from control loops everywhere: compare desired state (what the crontab says should have run by now today) against actual state (what the job telemetry says actually succeeded), and converge.
Every job here already runs through a telemetry wrapper that appends one JSON line per execution — script, start, end, exit code. That log, it turns out, is the missing ledger. The crontab is the promise; the event log is the record; the reconciler is just the diff:
for sched, script, cmd in sorted(due_today()):
    runs = [e for e in events_today
            if e["script"] == script and e["ts_start"] >= sched - 60]
    if any(e["exit"] == 0 for e in runs):
        continue  # promise kept
    rerun(cmd)    # promise broken: run the verbatim crontab line, once
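The event log that loop reads from can come from a very small wrapper. A minimal sketch, assuming one JSON object per line with the `script`, `ts_start` and `exit` keys used above. The `ts_end` key, the function name, the log path argument and the use of epoch seconds are assumptions for illustration, not the wrapper actually running on this server.

```python
import json
import subprocess
import time
from pathlib import Path
from typing import List


def run_with_telemetry(script: str, argv: List[str], log_path: Path) -> int:
    """Run argv, then append one JSON line describing the execution."""
    ts_start = time.time()
    try:
        exit_code = subprocess.run(argv, check=False).returncode
    except OSError:
        exit_code = 127  # assumed convention: the command could not be started
    event = {
        "script": script,       # key the reconciler matches on
        "ts_start": ts_start,   # epoch seconds (assumption)
        "ts_end": time.time(),  # hypothetical field name
        "exit": exit_code,
    }
    # Append-only: one line per execution, successes and failures alike.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return exit_code
```

Recording failures as faithfully as successes is what makes the diff possible: a job that ran into a dead network leaves a line with a non-zero `exit`, and a job that never ran leaves no line at all. Both count as a promise not kept.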
The design decisions matter more than the loop:
- Gate on the network, quietly. The first check is a TCP connect to the delivery endpoint. If it fails, the reconciler exits 0 and says nothing — "network down" is a state, not an incident. It will still be here in 15 minutes.
- Only catch up jobs that run once a day or less. Anything on an interval self-heals on its next tick; rerunning it is pure noise. Recovery effort should match the job's natural cadence.
- One attempt per job per day. If the rerun also fails, the job is genuinely broken and the existing failure watchdog owns it. A reconciler without an attempt cap is a retry storm with a nicer name.
- Rerun the verbatim crontab line — wrapper, redirects, everything. The telemetry keys runs by command; a rerun that doesn't match the original string shows up as a new mystery job instead of clearing the red one.
- Denylist the jobs where the time is the payload. The wake-up alarm doesn't get caught up. Late data is still data; a late alarm is a bug.
- Reuse the schedule parser the dashboard already has. I nearly wrote a second little cron-expression parser for this and stopped: two parsers means two opinions about what "should have run" means, drifting independently. This was "every copy of a fact is a liability" again, wearing a parser costume: the schedule semantics are a fact, and I almost gave them a second copy.
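Several of the decisions above reduce to a few lines each. A hedged sketch of the network gate and the catch-up filter: `network_up` uses a plain TCP connect, and `should_catch_up` applies the cadence rule, the denylist and the one-attempt cap. The `Job` fields, `runs_per_day` and the in-memory `attempted_today` set are hypothetical names. A real reconciler would take the cadence from the shared schedule parser and persist the attempt record between runs.

```python
import socket
from dataclasses import dataclass
from typing import Set


def network_up(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False


@dataclass(frozen=True)
class Job:
    script: str
    cmd: str             # verbatim crontab command line
    runs_per_day: float  # hypothetical: derived by the shared schedule parser


def should_catch_up(job: Job, denylist: Set[str], attempted_today: Set[str]) -> bool:
    """Apply the catch-up policy to one job whose scheduled run did not succeed."""
    if job.runs_per_day > 1:
        return False  # interval job: its next tick is the retry
    if job.script in denylist:
        return False  # the time is the payload (e.g. an alarm)
    if job.cmd in attempted_today:
        return False  # one attempt per job per day
    return True
```

In use, the reconciler would call `network_up` first and exit 0 without output when it returns `False`, then rerun only the jobs for which `should_catch_up` is true, adding each `cmd` to the attempt record before running it so that a crash mid-rerun cannot turn into a second attempt.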
One accepted gap, named honestly in the code: the reconciler trusts exit codes. A job that exits 0 but fails to deliver its message is invisible to it. The exit code is the contract; scripts that lie about success are a different bug with a different fix.
What I'd carry to any scheduled system
- Cron remembers nothing. If a run matters, something other than cron has to remember that it should have happened. A schedule is a promise, not a record.
- Audit redundancy for shared dependencies. Two of anything that terminate in the same cable is one of that thing.
- Per-execution telemetry turns out to be the ledger. I added the wrapper months ago for a dashboard. It ended up being the thing that made automatic recovery possible — you can't reconcile against a record you never kept.
- Recovery must distinguish "late" from "wrong." Some payloads age gracefully; some are only valid at their scheduled moment. Encode that, or your catch-up mechanism becomes a noise generator.
Total build: about 150 lines, one evening, on top of telemetry that already existed. The next 14-hour outage ends with one Telegram message — network back, reran six missed jobs — instead of a silent morning and an evening of forensics.