Ken Imoto

Posted on Jun 3 • Originally published at kenimoto.dev

I Cron-Scheduled 7 AI Agents. 2 Silently Failed for 18 Days. Tracing Wouldn't Have Caught It.

#claudecode #ai #devops #observability

I had seven AI agents on cron. Two of them stopped running on day one. I noticed on day eighteen.

That sentence is the whole article, and it is also the kind of sentence I would have argued against if someone else had said it on a podcast. "Surely you would notice. You have tracing. You have dashboards. You have a Telegram channel that lights up every time anything moves." Yes. I had all of those. The two dead agents still slipped under all of them, because every monitoring layer I had was built to watch processes that ran. Two of mine were not running.

This is the eighteen-day log: what the seven agents were, how two of them broke silently on day one, why my tracing was the wrong shape for the failure, and the small exit-code contract I now bolt onto every scheduled CLI agent.

The seven agents and a setup that looked fine

I run two content domains and one self-evolving harness on the same box. Each domain has three agents on a daily cron at 09:00 (observer, strategist, marketer), plus a shared evolver that runs on Saturdays. That is seven processes. The cron lines all looked roughly like this:

0 9 * * * /home/me/repos/harness-ops/scripts/marketer-A.sh >/dev/null 2>&1
0 9 * * * /home/me/repos/harness-ops/scripts/marketer-B.sh >/dev/null 2>&1

Each shell script wraps claude -p "..." with a prompt, captures the output, writes a daily log file, and ends. A real article gets pushed at the end if the agent decides to publish. I have a Telegram webhook that fires from inside the script on success and on set -e failure paths. I had been running this setup for about two months before the silent failures.

The thing I missed at setup time was three lines below the heredoc. The two marketer scripts referenced a Python helper that lived in a sibling repo. I had cd'd into that sibling repo at some point and tested everything by hand. The script worked. I checked it in. Then I tidied the sibling repo, renamed the helper module, and the import line in my marketer script started to point at nothing.

You can already see what happens. python3 helper.py ... exits with code 1 immediately on ModuleNotFoundError. The shell script's first line is set -euo pipefail, so the script dies in the first ten lines. Telegram is wired up later, after the Python call, so the script never reaches the Telegram block. The >/dev/null 2>&1 redirect swallows the stderr. Cron is configured without MAILTO=. Two agents die quietly every morning. The other five publish like normal. The system looks healthy.

What tracing was watching, and what it was not

I want to be precise here, because I spent a few hours on day eighteen convincing myself that better tracing would have caught it. It would not have.

I had OTEL spans coming out of every claude -p invocation. They went into a self-hosted collector and out to a small dashboard showing tokens per task, tool-call latency, retry rate, and daily total agent runs. On the morning of day eighteen the dashboard showed five agent runs per day, every day, for the last eighteen days. The line was perfectly flat. The line was supposed to be at seven.

Tracing instruments processes that execute. It can show you a slow call. It can show you a failed call. It can show you a retry storm. It cannot show you a process that was never spawned. The two dead marketers had no spans because the only span emitter was inside the very Python helper that was failing to import. From the dashboard's perspective, those two agents simply did not exist that day. And the next day. And the next.

This is the exact failure mode the healthchecks.io docs describe as a dead man's switch: a critical job stalls without tripping any alarm, and the silent failure persists for days before someone notices missing data. The 2026 write-ups on cron monitoring put it bluntly. Silent failures are the most expensive kind, because nothing crashes and nobody gets paged. I had read that page before. I just had not applied it to my own cron, because I had Telegram alerts and felt covered. Telegram only fires from code paths the script actually reaches.

The exit-code contract I bolted on

The fix did not involve more observability. It involved less trust in the agents to report themselves, and more trust in the cron wrapper to report on their behalf. I gave every scheduled agent a small contract.

Define exit codes that mean something. Not just 0 = good, anything else = bad. I cribbed loosely from sysexits.h: 0 is "agent ran and finished its task," 64 is "config or env error" (the ModuleNotFoundError case), 65 is "task ran but produced no usable output," 78 is "agent skipped on purpose" (the marketer decided there was nothing to publish today).
Make the cron wrapper own the reporting. The agent script's job is to exit with the right code. The wrapper's job is to pick up that code and push it somewhere durable, whether the agent itself succeeded or failed.
Add a heartbeat that fires on success. Not on failure. Silence has to be the alarm. This is the dead man's switch the monitoring tools are built around: the job must actively check in to prove it ran, and a missing ping is what pages you.

The cron wrapper now looks roughly like this:

#!/usr/bin/env bash
# scripts/cron-wrap.sh <agent-name>
set -uo pipefail
AGENT="$1"
SCRIPT="$HOME/repos/harness-ops/scripts/${AGENT}.sh"
HC_URL="https://hc-ping.com/<uuid-${AGENT}>"

START=$(date -Iseconds)
bash "$SCRIPT"
RC=$?
END=$(date -Iseconds)

# log every run, success or not
echo "${START} ${AGENT} rc=${RC} end=${END}" >> "$HOME/logs/cron-runs.log"

# ping the heartbeat URL with the exit code embedded in the path
# missing ping for 24h -> healthchecks.io pages me
curl -fsS --retry 3 "${HC_URL}/${RC}" >/dev/null || true

# escalate non-zero immediately, but never let cron itself fail
if [[ "$RC" -ne 0 && "$RC" -ne 78 ]]; then
  "$HOME/bin/tg-notify.sh" "agent=${AGENT} rc=${RC} see ~/logs/cron-runs.log"
fi
exit 0

Three things in there took me a couple of evenings to get right.

First, set -uo pipefail instead of set -euo pipefail. I do not want the wrapper to exit on the agent script's failure. If it exits before the ping, the heartbeat service will eventually page me, but the page arrives 24 hours late and the log line never gets written. The wrapper has to keep running and capture the exit code itself.

Second, the ping URL carries the exit code in the path. healthchecks.io accepts that and exposes it in the dashboard as the last reported code, so I can glance at a list and see "agent ran, exited 64" without opening the log file. Cronitor does the same with a slightly different URL shape; pick whichever fits your existing tooling.

Third, 78 is treated as a deliberate skip, not a failure. The marketer's "nothing worth publishing today" path returns 78. Without that, the failure escalation would fire on legitimately quiet days and I would learn to ignore the channel, which is how monitoring dies in practice.

What it caught the day I deployed it

I rolled this out on what was day eighteen for the silent marketers. Within ten minutes, both marketer-A and marketer-B showed up in the dashboard with last-reported exit code 64, the config error from the module that did not exist anymore. I had not opened the agent code yet. I just looked at the dashboard.

Within an hour I had renamed the import, run both scripts manually to verify exit 0, and the next morning's cron published the two articles those agents had been quietly skipping for two and a half weeks. The tracing dashboard finally went up to seven runs per day. The line is still flat, but it is flat at the right number now.

The day after, a different agent (observer-B, fine throughout the silent failure period) started exiting 65, "no usable output." The dashboard caught it inside twenty minutes. That is the kind of thing the contract is for: when the agent ran but produced garbage, you find out the same day, not the same fortnight.

What I would tell past-me

The version of me who set this cron up two months ago was not careless. He had Telegram alerts, a tracing dashboard, and a daily log file. He had read the Twelve-Factor App chapter on disposability. He had even thought about the difference between "agent failed" and "agent did not run," and decided the latter was unlikely enough to ignore.

The mistake was assuming "did not run" was a rare edge case. In a setup with seven scheduled processes, three Python helpers, two repos that move independently, and a long-running script that wires Telegram in the middle rather than at both ends, "did not run" is the single most likely silent failure mode. It is not even close. (I wrote up the surrounding scripts in a separate post on nine bugs in my AI pipeline; silent-cron was number seven on that list.)

So three things, in order of how cheap they are to set up.

MAILTO= is free. Set it, and cron itself emails you the stderr of any failing job, even the ones that die before your alerting code runs. That alone would have caught my failure the same morning it happened.
Wrap every scheduled agent in a script you own. Not the agent itself, but a wrapper around the agent with one job: capture the exit code and ping somewhere. The wrapper is allowed to be uglier than the agent because it should never change.
The success heartbeat is what makes silence loud. Failure alerts are everywhere, and they tell you nothing about agents that never executed. A heartbeat that fires on success, plus a dead man's switch that pages when the heartbeat goes missing, turns "two agents went quiet" from an eighteen-day discovery into a one-day discovery.

Tracing and observability are how you watch processes that are alive. An exit-code contract is how you remember they were supposed to be alive at all. The two complement each other, and the cron-based "set it and forget it" pattern collapses without the second one. Mine did, for eighteen days, quietly, on a server I checked every morning.

I checked the dashboards. The dashboards were just looking at the wrong question.

If you want to go deeper than a single blog post, I expanded the lifecycle-and-hooks layer of this story into a chapter of my book on AI agent harnessing: Harness Engineering Guide: From Tools to Compounding Productivity.

Top comments (2)

xulingfeng • Jun 4

We hit this exact pattern with our Hermes agent cron fleet. Ended up building a separate heartbeat checker that scrapes the log dir and flags any agent that hasnt written in 24h. Catches the never-started case that tracing misses by definition. Do you keep the >/dev/null 2>&1 redirect now or did you rip it out entirely?

Ken Imoto • Jun 13

Same failure mode, almost word for word. The log-dir scraper is the right call. "Hasn't written in 24h" is the only signal that survives a process that never spawns.

On the redirect: I kept it but repointed it. stdout/stderr go to a per-run log file now instead of /dev/null. The redirect was never the bug; throwing the bytes away was. The wrapper reads the exit code and tails that log on non-zero, so the ModuleNotFoundError lands in the alert instead of vanishing.

One gotcha with log-dir scraping: an agent that writes a partial log then dies still reads as "alive" to the heartbeat. I gate the daily ping on a success marker the agent writes last, not on any write at all.