JessYT

Posted on May 25 • Originally published at jessinvestment.com

One bug cascaded into three failures across my LLM agent fleet. Here are the four guardrails I added.

#ai #python #automation #devops

I run a small fleet of LLM agents on a Mac mini at home — about 44 scheduled jobs that draft, review, and publish blog content without me in the loop. Most of them call the Claude CLI, drive a headless browser with Playwright, and post to a blog.

For months it just worked. Then one night a single bug took down three different parts of the system at once, and I learned that "it just works" is the most dangerous state an autonomous system can be in.

This is what broke, and the four guardrails I added so it can't break the same way again.

The cascade

On May 9 one of my publish jobs failed. Then it failed again. By the time I looked, it had failed four times and my login provider had temporarily blocked the account.

The root cause was embarrassingly small. The publish script matched a blog category by running a snippet in the page context:

page.evaluate(fn, arg1, arg2)   # three arguments

In Playwright's Python API, evaluate() takes the code to run plus at most one argument. Passing more raises a TypeError when the call executes. That alone was a one-line fix. But it triggered a chain:

The Playwright misuse made the publish step throw every time.
The retry logic dutifully tried again. Four rapid login attempts in a few minutes looked like an attack, so the login provider rate-limited the account.
The Telegram bot that relays my messages to the agent saved its last_update_id after processing each message. When the agent crashed mid-message, the offset was never committed — so on restart it re-read the same message and crashed again. An infinite loop.
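For the first link in that chain, the one-line fix is to pack every value into the single argument slot. A minimal sketch, assuming Playwright's Python sync API; the function name, the JavaScript snippet, and the default selector are hypothetical and not taken from the actual publish script:

```python
# Hypothetical page snippet: returns the value of the <option> whose label matches.
MATCH_CATEGORY_JS = (
    "({ name, selector }) => {"
    "  const options = Array.from(document.querySelectorAll(selector));"
    "  const hit = options.find(o => o.textContent.trim() === name);"
    "  return hit ? hit.value : null;"
    "}"
)


def match_category(page, name, selector="select#category option"):
    """Look up a category by its label inside the page.

    `page` is assumed to be a Playwright Page. Both values travel in one
    dict, which fills evaluate()'s single argument slot; the JavaScript
    side destructures the dict back into separate names.
    """
    return page.evaluate(MATCH_CATEGORY_JS, {"name": name, "selector": selector})
```

Because the function only needs an object with an evaluate() method, it can be exercised with a stub page in a unit test, without launching a browser.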

One bug. Three distinct failure modes. None of them had a guardrail.

The fix wasn't "be more careful." It was four layers, each catching a different class of failure before it can cascade.

Layer 1 — A static guard for the exact bug class

The Playwright bug was detectable without running anything. So I wrote an AST check that scans the publish infrastructure for evaluate() calls with too many arguments, and run it on every change to those files.

import ast

def check_evaluate_arity(tree: ast.AST, filename: str) -> list[str]:
    errors = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Attribute) and node.func.attr == "evaluate":
                if len(node.args) > 2:   # fn + at most one arg
                    errors.append(
                        f"{filename}:{node.lineno} — "
                        f".evaluate() got {len(node.args)} args (max 2). "
                        f"Wrap multiple args in a dict."
                    )
    return errors

The same script also flags destructive shell strings (pkill -9, rm -rf) embedded in Python, because those should live in a wrapper script where a separate guard can see them — not hidden inside a subprocess.run.
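The shell-string check can follow the same shape as the arity check. A minimal sketch, not the actual script: the pattern table and the function name check_destructive_shell are assumptions, and it only inspects string literals, so a command split into a list such as ["rm", "-rf", path] passes unnoticed.

```python
import ast
import re

# Hypothetical pattern table: label -> regex. Extend it to whatever counts
# as destructive in your own setup.
DESTRUCTIVE = {
    "pkill -9": re.compile(r"\bpkill\s+-9\b"),
    "rm -rf": re.compile(r"\brm\s+-rf\b"),
}


def check_destructive_shell(tree: ast.AST, filename: str) -> list[str]:
    """Flag string literals that contain a destructive shell command."""
    errors = []
    for node in ast.walk(tree):
        # String literals are ast.Constant nodes; the literal parts of
        # f-strings are Constant nodes too, so ast.walk reaches them as well.
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            for label, pattern in DESTRUCTIVE.items():
                if pattern.search(node.value):
                    errors.append(
                        f"{filename}:{node.lineno}: found '{label}' in a string; "
                        f"move it into a wrapper script"
                    )
    return errors
```

Docstrings and comments-as-strings are literals too, so a docstring that mentions rm -rf will be flagged; for a narrow guard like this, that false positive is usually acceptable.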

The lesson: when a bug class is statically detectable, don't rely on catching it at runtime. A linter that knows your one specific failure is worth more than a generic test suite.
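To run checks like these on every change, a small driver is enough. A sketch under assumptions: run_checks and main are hypothetical names, each check is any callable that takes (tree, filename) and returns a list of messages, and the hook that calls the script (pre-commit, CI, a file watcher) is left out.

```python
import ast
import sys
from pathlib import Path


def run_checks(paths, checks):
    """Parse each file once and collect the messages from every check."""
    errors = []
    for path in paths:
        source = Path(path).read_text(encoding="utf-8")
        try:
            tree = ast.parse(source, filename=str(path))
        except SyntaxError as exc:
            # A file that does not parse cannot be checked, so report it too.
            errors.append(f"{path}:{exc.lineno}: syntax error: {exc.msg}")
            continue
        for check in checks:
            errors.extend(check(tree, str(path)))
    return errors


def main(argv, checks):
    """Print every finding to stderr; return 1 if anything was found, else 0."""
    errors = run_checks(argv, checks)
    for line in errors:
        print(line, file=sys.stderr)
    return 1 if errors else 0


# Typical entry point, assuming the check functions live in the same module:
#   if __name__ == "__main__":
#       sys.exit(main(sys.argv[1:], [check_evaluate_arity]))
```

The non-zero exit code is what turns the linter from a report into a gate: a hook that runs the script can block the change instead of merely printing a warning.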

Layer 2 — A rate-limit circuit breaker

The account block happened because nothing tracked how often I was hitting the login. So I added a circuit breaker: at most two login attempts per 15 minutes, then a 30-minute cooldown.

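# keep 45 min of history (window + cooldown) so the cooldown outlives the 15-min window
recent = sorted(t for t in history.get(blog, []) if now - t < WINDOW + COOLDOWN)
burst = recent[-MAX_ATTEMPTS:]                                    # the last 2 attempts
if len(burst) == MAX_ATTEMPTS and burst[-1] - burst[0] < WINDOW:  # both within 15 min
    cooldown_remaining = COOLDOWN - (now - burst[-1])             # 30 min from the last one
    if cooldown_remaining > 0:
        notify_alert(f"login rate limit: {MAX_ATTEMPTS} attempts in 15min, "
                     f"{int(cooldown_remaining / 60)}min cooldown left")
        return False   # refuse, don't attempt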

The key design choice is that the breaker refuses the action rather than just logging a warning. A warning nobody reads is not a guardrail. And it fails open on its own errors — if the breaker itself throws, it allows the attempt, because a broken safety check shouldn't take down the whole system.
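Refuse-then-fail-open can be sketched end to end as follows. This is an illustration under assumptions, not the production breaker: the function names are hypothetical, timestamps are plain seconds, and the history dict is held in memory rather than persisted. It also assumes the cooldown is counted from the most recent attempt and that attempts are remembered for window plus cooldown, so the 30-minute cooldown does not expire when the 15-minute window does.

```python
import time

WINDOW = 15 * 60        # seconds in which attempts are counted
COOLDOWN = 30 * 60      # seconds to refuse once the limit is hit
MAX_ATTEMPTS = 2


def cooldown_remaining(attempts, now):
    """Seconds until another login is allowed (0 means allowed now)."""
    recent = sorted(t for t in attempts if now - t < WINDOW + COOLDOWN)
    burst = recent[-MAX_ATTEMPTS:]
    if len(burst) == MAX_ATTEMPTS and burst[-1] - burst[0] < WINDOW:
        return max(0.0, COOLDOWN - (now - burst[-1]))
    return 0.0


def may_attempt_login(history, blog, now=None, alert=print):
    """Return True and record the attempt, or return False to refuse it."""
    now = time.time() if now is None else now
    try:
        remaining = cooldown_remaining(history.get(blog, []), now)
    except Exception as exc:
        # Fail open: a broken breaker must not block publishing.
        alert(f"login breaker error, allowing attempt: {exc!r}")
        return True
    if remaining > 0:
        alert(f"login refused for {blog}: {int(remaining / 60)}min cooldown left")
        return False
    history.setdefault(blog, []).append(now)
    return True
```

Only allowed attempts are recorded, so a refused call does not extend the cooldown. Whether a refused call should count is a policy choice; counting it would make a misbehaving caller lock itself out for longer.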

Layer 3 — Idempotency and a restart guard

The infinite loop had two causes, so it needed two fixes.

First, commit the offset before doing the work, not after:

updates = get_updates(offset=last_update_id + 1)
if updates:
    last_update_id = max(u["update_id"] for u in updates)
    save_state()          # commit BEFORE processing
for u in updates:
    process(u)            # if this crashes, the message is not re-read

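Strictly speaking this is at-most-once delivery rather than idempotency: a message that has been seen is never processed twice, even if processing it crashes. Moving one line above the work loop turned an infinite loop into a dropped message (or, if several messages arrived together, the unprocessed rest of that batch), which is a far better failure.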
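The effect of the ordering is easy to reproduce in isolation. A minimal sketch, assuming the offset is stored in a small JSON file; load_offset, save_offset, and poll_once are hypothetical names, and get_updates and process are passed in so the loop can be exercised without a real bot:

```python
import json
from pathlib import Path


def load_offset(path: Path) -> int:
    """Return the last committed update id, or 0 if nothing was saved yet."""
    if not path.exists():
        return 0
    return json.loads(path.read_text())["last_update_id"]


def save_offset(path: Path, last_update_id: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"last_update_id": last_update_id}))
    tmp.replace(path)


def poll_once(state_path: Path, get_updates, process) -> int:
    """Fetch new updates, commit the offset, then process. Returns the batch size."""
    last = load_offset(state_path)
    updates = get_updates(offset=last + 1)
    if updates:
        # Commit BEFORE processing: a crash below cannot cause a re-read.
        save_offset(state_path, max(u["update_id"] for u in updates))
    for u in updates:
        process(u)
    return len(updates)
```

With a message that always crashes the handler, the first poll_once call raises and the second returns 0: the poison message is gone instead of being replayed. A finer-grained variant would commit each update_id just before processing that update, so a crash drops one message rather than the remainder of the batch.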

Second, a restart guard. If the process restarts four times in one minute, something is wrong and restarting faster won't help. So it backs off:

history = [t for t in history if now - t < 60]
history.append(now)
if len(history) >= 4:
    notify_alert("rapid-restart detected — 1h cool-off")
    time.sleep(3600)      # stop flailing, let a human look

Crash-loops are how a small failure becomes a 2-day outage. The guard turns "restart forever" into "restart, notice, wait."
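Because each restart is a fresh process, the restart history has to live outside the process. A sketch assuming a JSON file of start timestamps; restart_guard is a hypothetical name, and the clock, sleep, and alert functions are injectable so the guard can be tested without waiting an hour:

```python
import json
import time
from pathlib import Path

RESTART_WINDOW = 60     # seconds
MAX_RESTARTS = 4        # starts inside the window that trigger the cool-off
COOL_OFF = 3600         # seconds


def restart_guard(state_path: Path, now=None, sleep=time.sleep, alert=print) -> bool:
    """Call once at startup. Returns True if a cool-off was triggered."""
    now = time.time() if now is None else now
    try:
        history = json.loads(state_path.read_text())
    except (FileNotFoundError, ValueError):
        history = []    # first start, or an unreadable file: begin fresh
    # Keep only the starts from the last minute, then record this one.
    history = [t for t in history if now - t < RESTART_WINDOW]
    history.append(now)
    state_path.write_text(json.dumps(history))
    if len(history) >= MAX_RESTARTS:
        alert("rapid-restart detected, cooling off for 1h")
        sleep(COOL_OFF)  # stop flailing, let a human look
        return True
    return False
```

The history is written before the sleep, so a supervisor that kills and restarts the sleeping process still sees the earlier starts and the guard trips again.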

Layer 4 — Fail fast, never retry

The most counterintuitive lesson: for this system, retrying is the dangerous path. The retries are what got the account blocked. So the publish step now aborts on the first failure, preserves its work, and tells me.

from pathlib import Path

def notify_publish_failure(blog, html_path, reason):
    """One failure = abort. No retry — repeated attempts cause the block."""
    send_alert(
        f"publish failed — aborting after 1 attempt (no retry)\n"
        f"blog: {blog}\nfile: {Path(html_path).name}\nreason: {reason}\n"
        f"HTML preserved. Publish manually after review, or wait for the next run."
    )

Automatic retry assumes failures are independent and transient. In a system that talks to a rate-limited third party, failures are correlated — the second attempt is more likely to fail and more likely to do harm. The draft is preserved, a human decides, and a daily job will come around again anyway.
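The calling side of a no-retry policy is short. A minimal sketch: publish_once is a hypothetical wrapper, and the publish and notify callables are passed in as stand-ins for the real publish step and a notifier such as notify_publish_failure.

```python
def publish_once(blog, html_path, publish, notify):
    """Try to publish exactly once. On any failure, alert and give up.

    Returns True on success, False on failure. The HTML file is never
    touched here, so the draft survives for manual review or the next run.
    """
    try:
        publish(blog, html_path)
    except Exception as exc:
        # No loop, no backoff: a second login attempt is what causes harm.
        notify(blog, html_path, f"{type(exc).__name__}: {exc}")
        return False
    return True
```

Returning False instead of re-raising keeps the scheduler from treating the job as crashed and restarting it, which would be a retry by another name.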

Bonus — Letting the system find its own patterns

Once a week, a job feeds the last seven days of error logs and a running incident table to an LLM and asks one question: which failures recurred at least twice? Single transient blips and "expected" failures are filtered out. Only repeating patterns get written to a draft file for me to review.

It does not edit the incident log directly — it only proposes. I keep the review gate, because an automated system rewriting its own postmortems is exactly the kind of clever idea that produces the next incident.
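The "recurred at least twice" filter can also be approximated without a model, for example as a pre-filter that shrinks what the LLM has to read. This is a plain counting sketch, not the LLM-based job described above; the (timestamp, message) entry format, the normalization rule, and the function names are all assumptions.

```python
import re
from collections import Counter
from datetime import timedelta


def normalize(message: str) -> str:
    """Collapse volatile details (numbers, hex ids) so repeats compare equal."""
    return re.sub(r"0x[0-9a-f]+|\d+", "#", message.lower()).strip()


def recurring_failures(entries, now, days=7, min_count=2, expected=()):
    """Return {signature: count} for failures seen at least min_count times.

    entries:  iterable of (datetime, message) pairs
    expected: substrings that mark a failure as known and ignorable
    """
    cutoff = now - timedelta(days=days)
    counts = Counter(
        normalize(msg)
        for ts, msg in entries
        if ts >= cutoff and not any(marker in msg for marker in expected)
    )
    return {sig: n for sig, n in counts.items() if n >= min_count}
```

Like the weekly job, this only proposes: the returned dict is meant to be written to a draft for review, not applied to the incident log.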

Guardrails are never done

A week after I shipped all this, the fleet broke again, for a completely different reason. A Homebrew Python upgrade changed the interpreter binary's code signature, macOS revoked its file-access permission, and a dozen jobs started failing with a misleading ModuleNotFoundError. And while I was fixing that, I managed to start three copies of the same bot, which then fought each other for the same message queue.

None of the four layers caught it, because it was a new class of failure. That is the actual lesson. You don't write guardrails to reach "done." You write them so that each incident can only happen once — and then you go collect the next one.

I run these agents to publish content while I work a full-time backend job. The full incident archive and guardrail code live in my notes; this is the part I think generalizes. Happy to compare notes if you operate something similar.

Original with full infographics and visual structure: https://jessinvestment.com/one-bug-cascaded-into-three-failures-across-my-llm-agent-fleet-here-are-the-four-guardrails-i-added/

Top comments (1)

JessYT • May 25

Author here. This is from running ~40 scheduled LLM agents (mostly Claude Code) for my own blog automation over the past year. The cascade was one category-handling bug → a retry storm that got my login rate-limited → a bot restart loop, all from a single change. The four guardrails I added: an AST regression check on the publish code, a login rate-limiter, a restart cool-off, and fail-fast abort (no blind retries). Happy to dig into any of them — curious how others contain retry storms in agent fleets.