Tijo Gaucher

Posted on May 18

3 Reliability Patterns That Stopped My AI Agent From Crashing Every 6 Hours

#ai #agents #automation #devops

I'm running five AI agents in production. None of them are the ambient "does your whole job" kind everyone's demoing on X. They're boring: a research agent that scrapes pricing data overnight, a cold-email agent that drafts and queues replies, a coding agent that triages GitHub issues, a screenshot/QA agent for our marketing pages, and one that just runs scheduled reports.

For the first month, every one of them died on a six-to-twelve hour cadence. Sometimes a tool call would hang forever. Sometimes the model would return a token the parser choked on. Sometimes it was just OOM. The agent would freeze, the cron-style triggers would queue up behind it, and I'd find out the next morning when nothing had run.

Three patterns took me from "babysit the process" to "haven't looked at the dashboard in a week." Here they are.

1. Treat the agent like any other long-running process: supervise it

The first instinct people have is to put their agent script in a while True: loop and call it good. Don't. The loop dies with the process — and the process dies more than you think.

I put every agent under supervisord (you can use systemd if you prefer; the point is the same). The config is maybe ten lines:

[program:research-agent]
command=/usr/bin/python /opt/agents/research.py
autostart=true
autorestart=true
startretries=10
stderr_logfile=/var/log/agents/research.err.log
stdout_logfile=/var/log/agents/research.out.log

Three things this gets you. First, restart on crash: autorestart=true restarts the process whenever it exits, whatever the exit code, so a Python interpreter that dies non-zero comes straight back. Second, logs that survive the crash, because supervisord captures stderr to a file the agent itself never opened. Third, a single command to see state: supervisorctl status tells you which agents are alive without grepping ps.

The number that mattered after I did this: my "agent uptime" went from 71% to 99.4% in a week. Nothing about the agent code changed. The whole win was running it like a real service instead of a script.

2. Persist state outside the process

Restart-on-crash is only useful if the agent doesn't lose its place when it comes back. The default for most agent frameworks is to hold the conversation history, the to-do list, and any cached tool outputs in memory — all of which vanish the moment the process dies.

Two layers of persistence cover almost every case I've hit:

Checkpoint after every tool call. Before the agent loop calls the next tool, write the current state (messages, pending tasks, partial outputs) to disk or SQLite. After a restart, the first thing the agent does is read the checkpoint and resume from the last completed step. The overhead is a few milliseconds per turn — nothing compared to the inference latency you're already paying.

Use a real queue for incoming work. If your agent runs on a schedule or responds to webhooks, don't have the trigger call the agent directly. Push the trigger onto a queue (Redis, SQS, or a database table with a claimed_at column) and have the agent pull from it. When the agent dies mid-task, the queue still has the job, and the next run picks it up.
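The "database table with a claimed_at column" option can be sketched in a few functions. This is a minimal illustration, assuming SQLite and a hypothetical jobs table; the names open_queue, enqueue, claim_next, mark_done and the 15-minute stale_after window are made up for the example, not taken from my production setup.

```python
import sqlite3
import time

# Hypothetical schema: claimed_at stays NULL until a worker takes the job.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    claimed_at REAL,
    done_at REAL
)
"""

def open_queue(path=":memory:"):
    # isolation_level=None puts the connection in autocommit mode,
    # so the explicit BEGIN/COMMIT below controls the transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn

def enqueue(conn, payload):
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))

def claim_next(conn, stale_after=900.0, now=None):
    """Claim the oldest unfinished job that is unclaimed or whose claim is stale."""
    now = time.time() if now is None else now
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, payload FROM jobs "
            "WHERE done_at IS NULL AND (claimed_at IS NULL OR claimed_at < ?) "
            "ORDER BY id LIMIT 1",
            (now - stale_after,),
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE jobs SET claimed_at = ? WHERE id = ?", (now, row[0]))
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row

def mark_done(conn, job_id, now=None):
    now = time.time() if now is None else now
    conn.execute("UPDATE jobs SET done_at = ? WHERE id = ?", (now, job_id))
```

The stale-claim check is what makes a crash survivable: a job claimed by a process that then died becomes claimable again once stale_after seconds have passed, so the window should be longer than the slowest legitimate run.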

These two together mean a crash costs you one redo of the last tool call — not a whole night of missed work.
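Here is a rough sketch of the checkpoint-and-resume half, assuming SQLite and JSON-serializable state. The checkpoints table, the function names, and the simplified "list of (name, function) tools run in order" loop are all hypothetical stand-ins for whatever your agent framework actually does.

```python
import json
import sqlite3

def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "agent TEXT PRIMARY KEY, step INTEGER NOT NULL, state TEXT NOT NULL)"
    )
    return conn

def save_checkpoint(conn, agent, step, state):
    # "with conn" commits on success and rolls back if the insert raises.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO checkpoints (agent, step, state) VALUES (?, ?, ?)",
            (agent, step, json.dumps(state)),
        )

def load_checkpoint(conn, agent):
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE agent = ?", (agent,)
    ).fetchone()
    if row is None:
        # No checkpoint yet: start from step 0 with empty state.
        return 0, {"messages": [], "pending": [], "outputs": {}}
    return row[0], json.loads(row[1])

def run_agent(conn, agent, tools):
    """Run tools in order, skipping every step that already completed."""
    step, state = load_checkpoint(conn, agent)
    for index in range(step, len(tools)):
        name, fn = tools[index]
        state["outputs"][name] = fn(state)
        # Persist only after the tool finished, so a crash redoes at most this step.
        save_checkpoint(conn, agent, index + 1, state)
    return state
```

If a tool raises or the process dies mid-call, the stored step still points at that tool, so the next run_agent call repeats only that one call. That also means tools need to be safe to run twice.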

3. Bound everything with timeouts (and don't trust the SDK defaults)

The single biggest source of "agent is alive but doing nothing" was tool calls that hung. A scraping target stops responding. A subprocess.run with shell=True never returns. The model SDK's "default" timeout turns out to be three minutes, and you didn't notice.

Wrap every tool the agent can call in a timeout. In Python:

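import signal
from contextlib import contextmanager

# Unix-only, and only usable from the main thread: SIGALRM does not exist
# on Windows, and signal.signal() raises ValueError in other threads.
@contextmanager
def hard_timeout(seconds):
    def _handler(signum, frame):
        raise TimeoutError(f"hit {seconds}s timeout")
    # Install our handler, remembering the previous one so it can be restored.
    old = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)  # whole seconds only
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old)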

Then wrap every tool call in a with hard_timeout(30): block. Pick the budget based on what the tool actually does: 5s for a curl, 30s for a scrape, 120s for a model call. The point is that something fires.
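For tools that shell out, the standard library already has a budget built in, and it is not tied to the main thread the way a SIGALRM handler is. The sketch below is a swapped-in alternative to the signal-based wrapper, not a replacement I'm claiming to run everywhere; run_tool and its result dictionary are hypothetical names chosen for the example.

```python
import subprocess
import sys

def run_tool(args, timeout_s):
    """Run a command as an argument list (no shell) with a wall-clock budget."""
    try:
        proc = subprocess.run(
            args,
            capture_output=True,  # collect stdout/stderr instead of inheriting them
            text=True,            # decode output to str
            timeout=timeout_s,    # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": f"timed out after {timeout_s}s", "stdout": ""}
    return {"ok": proc.returncode == 0, "error": proc.stderr, "stdout": proc.stdout}

if __name__ == "__main__":
    # A child that sleeps longer than its budget comes back as a failed result.
    print(run_tool([sys.executable, "-c", "import time; time.sleep(5)"], 0.5))
```

Passing an argument list instead of shell=True means the kill lands on the command itself and not on an intermediate shell, which is one less way for a hung child to outlive its timeout.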

Pair this with a circuit breaker for tools that fail repeatedly. If the same tool times out three times in a row, mark it broken for ten minutes and route the agent to a fallback (or just have it skip and log). The alternative — letting the agent retry indefinitely on a broken endpoint — burns tokens and blocks every other task in the queue.
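A breaker along those lines can be sketched in one small class. Treat it as an illustration under stated assumptions: CircuitBreaker, its method names, and the injectable clock are hypothetical, it only counts TimeoutError as a failure, and the defaults mirror the three-strikes and ten-minute numbers above.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, cooldown_s=600.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock        # injectable so the cooldown can be tested
        self.failures = 0         # consecutive timeouts seen so far
        self.opened_at = None     # None means the tool is considered healthy

    def allow(self):
        """Return True if the tool may be called right now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            # Cooldown over: reset and give the tool another chance.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def call(self, tool, fallback=None):
        """Call tool(), or return fallback if the breaker is open or it times out."""
        if not self.allow():
            return fallback
        try:
            result = tool()
        except TimeoutError:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # mark the tool broken
            return fallback
        self.failures = 0  # any success clears the streak
        return result
```

One breaker per tool keeps a dead scraping target from tripping the model call. While the breaker is open, call returns the fallback without touching the tool at all, which is where the token savings come from.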

What changed

After all three patterns: 99.4% uptime across the five agents, average time-to-recovery on a crash dropped from "next morning when I noticed" to under 30 seconds, and token spend on retries dropped by about 40%. The agents got less smart in some sense — they no longer try heroic recovery — and that's the trade. Boring agents that run forever beat clever agents that occasionally do magic.

When you stop doing this yourself

I built all this on a $20/mo VPS I administer myself. It works. But by month three I noticed I was spending more time on the supervision layer than on the agents themselves — tweaking restart policies, rotating logs, sizing the checkpoint table, debugging why one agent's port forward broke when another one OOMed.

If you're in that loop and you'd rather not be — building your agent and operating your agent are different jobs — managed hosting takes the operational layer off your plate. RapidClaw's Builder Sandbox ($99/mo) is the same MicroVM-with-sudo setup I'm describing here, with the supervisor, checkpointing, and timeouts already wired up. The Dev Agent tier adds snapshot/rollback so a bad deploy doesn't take the whole agent down.

Either way, the patterns are the patterns. Whether you operate them yourself or let someone else run the agent host, supervise the process, persist the state, bound the calls. Your agents will outlive you.

Tijo Gaucher is the founder of RapidClaw, managed hosting for AI agents.

Top comments (1)

foxck016077 • May 18

The "agent script in a while True: loop is fine" instinct is the one that hurts most because it works for the first 72 hours, which is exactly the window where you decide you don't need supervisord. By the time the cron queue is silently piling up behind a frozen process, you've already shipped four more agents to it.

Two questions from the platform-hosted side of this same problem:

1. The supervisord / systemd pattern assumes you own the host. For agents running on Apify or similar platforms, the platform handles process supervision but the failure modes shift: you get OOM on a fresh container instead of a crashed loop, and the queue drift moves from your supervisor to the platform's scheduler. Do you still recommend running a thin self-supervised layer inside the container, or trust the platform's autorestart semantics? I shipped a long-running Apify actor recently and the bounded-queue + persisted-state patterns translated cleanly, but "supervise the process" sort of dissolves into platform config.

2. On bounded outbound calls: do you wrap with asyncio.wait_for plus per-tool budgets, or run an external watchdog process that kills the agent if the heartbeat stalls past N seconds? I keep going back and forth between "fewer moving parts" and "the watchdog is the one thing that always works."