yureki_lab

Posted on Jun 15

How I Stopped Babysitting Claude Code: 5 Patterns for 24/7 AI Workers

#ai #claudecode #productivity #programming

TL;DR

I spent a month babysitting Claude Code runs — watching every prompt, every tool call, every "are you sure?" If I stepped away for an hour, things either silently stalled or did something I didn't want. Here are 5 patterns that finally got me to a place where my agent runs unattended for a full day and I'm not nervous about it. The big one is treating the agent like a flaky background job, not a chat partner.

The Problem

When I first started using Claude Code (this was on claude-sonnet-4-6, then claude-opus-4-7), I treated it like an ultra-fast pair programmer. I'd kick off a task, watch it tear through files, and step in whenever it got weird. That's fine for 20-minute sessions.

It falls apart fast when you want the agent to run for hours.

The specific failure mode I kept hitting: I'd start a long task, walk away for 30 minutes, and come back to one of three states:

Silent stall. A tool call was waiting on a permission prompt I never approved. Nothing was happening. The clock was burning.
Wrong direction. The agent had decided — totally reasonably from its point of view — to refactor a directory I didn't want touched. The diff was huge. Reverting was annoying.
Lost state. The session had been compressed or context-trimmed, and the agent had forgotten which step we were on. It cheerfully started over from a stale assumption.

None of these are bugs in Claude Code. They're consequences of treating an autonomous agent like an interactive REPL. I was the bottleneck because I was the supervisor. So I tried to stop being the supervisor.

This is the build log of how I did that. The goal: a Claude Code worker I can leave alone for a full workday, that either does the right thing or fails loudly enough that I don't waste a day of compute on garbage.

How I Solved It

Five patterns. Each one came out of a specific thing breaking. They compose.

1. Idempotent restart from a state file

The single biggest unlock. The agent should be able to re-read its own state file and pick up where it left off, without remembering anything from the previous session.

I keep a STATE.md in the repo root. It's plain markdown, three sections:

## Current goal
<one sentence>

## Next concrete action
<one specific verb-led step>

## Notes
<anything load-bearing that isn't in the code yet>

The agent reads this on every startup. Every time it finishes a meaningful step, it rewrites it. If the run dies mid-task, the next invocation reads the same file and resumes.

The key word is idempotent. The agent shouldn't care whether this is the first run, the fifth, or the recovery after a crash. The state file is the source of truth; the chat history is just working memory.

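# pseudo-runner (sketch): every pass is a fresh invocation that only trusts STATE.md
# task_done is a placeholder; here it looks for a hypothetical DONE marker line
task_done() { grep -q '^DONE$' STATE.md; }
while ! task_done; do
  # -p runs Claude Code non-interactively; stop looping if the run exits non-zero
  claude -p "Read STATE.md, do the next concrete action, then rewrite STATE.md." || break
done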

What surprised me: once the state file existed, I stopped needing to scroll the chat to remember what we were doing. I just opened STATE.md. The discipline forced clarity.
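
If the runner script itself needs to read or update the state file, the three-section layout above is simple enough to parse by hand. Here is a minimal sketch, assuming the section titles are exactly as shown and that Notes never contains its own `## ` headings. The names `State`, `parse_state`, `save_state`, and `load_state` are mine, not part of Claude Code.

```python
import os
import re
import tempfile
from dataclasses import dataclass
from pathlib import Path

SECTIONS = ("Current goal", "Next concrete action", "Notes")


@dataclass
class State:
    goal: str
    next_action: str
    notes: str = ""


def parse_state(text: str) -> State:
    """Split the markdown on '## ' headings and pick out the three sections."""
    # With a capture group, re.split returns [preamble, title1, body1, title2, body2, ...]
    parts = re.split(r"^## +(.+?)[ \t]*$", text, flags=re.MULTILINE)
    found = {title: body.strip() for title, body in zip(parts[1::2], parts[2::2])}
    missing = [s for s in SECTIONS if s not in found]
    if missing:
        raise ValueError(f"STATE.md is missing sections: {missing}")
    return State(found["Current goal"], found["Next concrete action"], found["Notes"])


def render_state(state: State) -> str:
    return (
        f"## Current goal\n{state.goal}\n\n"
        f"## Next concrete action\n{state.next_action}\n\n"
        f"## Notes\n{state.notes}\n"
    )


def save_state(path: Path, state: State) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=".state-", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(render_state(state))
    # os.replace is atomic on the same filesystem, so a crash never leaves half a file
    os.replace(tmp, path)


def load_state(path: Path) -> State:
    return parse_state(path.read_text(encoding="utf-8"))
```

The temp-file-then-replace step is what keeps the restart idempotent: a run that dies mid-write leaves the previous STATE.md intact, and loading then saving an unchanged state produces the same file.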

2. Journal every external action

The second pattern: write an append-only log of anything the agent does that touches the outside world.

2026-06-10T09:14Z run-id=a3f committed=feat: add retry to fetcher
2026-06-10T09:18Z run-id=a3f pushed=origin/main
2026-06-10T09:22Z run-id=a3f opened-pr=#412

Pure text, one event per line, timestamp + run-id + what happened. No JSON, no nested structure, no schema migrations later. I keep this in a journal.log at the repo root and grep it when I'm trying to figure out what the agent did overnight.

The boring format is the feature. When something goes wrong at 3am and I want to know which commit was the agent's last sane state, I'm not parsing JSON in my head — I'm just reading lines.

This is also how I caught the "wrong direction" failures earlier. Reading two days of journal entries is way faster than diffing the repo.
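
For reference, a small sketch of a helper that could write and filter lines in the format shown above. The function names and the regex are illustrative assumptions; the only thing taken from the log itself is the `timestamp run-id=... event=detail` shape.

```python
import re
from datetime import datetime, timezone
from pathlib import Path

# timestamp, run-id, then a single event=detail pair; detail runs to end of line
LINE_RE = re.compile(
    r"^(?P<ts>\S+) run-id=(?P<run_id>\S+) (?P<event>[\w-]+)=(?P<detail>.*)$"
)


def format_entry(run_id, event, detail, now=None):
    """Build one journal line; `now` is injectable so the output is testable."""
    now = now or datetime.now(timezone.utc)
    ts = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    # Collapse whitespace (including newlines) so one event always stays one line
    detail = " ".join(str(detail).split())
    return f"{ts} run-id={run_id} {event}={detail}"


def append_entry(path, run_id, event, detail, now=None):
    """Append only: the file is opened in 'a' mode and never rewritten."""
    line = format_entry(run_id, event, detail, now)
    with open(path, "a", encoding="utf-8") as f:
        f.write(line + "\n")
    return line


def entries_for_run(path, run_id):
    """Return parsed entries for one run, in the order they were written."""
    out = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        m = LINE_RE.match(line)
        if m and m["run_id"] == run_id:
            out.append(m.groupdict())
    return out
```

Because every line is self-describing, `grep run-id=a3f journal.log` answers the same question as `entries_for_run` without any code at all, which is the point of the format.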

3. Narrow tool grants by default, escalate by exception

I used to give my agent every tool. Read, Write, Bash, Web — all on, all the time. I figured "smart agent, smart enough not to do anything dumb."

It was dumb. Not because of the agent — because the blast radius of any single tool call was the entire filesystem and the entire internet.

Now I scope tools per task. A "review the diff" task gets Read + Grep. No Write. No Bash. A "run the test suite" task gets Bash but only with a whitelist of commands. A "post a comment to a PR" task gets the GitHub MCP server and that's it.

When the agent needs something not in scope, it stops and asks. I'd rather have it pause and request a tool than improvise around the missing one. The pause is recoverable. Improvising on a file you didn't want it to touch is not.

# Bad: agent has Bash with no constraints
# Good: agent has Bash restricted to: pytest, ruff, mypy

This is one of those things that sounds fussy until the first time it saves you from an rm -rf node_modules you didn't want.
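
One way to enforce a command whitelist like the pytest/ruff/mypy example is a thin wrapper that the agent's Bash access goes through. This is a hypothetical sketch, not Claude Code's own permission mechanism: `check_command` and `run_allowed` are names I made up, and rejecting all shell metacharacters is a deliberately blunt assumption.

```python
import shlex
import subprocess

ALLOWED = frozenset({"pytest", "ruff", "mypy"})
# Reject shell metacharacters outright so "pytest; rm -rf ." cannot sneak through
FORBIDDEN_CHARS = set(";&|`$><\n")


def check_command(command, allowed=ALLOWED):
    """Return the parsed argv if the command is allowed, else raise PermissionError."""
    if any(ch in FORBIDDEN_CHARS for ch in command):
        raise PermissionError(f"shell metacharacters are not allowed: {command!r}")
    try:
        argv = shlex.split(command)
    except ValueError as exc:  # e.g. unbalanced quotes
        raise PermissionError(f"could not parse command: {command!r}") from exc
    if not argv:
        raise PermissionError("empty command")
    # Compare the program name exactly, so "/tmp/evil/pytest" is rejected too
    if argv[0] not in allowed:
        raise PermissionError(f"{argv[0]!r} is not on the list {sorted(allowed)}")
    return argv


def run_allowed(command):
    """Run an allowed command without a shell and return the CompletedProcess."""
    argv = check_command(command)
    return subprocess.run(argv, check=False, capture_output=True, text=True)
```

A `PermissionError` here plays the role of the "stop and ask" pause: the run halts with a clear reason instead of finding a creative way around the missing tool.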

4. An explicit error budget per run

Long-running agents fail. Sometimes a tool call times out. Sometimes a flaky test fails. Sometimes a fetch returns 502. The question isn't whether — it's how the agent reacts.

The old default: retry forever, gradually descend into nonsense. By turn 30, the agent is tilting at windmills, convinced the problem is somehow with the test runner.

I now give every run a hard budget:

3 retries per tool call, then escalate
2 unrelated errors in a session, then halt and write a state-dump
No silent backoff loops

When the budget is blown, the agent writes the partial state to STATE.md, leaves a note in journal.log, and exits non-zero. The next scheduled run reads the state and either retries the same step (if I haven't intervened) or skips it (if I marked it dead).

This pattern is borrowed wholesale from how SREs run flaky services. AI agents are flaky services. Treat them like flaky services.
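
A rough sketch of how such a budget could be enforced in a wrapper around tool calls. Several things here are my assumptions rather than anything from the setup above: the class and exception names, reading "3 retries" as one initial attempt plus three retries, and treating "unrelated errors" as distinct (call name, exception type) pairs.

```python
class ToolCallFailed(Exception):
    """One tool call used up its retries; the caller should escalate."""


class BudgetExceeded(Exception):
    """Too many unrelated errors in this session; the run should halt."""


class ErrorBudget:
    def __init__(self, max_retries=3, max_unrelated_errors=2):
        self.max_retries = max_retries
        self.max_unrelated_errors = max_unrelated_errors
        self.signatures = set()  # distinct (call name, exception type) pairs seen

    def call(self, name, fn, *args, **kwargs):
        """Run fn with bounded, immediate retries (no sleeping, no backoff)."""
        last_exc = None
        for _attempt in range(1 + self.max_retries):
            try:
                return fn(*args, **kwargs)
            except BudgetExceeded:
                raise  # never retry a halt raised by a nested call
            except Exception as exc:
                last_exc = exc
                self.signatures.add((name, type(exc).__name__))
                if len(self.signatures) >= self.max_unrelated_errors:
                    raise BudgetExceeded(
                        f"{len(self.signatures)} unrelated errors: {sorted(self.signatures)}"
                    ) from exc
        raise ToolCallFailed(f"{name} failed after {self.max_retries} retries") from last_exc
```

The outer runner would catch both exceptions, write the partial state to STATE.md, append a line to journal.log, and call `sys.exit(1)` so the scheduler sees a failed run.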

5. Dry-run wrappers for anything destructive

The last pattern is the cheapest and the one I should have started with. Anything that mutates external state — git push, posting to an API, sending an email — goes through a wrapper that supports a --dry-run flag.

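URL = "https://example.com/api/posts"  # placeholder endpoint (assumption)

def real_post(payload):  # placeholder for the real network call
    raise NotImplementedError

def publish_post(payload, dry_run=False):
    if dry_run:  # log the intent and return a stub shaped like a real response
        print(f"[DRY] would POST {len(payload['body'])} chars to {URL}")
        return {"id": "dry-run", "url": "https://example.com/dry"}
    return real_post(payload)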

The agent runs with DRY_RUN=1 by default during development. I flip it off when I trust the run. This means the entire pipeline is exercised end-to-end on every change, but nothing actually ships until I unlock it.

You'd think this is obvious. It was not obvious to me until I had the agent send three identical messages to a Slack channel because of a retry loop.
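
To tie the wrapper to the DRY_RUN environment variable, the flag can be read in one place with a fail-safe default. This is a minimal sketch under my own assumptions: the helper names are hypothetical, and I chose to treat anything other than an explicit "off" value as dry-run so that a typo or an unset variable never ships anything.

```python
import os

OFF_VALUES = {"0", "false", "no", "off"}


def dry_run_enabled(env=None):
    """Dry-run is ON unless DRY_RUN is explicitly set to an 'off' value."""
    env = os.environ if env is None else env
    value = env.get("DRY_RUN", "1").strip().lower()
    return value not in OFF_VALUES


def guarded(action_name, real_action, *args, env=None, **kwargs):
    """Call real_action only when dry-run is off; otherwise log and return a stub."""
    if dry_run_enabled(env):
        print(f"[DRY] would run {action_name} args={args!r} kwargs={kwargs!r}")
        return {"dry_run": True, "action": action_name}
    return real_action(*args, **kwargs)
```

With this shape, a retry loop that misfires in development prints three `[DRY]` lines instead of sending three real messages.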

Lessons Learned

Five takeaways, written so I'd quote them at past-me.

The agent isn't the worker — the state file is. The chat session is ephemeral. Anything load-bearing has to live in a file the next invocation can read. Treat the agent as a stateless function over (STATE.md, repo).
Loud failures beat quiet ones, every single time. I used to tune my agent toward "don't crash." Wrong goal. The right goal is "crash visibly the moment the situation gets unfamiliar." A crash leaves a tombstone. A drift leaves a mess.
Tool surface = risk surface. Every tool you give the agent is a thing it can do without asking. Default to narrow. Expand by exception, with a reason. This is the same logic as Unix permissions and the same logic as IAM. It applies to agents too.
Budget the failure modes, not the work. Don't tell the agent "do this task." Tell it "do this task, but stop if you've retried 3 times or hit 2 unrelated errors." The work limit doesn't matter — the failure limit does.
Dry-run mode pays for itself in one incident. If you're shipping anything destructive, build the dry-run path first. Then make DRY_RUN=1 the default in your shell profile and forget about it.

The meta-lesson behind all five: stop thinking of the agent as a smarter you, and start thinking of it as an unsupervised intern with root. The interventions that work are the same ones you'd use on a junior engineer running their first on-call shift: a clear handoff doc, an append-only log of what they did, scoped permissions, a stop rule, and a sandbox for the scary stuff.

The agent is good at the work. It is not good at noticing it's lost. Your job is the noticing.

What's Next

Two things I'm playing with now:

Cross-run learning. Right now each run reads STATE.md and starts fresh. I want a longer-lived LESSONS.md the agent writes to when it discovers something durable about the codebase (a flaky test, an undocumented constraint). The next run reads it as context. Early experiments suggest this is a big lever — if the agent doesn't have to rediscover every quirk every session, runs get a lot cheaper.
A "second opinion" reviewer agent. Before any destructive step, a separate Claude instance with read-only access reviews the planned action and either approves or vetoes. This is overkill for small changes, but for the dangerous 5% of operations it's worth the extra tokens. I'm specifically interested in whether two independent reviewers catch more than one — the multi-perspective verify pattern in disguise.

If either of those goes somewhere interesting, I'll write it up.

Wrap-up

If you're using Claude Code and the run-time is starting to creep past "one focused session" — start with the state file pattern. It's the cheapest one on this list and it unblocks all the others. Everything else is a refinement.

If you found this useful:

Follow me on Dev.to — I'm writing up more build-in-public posts on running Claude Code in production
Try Claude Code if you haven't — it's free to get started and the agent-loop UX is genuinely different from a chat window
Drop a comment with the pattern you're using to keep your own agent runs honest. I'd love to steal it.

Versions used while writing this: Claude Code v0.x (current as of mid-2026), claude-sonnet-4-6 and claude-opus-4-7. The patterns aren't model-specific — they're about how you wrap the agent, not which model you point it at.
