I run Claude Code and Codex on long, multi-step tasks on an isolated machine, and I kept hitting the same handful of issues:
The agent reports a task as done when the tests didn't actually pass, then blames "preexisting bugs."
Context fills up and compaction makes the agent forget why it did something three steps back, which wastes tokens and creates downstream bugs.
One blocked task stalls the whole run.
I just wanted to leave my agent running without giving up control. Here's what I did about each:
Lying about tests: the build and test commands run outside the worker, so it can't claim success and skip the gate. On failure it reverts to a git checkpoint and retries with the failure context.
Compaction amnesia: each task runs in a fresh worker, so nothing drags through a long compaction cycle. A worker can still inspect prior work when it needs to.
Blocked tasks: the plan is a DAG, so one blocked task doesn't stop everything. The run keeps working on tasks that aren't downstream of the block and asks me a focused question in Telegram.
Staying in control: Claude Code drafts the plan, Codex reviews it, and I approve it before anything runs. There's a git checkpoint before each task, and the whole execution trail is on disk: plans, prompts, stdout/stderr, attempts, checkpoints, lessons.
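To make the gate and the DAG concrete, here is a minimal Python sketch of how those pieces could fit together. It is not the tool's actual code: `run_gate`, `run_plan`, `git_checkpoint`, `git_revert`, and the `max_attempts` limit are hypothetical names I'm using for illustration, and treating a task that runs out of attempts as "blocked" is an assumption. The point it shows is that the orchestrator, not the worker, runs the test command and reads the exit code.

```python
import subprocess
from typing import Callable


def run_gate(cmd: list[str]) -> tuple[bool, str]:
    """Run the build/test command in the orchestrator, outside the worker."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    # Only the exit code decides success; the worker's own report is ignored.
    return proc.returncode == 0, proc.stdout + proc.stderr


def git_checkpoint() -> str:
    """Record the current commit so a failed attempt can be undone."""
    out = subprocess.run(["git", "rev-parse", "HEAD"],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()


def git_revert(ref: str) -> None:
    """Discard tracked changes made since the checkpoint."""
    subprocess.run(["git", "reset", "--hard", ref], check=True)


def run_plan(
    deps: dict[str, set[str]],            # task -> tasks it depends on
    worker: Callable[[str, str], None],   # (task, failure context from last attempt)
    gate: Callable[[], tuple[bool, str]],
    checkpoint: Callable[[], str] = git_checkpoint,
    revert: Callable[[str], None] = git_revert,
    max_attempts: int = 3,                # hypothetical retry limit
) -> tuple[set[str], set[str]]:
    done: set[str] = set()
    blocked: set[str] = set()
    while True:
        # A task is ready once all of its dependencies have passed the gate.
        ready = [t for t in deps
                 if t not in done and t not in blocked and deps[t] <= done]
        if not ready:
            return done, blocked
        for task in ready:
            failure = ""
            for _ in range(max_attempts):
                ref = checkpoint()
                worker(task, failure)     # fresh worker per attempt
                ok, output = gate()
                if ok:
                    done.add(task)
                    break
                revert(ref)
                failure = output          # feed the failure into the retry
            else:
                # Assumption: this is where a question would go to a human.
                blocked.add(task)
```

Anything downstream of a blocked task never becomes ready, while unrelated branches of the plan keep running. In a real run, `gate` would be something like `lambda: run_gate(["pytest", "-q"])`, with the command swapped for whatever the project uses.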
I packaged this into an open-source tool. Here's the repo: https://github.com/smithersbot/smithersbot.
I'm curious how others here handle the "agent is a bad witness of its own work" problem. Putting the test gate outside the worker is the only thing that reliably worked for me. What are you doing for that?