We use Claude Code every day. It's excellent. It handles complex refactors, writes tests, navigates large codebases, and catches bugs we'd miss. But after months of running it as part of an autonomous multi-agent system, we noticed something: Claude Code is a powerful tool, but it's fundamentally passive. It waits for instructions, executes them, and stops. It doesn't monitor itself, plan ahead, review its own output, or remember what went wrong last time.
That's not a criticism — it's a design choice. Claude Code is built to be a coding assistant, not an autonomous engineering agent. The question we kept asking was: what would it take to turn it into one?
The answer became what we call the V2 architecture: a three-layer system that wraps Claude Code with monitoring, planning, and review. This post describes what we built and why each layer exists.
## The passive tool problem
When you run Claude Code directly, the interaction model is: you give it a task, it works on it, it finishes (or gets stuck). There's no process watching whether it's still making progress, no structured plan it's executing against, and no second opinion on whether the output is actually correct.
In practice, this creates three failure modes we hit repeatedly:
Stuck loops. Claude Code sometimes gets into states where it's re-trying the same failing approach. Without external monitoring, the session just keeps running until you notice something is wrong — which, if you're running it autonomously overnight, might be hours later.
No upfront plan. For tasks with multiple steps or dependencies, jumping straight into code before having a clear implementation plan often leads to mid-task pivots that are expensive to recover from. The natural thing for a human engineer is to sketch the approach first. Claude Code doesn't do this by default.
No cross-review. A model reviewing its own output has blind spots — the same reasoning that produced a bug often produces a rationale for why the bug is fine. A second model with a different training distribution catches different things.
Each of these is solvable on its own. Together, the solutions became the V2 architecture.
## V2 architecture overview
The system has three layers that operate around every Claude Code session:
```
Layer 1: Heartbeat (watchdog)
  └─ tmux session monitor
  └─ detects stuck/crashed → auto-recover

Layer 2: Skill-Driven Dev (planning)
  └─ SKILL.md written before code
  └─ implementation blueprint

Layer 3: Dual Review (verification)
  └─ Claude Code self-review
  └─ Gemini CLI cross-review
```
Claude Code still does the actual coding. The layers don't replace it — they wrap it.
## Layer 1: Heartbeat
Claude Code runs in a tmux session. The Heartbeat is a watchdog process that polls that session every 30 seconds and inspects the terminal output. It's looking for one thing: whether Claude Code's prompt is visible, which indicates it has finished and is waiting for input.
```bash
SOCKET="${TMPDIR:-/tmp}/openclaw-tmux-sockets/openclaw.sock"
# Capture the last 5 lines of the Claude Code pane
LAST5=$(tmux -S "$SOCKET" capture-pane -p -J -t claude-code:0.0 -S -5)
# Prompt visible → session is idle
echo "$LAST5" | grep -q "❯" && echo "DONE" || echo "RUNNING"
```
The `❯` is Claude Code's input prompt. If it's visible, the session is idle. If it stays hidden for longer than the timeout threshold, something is wrong.
When the Heartbeat detects a stuck session, it has three recovery strategies in order of escalation: send a gentle interrupt, close and restart the session with the same task context, or page the orchestrator (Orange) for human-in-the-loop intervention. Most stuck sessions resolve at step one.
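The core of the watchdog can be sketched as a single poll iteration. This is illustrative, not the production code: `poll_once`, `STUCK_THRESHOLD`, and the escalation labels are assumed names, and the pane text is passed in as an argument (in production it would come from `tmux capture-pane`) so the logic is easy to follow in isolation:

```bash
#!/usr/bin/env bash
# Illustrative watchdog sketch. One iteration of the 30-second poll loop:
# classify the captured pane text as DONE, RUNNING, or STUCK.
STUCK_THRESHOLD=600          # seconds without a visible prompt before escalating (assumed value)
last_seen=$(date +%s)        # last time the ❯ prompt was seen

poll_once() {
  local pane_text="$1" now
  now=$(date +%s)
  if printf '%s\n' "$pane_text" | grep -q "❯"; then
    last_seen=$now
    echo "DONE"              # prompt visible: session is idle
  elif (( now - last_seen > STUCK_THRESHOLD )); then
    echo "STUCK"             # escalate: interrupt → restart → page orchestrator
  else
    echo "RUNNING"           # no prompt yet, but still within the threshold
  fi
}
```

The real Heartbeat would call something like `poll_once` every 30 seconds with freshly captured pane output and trigger the escalation ladder on `STUCK`.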
This sounds simple, and it is. But without it, autonomous coding sessions are brittle. Claude Code gets stuck on network errors, permission issues, or loops where it convinces itself it's making progress when it isn't. The Heartbeat converts these from silent failures into handled exceptions.
## Layer 2: Skill-driven development
Before any non-trivial task goes to Claude Code, the orchestrator writes a SKILL.md file. This is a structured implementation plan — the equivalent of a design doc — that Claude Code then executes against.
The skill file structure:
```
skills/
  feature-name/
    SKILL.md   # implementation plan + steps
```
A typical SKILL.md has:
- Goal — what the task is and what done looks like
- Reference style — which existing code to model after
- Outline — the specific sections or steps, with the key technical details filled in
- Implementation steps — ordered list of what to do and in what sequence
- Privacy/safety checklist — things to verify before commit
This blog post is itself an example: it was written against its own SKILL.md.
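As a sketch, scaffolding a new skill with the sections above might look like the following. The `new_skill` helper and the template wording are illustrative assumptions, not the orchestrator's actual code:

```bash
#!/usr/bin/env bash
# Hypothetical helper that scaffolds a SKILL.md with the sections listed above.
new_skill() {
  local name="$1"
  mkdir -p "skills/$name"
  cat > "skills/$name/SKILL.md" <<'EOF'
# Goal
What the task is and what done looks like.

# Reference style
Which existing code to model after.

# Outline
The specific sections or steps, with key technical details filled in.

# Implementation steps
1. Ordered list of what to do and in what sequence.

# Privacy/safety checklist
- [ ] Things to verify before commit.
EOF
}

new_skill "example-feature"
```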
The planning step forces precision before any code is written. Ambiguous tasks get clarified at planning time, not mid-implementation. It also gives Claude Code a success criterion to check against rather than having to infer when it's done.
The other benefit is reuse. Skills accumulate over time. When a similar task comes up again, the orchestrator can search the skill library for relevant patterns and adapt an existing plan rather than starting from scratch. Over time this is how the system builds institutional knowledge about how certain types of tasks should be approached.
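At its simplest, that lookup is a keyword search over the skill library. The sketch below uses plain `grep` to show the idea; `search_skills` is an illustrative name, and the real orchestrator may use semantic search instead:

```bash
#!/usr/bin/env bash
# Illustrative: list past skill files whose plans mention a keyword,
# so an existing plan can be adapted rather than written from scratch.
search_skills() {
  grep -ril "$1" skills/ 2>/dev/null
}
```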
## Layer 3: Dual review
After Claude Code finishes and stages its changes, two reviews run before commit.
Claude Code self-review. The first review uses Claude Code itself — but in a separate session, reviewing the diff rather than the code it just wrote. This catches straightforward issues: leftover debug output, incomplete implementations, test files that test the wrong thing.
Gemini CLI cross-review. The second review pipes the staged diff to Gemini:
```bash
git diff --cached | gemini -p "Review this diff for: security issues, privacy leaks (IPs, emails, API keys), code quality. Output: PASSED or list of issues."
```
The cross-review is the more important one. A different model with a different training distribution reliably catches things Claude Code's self-review misses — particularly security issues and privacy leaks. We've had Gemini catch hardcoded test credentials, internal hostnames that shouldn't be in public code, and logic errors that Claude Code's self-review described as intentional design decisions.
The output format is strict: either PASSED or a list of issues. If there are issues, the commit is blocked and the problems are sent back to Claude Code for remediation. The loop continues until Gemini passes it.
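That gate can be sketched as a bounded retry loop. Here `run_review` stands in for the `git diff --cached | gemini -p …` call and `remediate` for handing the issue list back to Claude Code; both names and the three-round cap are assumptions, not the production pipeline:

```bash
#!/usr/bin/env bash
# Illustrative review gate: block the commit until the reviewer says PASSED,
# sending issues back for remediation between rounds, with a cap so the
# loop cannot run forever.
MAX_ROUNDS=3

review_loop() {
  local round verdict
  for (( round = 1; round <= MAX_ROUNDS; round++ )); do
    verdict=$(run_review)        # stand-in for: git diff --cached | gemini -p "..."
    if [ "$verdict" = "PASSED" ]; then
      echo "commit"              # gate open: safe to commit
      return 0
    fi
    remediate "$verdict"         # hand the issue list back to Claude Code
  done
  echo "escalate"                # still failing after MAX_ROUNDS: page a human
  return 1
}
```

The cap matters in practice: an unbounded remediation loop is just a slower version of the stuck sessions the Heartbeat exists to catch.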
## The full V2 workflow
Putting it together, the orchestrator's AGENTS.md describes a fixed sequence for every coding task:
1. `memory_search` → find relevant lessons and patterns from past work
2. Write `SKILL.md` (plan before code)
3. Launch Claude Code via tmux (with Heartbeat active)
4. Wait for Heartbeat signal: `DONE`
5. Gemini review on staged diff
6. If issues: send back to Claude Code, loop
7. Commit
8. Update `lessons/MEMORY.md` with what was learned
Steps 1 and 8 are what give the system memory. Before starting, the orchestrator searches its vector memory for lessons from similar past tasks — prior decisions, failure modes, patterns that worked. After finishing, it writes what it learned back to memory. Over time this creates a feedback loop where the system gets measurably better at certain types of tasks.
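The write-back half of that loop can be as simple as appending a dated entry to the lessons file. `log_lesson` is an illustrative helper, and per the workflow above the real system updates vector memory as well as `lessons/MEMORY.md`:

```bash
#!/usr/bin/env bash
# Illustrative: append a dated lesson entry after a task finishes,
# so the next memory_search for a similar task can find it.
log_lesson() {
  local task="$1" lesson="$2"
  mkdir -p lessons
  printf '\n## %s (%s)\n%s\n' "$task" "$(date +%F)" "$lesson" >> lessons/MEMORY.md
}

# Hypothetical example entry:
log_lesson "add-oauth-tests" "Mock the token endpoint; live calls made the suite flaky."
```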
## What this isn't
This isn't a fully autonomous engineering team. The orchestrator (Orange) still needs a human (Qiushi) to approve anything that touches production, involves financial operations, or represents a significant architectural decision. The V2 architecture automates the routine coding work; it doesn't automate judgment.
It's also not a replacement for Claude Code — it's a harness for it. The coding quality still comes from Claude Code. The architecture just ensures that quality gets checked, that sessions don't fail silently, and that the system accumulates knowledge rather than starting fresh every time.
## The principle
The pattern here is one we've found ourselves returning to: AI tools are most powerful when they're not standalone, but when they're embedded in systems that monitor them, direct them, and check their output. Claude Code alone is a strong coder. Claude Code with a Heartbeat, a planning layer, and a cross-review step is closer to a reliable engineering workflow.
The same principle applies to any capable but passive AI tool. The tool does the work. The system ensures the work is worth keeping.
This article was originally published on claw-stack.com. We're building an open-source AI agent runtime — check out the docs or GitHub.