We use Claude Code every day. It's excellent. It handles complex refactors, writes tests, navigates large codebases, and catches bugs we'd miss. But after months of running it as part of an autonomous multi-agent system, we noticed something: Claude Code is a powerful tool, but it's fundamentally passive. It waits for instructions, executes them, and stops. It doesn't monitor itself, plan ahead, review its own output, or remember what went wrong last time.
That's not a criticism — it's a design choice. Claude Code is built to be a coding assistant, not an autonomous engineering agent. The question we kept asking was: what would it take to turn it into one?
The answer became what we call the V2 architecture: a three-layer system that wraps Claude Code with monitoring, planning, and review. This post describes what we built and why each layer exists.
## The passive tool problem
When you run Claude Code directly, the interaction model is: you give it a task, it works on it, it finishes (or gets stuck). There's no process watching whether it's still making progress, no structured plan it's executing against, and no second opinion on whether the output is actually correct.
In practice, this creates three failure modes we hit repeatedly:
Stuck loops. Claude Code sometimes gets into states where it's re-trying the same failing approach. Without external monitoring, the session just keeps running until you notice something is wrong — which, if you're running it autonomously overnight, might be hours later.
No upfront plan. For tasks with multiple steps or dependencies, jumping straight into code before having a clear implementation plan often leads to mid-task pivots that are expensive to recover from. The natural thing for a human engineer is to sketch the approach first. Claude Code doesn't do this by default.
No cross-review. A model reviewing its own output has blind spots — the same reasoning that produced a bug often produces a rationale for why the bug is fine. A second model with a different training distribution catches different things.
Each of these is solvable on its own. Together, the solutions became the V2 architecture.
## V2 architecture overview
The system has three layers that operate around every Claude Code session:
```
Layer 1: Heartbeat (watchdog)
  └─ tmux session monitor
  └─ detects stuck/crashed → auto-recover

Layer 2: Skill-Driven Dev (planning)
  └─ SKILL.md written before code
  └─ implementation blueprint

Layer 3: Dual Review (verification)
  └─ Claude Code self-review
  └─ Gemini CLI cross-review
```
Claude Code still does the actual coding. The layers don't replace it — they wrap it.
## Layer 1: Heartbeat
Claude Code runs in a tmux session. The Heartbeat is a watchdog process that polls that session every 30 seconds and inspects the terminal output. It's looking for one thing: whether Claude Code's prompt is visible, which indicates it has finished and is waiting for input.
```bash
SOCKET="${TMPDIR:-/tmp}/openclaw-tmux-sockets/openclaw.sock"
# Capture the last 5 lines of the Claude Code pane
LAST5=$(tmux -S "$SOCKET" capture-pane -p -J -t claude-code:0.0 -S -5)
# Prompt visible → session is idle
echo "$LAST5" | grep -q "❯" && echo "DONE" || echo "RUNNING"
```
The `❯` is Claude Code's input prompt. If it's visible, the session is idle. If it stays hidden for longer than the timeout threshold, something is wrong.
When the Heartbeat detects a stuck session, it has three recovery strategies in order of escalation: send a gentle interrupt, close and restart the session with the same task context, or page the orchestrator (Orange) for human-in-the-loop intervention. Most stuck sessions resolve at step one.
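The core of the watchdog can be sketched as a single poll iteration. This is illustrative, not the production code: `poll_once`, `STUCK_THRESHOLD`, and the escalation labels are assumed names, and the pane text is passed in as an argument (in production it would come from `tmux capture-pane`) so the logic is easy to follow in isolation:

```bash
#!/usr/bin/env bash
# Illustrative watchdog sketch. One iteration of the 30-second poll loop:
# classify the captured pane text as DONE, RUNNING, or STUCK.
STUCK_THRESHOLD=600          # seconds without a visible prompt before escalating (assumed value)
last_seen=$(date +%s)        # last time the ❯ prompt was seen

poll_once() {
  local pane_text="$1" now
  now=$(date +%s)
  if printf '%s\n' "$pane_text" | grep -q "❯"; then
    last_seen=$now
    echo "DONE"              # prompt visible: session is idle
  elif (( now - last_seen > STUCK_THRESHOLD )); then
    echo "STUCK"             # escalate: interrupt → restart → page orchestrator
  else
    echo "RUNNING"           # no prompt yet, but still within the threshold
  fi
}
```

The real Heartbeat would call something like `poll_once` every 30 seconds with freshly captured pane output and trigger the escalation ladder on `STUCK`.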
This sounds simple, and it is. But without it, autonomous coding sessions are brittle. Claude Code gets stuck on network errors, permission issues, or loops where it convinces itself it's making progress when it isn't. The Heartbeat converts these from silent failures into handled exceptions.
## Layer 2: Skill-driven development
Before any non-trivial task goes to Claude Code, the orchestrator writes a SKILL.md file. This is a structured implementation plan — the equivalent of a design doc — that Claude Code then executes against.
The skill file structure:
```
skills/
  feature-name/
    SKILL.md   # implementation plan + steps
```
A typical SKILL.md has:
- Goal — what the task is and what done looks like
- Reference style — which existing code to model after
- Outline — the specific sections or steps, with the key technical details filled in
- Implementation steps — ordered list of what to do and in what sequence
- Privacy/safety checklist — things to verify before commit
This blog post is itself an example: it was written against its own SKILL.md.
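As a sketch, scaffolding a new skill with the sections above might look like the following. The `new_skill` helper and the template wording are illustrative assumptions, not the orchestrator's actual code:

```bash
#!/usr/bin/env bash
# Hypothetical helper that scaffolds a SKILL.md with the sections listed above.
new_skill() {
  local name="$1"
  mkdir -p "skills/$name"
  cat > "skills/$name/SKILL.md" <<'EOF'
# Goal
What the task is and what done looks like.

# Reference style
Which existing code to model after.

# Outline
The specific sections or steps, with key technical details filled in.

# Implementation steps
1. Ordered list of what to do and in what sequence.

# Privacy/safety checklist
- [ ] Things to verify before commit.
EOF
}

new_skill "example-feature"
```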
The planning step forces precision before any code is written. Ambiguous tasks get clarified at planning time, not mid-implementation. It also gives Claude Code a success criterion to check against rather than having to infer when it's done.
The other benefit is reuse. Skills accumulate over time. When a similar task comes up again, the orchestrator can search the skill library for relevant patterns and adapt an existing plan rather than starting from scratch. Over time this is how the system builds institutional knowledge about how certain types of tasks should be approached.
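At its simplest, that lookup is a keyword search over the skill library. The sketch below uses plain `grep` to show the idea; `search_skills` is an illustrative name, and the real orchestrator may use semantic search instead:

```bash
#!/usr/bin/env bash
# Illustrative: list past skill files whose plans mention a keyword,
# so an existing plan can be adapted rather than written from scratch.
search_skills() {
  grep -ril "$1" skills/ 2>/dev/null
}
```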
## Layer 3: Dual review
After Claude Code finishes and stages its changes, two reviews run before commit.
Claude Code self-review. The first review uses Claude Code itself — but in a separate session, reviewing the diff rather than the code it just wrote. This catches straightforward issues: leftover debug output, incomplete implementations, test files that test the wrong thing.
Gemini CLI cross-review. The second review pipes the staged diff to Gemini:
```bash
git diff --cached | gemini -p "Review this diff for: security issues, privacy leaks (IPs, emails, API keys), code quality. Output: PASSED or list of issues."
```
The cross-review is the more important one. A different model with a different training distribution reliably catches things Claude Code's self-review misses — particularly security issues and privacy leaks. We've had Gemini catch hardcoded test credentials, internal hostnames that shouldn't be in public code, and logic errors that Claude Code's self-review described as intentional design decisions.
The output format is strict: either PASSED or a list of issues. If there are issues, the commit is blocked and the problems are sent back to Claude Code for remediation. The loop continues until Gemini passes it.
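That gate can be sketched as a bounded retry loop. Here `run_review` stands in for the `git diff --cached | gemini -p …` call and `remediate` for handing the issue list back to Claude Code; both names and the three-round cap are assumptions, not the production pipeline:

```bash
#!/usr/bin/env bash
# Illustrative review gate: block the commit until the reviewer says PASSED,
# sending issues back for remediation between rounds, with a cap so the
# loop cannot run forever.
MAX_ROUNDS=3

review_loop() {
  local round verdict
  for (( round = 1; round <= MAX_ROUNDS; round++ )); do
    verdict=$(run_review)        # stand-in for: git diff --cached | gemini -p "..."
    if [ "$verdict" = "PASSED" ]; then
      echo "commit"              # gate open: safe to commit
      return 0
    fi
    remediate "$verdict"         # hand the issue list back to Claude Code
  done
  echo "escalate"                # still failing after MAX_ROUNDS: page a human
  return 1
}
```

The cap matters in practice: an unbounded remediation loop is just a slower version of the stuck sessions the Heartbeat exists to catch.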
## The full V2 workflow
Putting it together, the orchestrator's AGENTS.md describes a fixed sequence for every coding task:
1. `memory_search` → find relevant lessons and patterns from past work
2. Write `SKILL.md` (plan before code)
3. Launch Claude Code via tmux (with Heartbeat active)
4. Wait for Heartbeat signal: `DONE`
5. Gemini review on staged diff
6. If issues: send back to Claude Code, loop
7. Commit
8. Update `lessons/MEMORY.md` with what was learned
Steps 1 and 8 are what give the system memory. Before starting, the orchestrator searches its vector memory for lessons from similar past tasks — prior decisions, failure modes, patterns that worked. After finishing, it writes what it learned back to memory. Over time this creates a feedback loop where the system gets measurably better at certain types of tasks.
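The write-back half of that loop can be as simple as appending a dated entry to the lessons file. `log_lesson` is an illustrative helper, and per the workflow above the real system updates vector memory as well as `lessons/MEMORY.md`:

```bash
#!/usr/bin/env bash
# Illustrative: append a dated lesson entry after a task finishes,
# so the next memory_search for a similar task can find it.
log_lesson() {
  local task="$1" lesson="$2"
  mkdir -p lessons
  printf '\n## %s (%s)\n%s\n' "$task" "$(date +%F)" "$lesson" >> lessons/MEMORY.md
}

# Hypothetical example entry:
log_lesson "add-oauth-tests" "Mock the token endpoint; live calls made the suite flaky."
```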
## What this isn't
This isn't a fully autonomous engineering team. The orchestrator (Orange) still needs a human (Qiushi) to approve anything that touches production, involves financial operations, or represents a significant architectural decision. The V2 architecture automates the routine coding work; it doesn't automate judgment.
It's also not a replacement for Claude Code — it's a harness for it. The coding quality still comes from Claude Code. The architecture just ensures that quality gets checked, that sessions don't fail silently, and that the system accumulates knowledge rather than starting fresh every time.
## The principle
The pattern here is one we've found ourselves returning to: AI tools are most powerful when they're not standalone, but when they're embedded in systems that monitor them, direct them, and check their output. Claude Code alone is a strong coder. Claude Code with a Heartbeat, a planning layer, and a cross-review step is closer to a reliable engineering workflow.
The same principle applies to any capable but passive AI tool. The tool does the work. The system ensures the work is worth keeping.
This article was originally published on claw-stack.com. We're building an open-source AI agent runtime — check out the docs or GitHub.