WonderLab

Posted on May 22

Building Reliable AI Agents: Harness Engineering and Multi-Agent Architecture in Practice

#agents #harness #multiagent #claude

The Problems We Actually Ran Into

If you've built an AI-assisted analysis tool, you've probably hit these two walls:

Wall #1: Inconsistent output quality. The longer the task chain, the more the AI drifts — its language stays precise, its tone stays confident, but the conclusions don't hold up. Ask it "are you sure?" and it'll double down with even more conviction.

Wall #2: Token costs keep climbing. The more history you accumulate, the more you have to re-feed the model on every new session. Token consumption grows linearly. Analysis quality doesn't follow.

These are real problems we encountered building a CarPlay bug analysis tool. AI performance on multi-project, long-chain bug analysis was wildly inconsistent. When we introduced a multi-agent architecture to fix that, token consumption and runtime shot up instead.

Those two problems pushed us to find a systematic engineering solution.

Part 1: Harness Engineering — Putting a Leash on the Model

The Problem: Non-determinism Is an Agent's Original Sin

LLMs are inherently probabilistic. In a single-turn conversation, that randomness is what makes them creative. In a long-chain task, it's what makes them dangerous.

A typical failure scenario: you ask an Agent to complete a 10-step development task. Steps 1–7 go fine. On step 8, the Agent drifts slightly. Step 9 builds on the drift. By step 10, you receive a result that looks complete but is entirely off-target. And you almost don't notice — because the Agent wrote a very convincing summary.

When AI starts executing autonomously across many steps, the central engineering challenge becomes: how do you supervise it and course-correct before the damage is done?

The Solution: Agent = Model + Harness

In February 2026, Martin Fowler's team (author: Birgitta Böckeler) introduced the concept of Harness Engineering:

Agent = Model + Harness

Harness is everything in an Agent that isn't the model itself — prompts, tool definitions, rules, context management, validation mechanisms, feedback loops. All of it is the Harness.

This definition sounds unremarkable at first. But it carries an important shift in thinking: to improve Agent reliability, don't swap in a better model — design a better Harness.

A Harness has two components:

Guides (feedforward): Give the AI the right inputs before it acts — clear instructions, relevant context, structured task descriptions
Sensors (feedback): Validate the AI's outputs after it acts — independent validators, quality evaluators, anomaly detectors

LangChain's experiment makes this concrete: without changing the underlying model, using Harness Engineering alone, their Agent's benchmark ranking jumped from outside the top 30 to top 5.

The Goal: Fix Problems Before They Reach Human Eyes

One sentence captures what Harness Engineering is for:

To make AI Coding Agents work with less human supervision, you need to systematically build an "external control framework" — the Harness. It's composed of feedforward Guides and feedback Sensors, with the goal of automatically correcting problems before they ever reach a human reviewer.

Part 2: Two Failure Modes That Break Single Agents

With the framework established, let's look at exactly where single agents fail. Anthropic's article Harness Design for Long-Running Application Development identifies two core failure modes:

Failure Mode 1: Context Anxiety

As a task gets long and the Agent approaches its context window limit, it starts to "panic" and rush to finish — marking incomplete work as done, writing vague analyses with false certainty.

This isn't a bug. It's the model using "confident closure" as a coping mechanism for "uncertain continuation."

Fix: Don't use Compact (context compression). Use Context Reset — completely clear the context, start a new Agent with a structured handoff document, and let the new Agent take over.

Failure Mode 2: Self-Evaluation Breakdown

Ask an Agent to evaluate its own output and it becomes pathologically optimistic — it'll give itself high marks even when the work is poor, because it's evaluating using the same cognitive framework that produced the output.

It's like asking someone to take an exam and grade it themselves. Almost guaranteed to score well.

Fix: Introduce an independent Evaluator Agent, deliberately prompted to be critical and skeptical, with no shared context with the Generator.

These two discoveries directly motivate the first and most fundamental multi-agent pattern.

Part 3: Multi-Agent Architecture — Five Coordination Patterns

Some teams choose a pattern based on how sophisticated it sounds, not on whether it fits the problem at hand. Start with the simplest pattern that might work, see where it struggles, then evolve from there.

Pattern 1: Generator-Validator

Best for: Tasks where output quality is critical and evaluation criteria can be stated explicitly.

How it works: Generator produces output → Validator evaluates against explicit criteria → if rejected, Generator gets specific feedback → loop until accepted or max iterations reached.

Generator → Output → Validator ──pass──→ Done
                          │
                     fail + feedback
                          │
                          └─→ Generator (next round)

Typical use cases:

Code generation (Generator writes code, Validator writes and runs tests)
Customer support replies (Validator checks accuracy, tone, completeness)
Compliance review (Validator checks output against rules line by line)

Critical caveat: The Validator must have specific, explicit criteria — not "check if it's good." A Validator without concrete standards will just rubber-stamp the Generator's output. Also set a maximum iteration count with a fallback strategy to prevent infinite oscillation.

Pattern 2: Orchestrator-Subagent

Best for: Tasks that decompose cleanly into independent subtasks with minimal interdependencies.

How it works: Orchestrator handles global planning and task delegation. Subagents each own a specific responsibility and report results back. Orchestrator integrates and produces final output.

Orchestrator ──delegate──→ Subagent A (security check)
             ──delegate──→ Subagent B (code style)
             ──delegate──→ Subagent C (test coverage)
                   ←──── collect all results ────
                   → integrate into final report

Claude Code uses this pattern: the main Agent handles the primary workflow while dispatching subagents in the background to search codebases or investigate independent questions — keeping the Orchestrator's context focused on the main task while parallel work happens elsewhere.

Limitation: The Orchestrator is an information bottleneck. Information discovered by one subagent that's relevant to another must route through the Orchestrator. Key details get lost or over-summarized after a few hops.

Pattern 3: Agent Team

Best for: Tasks that decompose into long-running, independent subtasks where each worker benefits from accumulating domain context over time.

How it works: Coordinator distributes work via a shared queue. Multiple Workers each pick up tasks, run autonomously through multi-step work, and signal on completion. Unlike Pattern 2, Workers persist across tasks — they keep accumulating context rather than starting fresh each time.

Coordinator ──→ Task queue
Worker A ←── pick up task ──→ complete ──→ signal
Worker B ←── pick up task ──→ complete ──→ signal
Worker C ←── pick up task ──→ complete ──→ signal
                                ↓
              Coordinator collects → integration tests

Typical use case: Migrating a large codebase from one framework to another, with each Worker independently migrating one service.

Limitation: Independence is a prerequisite. Workers can't easily share intermediate findings. Careful task partitioning and conflict resolution mechanisms are required.

Pattern 4: Message Bus

Best for: Event-driven pipelines where the agent ecosystem is expected to keep growing.

How it works: Agents communicate through publish/subscribe events, decoupled from each other. A router delivers matching messages. New agents can join by subscribing to topics without modifying existing connections.

Alert sources → Triage Agent → Router
                                ├──→ Network Investigation Agent
                                ├──→ Identity Analysis Agent
                                └──→ Context Enrichment Agent
                                            ↓
                                Response Coordination Agent → Actions

Best fit when: Work is triggered by events rather than a predetermined sequence, and teams need to develop and deploy individual agents independently.

Limitation: The longer the event chain, the harder debugging becomes. A misrouted message causes silent failures — the system doesn't crash, it just doesn't process the event.

Pattern 5: Shared State

Best for: Collaborative tasks where agents need to build on each other's discoveries, without a central coordinator.

How it works: Agents run autonomously, coordinating through a shared persistent store (database, filesystem, or document). No central Orchestrator. Each agent reads what others have written, takes action based on those findings, and writes its own discoveries back.

Agent A (academic literature) ─┐
Agent B (industry reports)    ─┤──→ Shared knowledge store
Agent C (patent filings)      ─┘    ↑ agents read each other's findings
                                    → iteratively deepen research

Limitation: Without explicit coordination, agents may duplicate work. The hardest failure mode is the reactive loop: Agent A writes a finding → Agent B responds → Agent A reacts again — burning tokens indefinitely. Termination conditions must be first-class citizens: time budgets, convergence thresholds (no new findings after N cycles), or a dedicated "am I done?" judge agent.

Part 4: Solving the "Goldfish Memory" Problem

The Problem: The Context Tax

Claude Code has a fundamental limitation: it's stateless. Close the conversation window, memory resets.

Every new session, you have to re-explain everything — project architecture, past decisions, coding style preferences, bugs you've already ruled out. You're paying a context tax on every session just to get the AI back up to speed.

Worse: this repeated loading is billed by token. You're paying for compute that produces zero new value.

Solution 1: Claude Code's Native Memory (CLAUDE.md + Auto Memory)

Claude Code offers two mechanisms for carrying knowledge across sessions:

	CLAUDE.md Files	Auto Memory
Written by	You	Claude automatically
Contains	Instructions and rules	Learned patterns and preferences
Best for	Coding standards, architecture, workflows	Build commands, debugging insights, behavior preferences
Loaded each session	First 200 lines	First 200 lines

CLAUDE.md placement determines scope, from most to least specific:

.claude/CLAUDE.md (project-level, shared via version control)
~/.claude/CLAUDE.md (user-level, applies across all projects)

For larger projects, split rules into .claude/rules/ with one file per topic. Rules can also be path-scoped using YAML frontmatter — only loaded when Claude is working on matching files:

---
paths:
  - "src/api/**/*.ts"
---
# API Development Rules
- All endpoints must include input validation
- Use the standard error response format

Path-scoped rules reduce context noise — the relevant rules load when relevant, and stay out of the way otherwise.

Auto Memory stores Claude's self-generated notes at ~/.claude/projects/<project>/memory/:

memory/
├── MEMORY.md          # Index file, loaded every session
├── debugging.md       # Detailed debugging notes
├── api-conventions.md # API design decisions
└── ...

Run /init to auto-generate a starter CLAUDE.md. Claude will update Auto Memory over time based on your corrections and preferences.

Solution 2: claude-mem — The Community's Context Tax Workaround

The native solution is reactive (you correct → Claude updates). The open-source project claude-mem is more aggressive:

npx claude-mem install

Core mechanism: Attach a local memory store outside Claude Code. Hooks intercept every tool call, compress the interaction into a summary stored in SQLite, and on the next session inject only the semantically relevant history — not everything.

Data flow:

Tool call → PostToolUse Hook captures
         → Claude API call (Observer role)
         → Compresses to XML-format observation
         → Stored in SQLite + Chroma vector DB
         ↓
New session → SessionStart
         → Query last 50 observations + 10 summaries
         → Inject into Claude context
         ↓
User submits prompt → UserPromptSubmit
         → Semantic search for top 5 relevant observations
         → Precision-inject relevant history

Reported result from the project itself: 95% token reduction — 6 observations consumed 2,911 tokens to deliver work that would have taken 56,291 tokens with full context re-loading.

The Observer uses a structured prompt to produce parseable XML:

<observation>
  <type>bugfix</type>
  <title>CarPlay startup disconnect root cause identified</title>
  <narrative>Root cause was IOKit initialization timing. Fix was to...</narrative>
  <facts>
    <fact>Disconnect occurs 200ms after kIOMessageServiceIsTerminated event</fact>
    <fact>CarPlay framework begins handshake before driver init completes</fact>
  </facts>
</observation>

Part 5: Teaching the AI Your Habits — Continuous Learning

The Problem: AI Has No Muscle Memory

You've used Claude Code for three months. It still doesn't know your coding style. Every new session, it's a new employee who knows nothing about you — doesn't know you prefer functional over OOP, doesn't remember your project's quirky conventions, has no record of how you solved a similar problem last week.

The Solution: Hooks + Instinct System

This architecture comes from the claude-code-everything project and consists of two independent subsystems:

Subsystem A: Memory Persistence
  → Answers "what did I do last session?" 
  → Short-term memory, restores work state across sessions

Subsystem B: Instinct Learning (Continuous Learning)
  → Answers "what are the user's habits?"
  → Long-term learning, accumulates behavioral preferences

Both are triggered via Hooks and inject their results into Claude's context at session start.

What Are Hooks?

Hooks are event-driven triggers that fire before and after Claude Code tool calls:

User request → Claude selects tool → PreToolUse hook → Tool executes → PostToolUse hook

Hook Type	Fires When	Key Input
PreToolUse	Before tool execution	tool_name, tool_input
PostToolUse	After tool completes	tool_name, tool_input, tool_output
Stop	After each Claude response	transcript_path (full session JSONL)
SessionStart	Session begins	session_id, cwd (can inject additionalContext)
SessionEnd	Session ends	session_id

PreToolUse hooks can control whether the tool runs: exit 0 continues, exit 2 aborts and surfaces the error to Claude.

What Does an Instinct Look Like?

An Instinct is a single atomic behavioral preference stored as a YAML file:

---
id: grep-before-edit
trigger: "when modifying existing code"
confidence: 0.7
domain: workflow
scope: project
---

# Grep Before Edit

## Action
Use Grep to locate code before Edit to confirm exact location.

## Evidence
- Observed 8 times across sessions
- Pattern: Grep → Read → Edit sequence repeated consistently
- Last observed: 2026-04-16

confidence: 0.3–0.9, controls whether the instinct gets injected (threshold ≥ 0.7)
scope: project (this project only) or global (all projects)
trigger + action: what Claude actually sees when this instinct is active

The Full Learning Pipeline

Tool call
  → observe.sh (async, non-blocking)
      → append to observations.jsonl
      → increment counter, send SIGUSR1 every 20 calls

observer-loop.sh (background daemon)
  → receives SIGUSR1
  → take last 500 observations
  → spawn Claude Haiku analysis (claude --model haiku --print)
      → Haiku identifies behavioral patterns
      → writes instinct YAML by rule:
          3–5 occurrences  → confidence 0.5
          6–10 occurrences → confidence 0.7
          11+ occurrences  → confidence 0.85
  → archive analyzed observations

New session → SessionStart
  → session-start.js reads instinct YAML files
  → filter confidence ≥ 0.7, take top 6
  → inject as additionalContext:

"Active instincts:
- [project 70%] Use Grep to locate code before Edit
- [global 85%] Grep before Edit, Read before Write"

Using Haiku instead of Sonnet for analysis is a deliberate cost decision — pattern recognition doesn't need the most powerful model, and this process fires every 20 tool calls.

Putting It Together

These four mechanisms address four distinct layers of AI Agent engineering:

Layer	Problem	Solution
Stability	AI drifts off-track on long tasks	Harness Engineering (Guides + Sensors)
Reliability	Single agent failure modes, self-evaluation blindspot	Multi-agent architecture (match pattern to problem)
Continuity	Every session starts from zero	CLAUDE.md + Auto Memory + claude-mem
Growth	AI can't accumulate behavioral habits	Hooks + Instinct continuous learning

The direction of travel is clear: from "re-teach the AI every session" to "the more you use it, the more it knows you." None of these solutions are theoretical novelties. They're engineering practices that emerged from real projects hitting real walls — and finding ways through them.

This article is based on the CarPlay bug analysis tool development experience from the Connected Car team. Shared for learning and discussion.

DEV Community