DEV Community: Bruce He

Sub-Agent Architecture for AI Coding Harnesses: When to Spawn, How to Route, What It Costs

Bruce He — Sun, 12 Apr 2026 16:46:30 +0000

Originally published at my blog

Sub-agents are not a parallel speed hack. They are a context garbage collection mechanism. The point is to throw noise away, not to split thinking.

Most engineering teams reach for sub-agents the first time they hit a context window limit, or the first time a task feels "big." They fan out, parallelize, and marvel at how fast things go, then spend the next month debugging why outputs keep drifting from each other. The failure mode is predictable: work that should have stayed in the main thread got delegated, and a single decision that needed shared working memory got split across three cold-started processes that never saw each other's evidence.

This article gives you a decision framework, a concrete routing table across Opus / Sonnet / Haiku, and a cost model — so you stop spawning sub-agents by instinct and start spawning them for reasons you can name.

Three Myths That Burn Money

Myth 1: "More sub-agents means faster completion." Every spawn carries cold-start overhead: system prompt re-tokenization, CLAUDE.md reload, tool schema re-injection. If your sub-agent only does 2,000 tokens of real work, the overhead can exceed the work itself. Break-even sits at roughly 10,000 input tokens per spawn.
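
To make the break-even idea concrete, here is a minimal sketch of the arithmetic. The 3,000-token overhead and the 0.3 cutoff are placeholder assumptions picked so the cutoff lands at 10,000 work tokens; they are not measured values, and the ratio definition is illustrative rather than the article's own formula.

```python
def overhead_ratio(overhead_tokens: int, work_tokens: int) -> float:
    """Cold-start overhead expressed as a fraction of the real work."""
    if work_tokens <= 0:
        raise ValueError("work_tokens must be positive")
    return overhead_tokens / work_tokens


def worth_spawning(work_tokens: int,
                   overhead_tokens: int = 3_000,     # assumed, not measured
                   max_ratio: float = 0.3) -> bool:  # assumed cutoff
    """True when overhead is small relative to the work being delegated."""
    return overhead_ratio(overhead_tokens, work_tokens) <= max_ratio


for work in (2_000, 10_000, 50_000):
    ratio = overhead_ratio(3_000, work)
    print(f"{work:>6} work tokens: ratio {ratio:.2f}, spawn={worth_spawning(work)}")
```

Under these assumptions the 2,000-token task is rejected because the overhead is larger than the work, while the 10,000-token task sits right at the cutoff. Measure your own harness's overhead before trusting either number.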

Myth 2: "Sub-agents should always use the cheapest model." Route by decision complexity, not input volume. Haiku reading 100K tokens of logs to emit a 200-token classification is great. Haiku writing 2,000 lines of production code is malpractice.

Myth 3: "The orchestrator should be the smartest model." The most expensive mistake. Orchestration is mostly routing and state tracking — Sonnet or even Haiku handles it fine. Save Opus for the final generation step where judgment compounds.

I restructured my own blog pipeline from "Opus orchestrator + Sonnet workers" to "Sonnet orchestrator + Opus writer + Haiku searchers." End-to-end token cost dropped by ~60% and output quality measurably improved, because Opus was finally being used where it mattered.
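
Routing by decision complexity rather than input volume can be sketched as a small lookup. The Complexity levels and the pick_model helper are hypothetical names for illustration, and the strings are tier labels, not API model identifiers.

```python
from enum import Enum


class Complexity(Enum):
    LOW = "low"        # classify, extract, search, summarize
    MEDIUM = "medium"  # route work, track state
    HIGH = "high"      # final generation where judgment compounds


# Tier labels only; these are not API model identifiers.
MODEL_FOR = {
    Complexity.LOW: "haiku",
    Complexity.MEDIUM: "sonnet",
    Complexity.HIGH: "opus",
}


def pick_model(complexity: Complexity, input_tokens: int = 0) -> str:
    """Choose a tier from decision complexity; input_tokens is ignored on purpose."""
    return MODEL_FOR[complexity]


print(pick_model(Complexity.LOW, input_tokens=100_000))  # log triage
print(pick_model(Complexity.MEDIUM))                     # orchestration
print(pick_model(Complexity.HIGH, input_tokens=5_000))   # final draft
```

The input_tokens parameter is accepted and then ignored to make the point explicit: a 100K-token log read still routes to the cheapest tier when the decision itself is simple.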

The Mental Model

A sub-agent is a fresh heap. You fork an isolated process, let it churn through whatever mess it needs (reading 20 files, running grep, inspecting logs), extract a compact summary, and let the whole thing get garbage-collected. The parent agent never sees the mess.

The primary use cases all share one shape: high input, low output, stateless. Codebase search. Doc triage. Log analysis. The sub-agent's job is to summarize and throw away.

Conversely, this framing tells you when not to use a sub-agent. If the "waste" you would throw away is actually load-bearing context the main agent needs downstream, sub-agents cost you more than they save.
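
The "fresh heap" shape can be illustrated without any model at all. In this sketch, run_subagent, keep, and summarize_errors are hypothetical names, and the summarizer is a plain function standing in for a model call.

```python
from typing import Callable, Iterable, List


def run_subagent(documents: Iterable[str],
                 keep: Callable[[str], bool],
                 summarize: Callable[[List[str]], str],
                 max_summary_chars: int = 500) -> str:
    """Read a lot, return a little. Everything else is dropped on return."""
    scratch: List[str] = []  # the "fresh heap": local to this call
    for doc in documents:
        scratch.extend(line for line in doc.splitlines() if keep(line))
    return summarize(scratch)[:max_summary_chars]


def summarize_errors(lines: List[str]) -> str:
    # Stand-in for a model call; a real sub-agent would summarize with an LLM.
    first = lines[0] if lines else "none"
    return f"{len(lines)} error lines; first: {first}"


logs = ["INFO boot\nERROR disk full\nINFO retry", "ERROR timeout\nINFO done"]
summary = run_subagent(logs, lambda line: "ERROR" in line, summarize_errors)
print(summary)
```

The caller only ever receives the capped summary string. If the parent would later need the raw lines held in scratch, that is the signal from the paragraph above that the work belongs in the main thread.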

Read the full article →

The full version covers:

Three architecture patterns — Fan-Out / Gather, Scout-Then-Act, Specialist Delegation — with mermaid diagrams and concrete failure modes for each
A complete Opus / Sonnet / Haiku routing table with worked examples by task type
A sub-agent cost formula and the 0.3 overhead-threshold metric for when sub-agents actually pay for themselves
Cognition's Devin post-mortem and what it teaches about multi-agent drift
A real 60% cost reduction case study from rewiring my own writing pipeline

This is Part 3 of the Harness Engineering series. Part 1 framed the thesis (Agent = Model + Harness). Part 2 went deep on CLAUDE.md. If you found this useful, the blog has more AI engineering deep-dives.

Claude Pricing 2026: Complete Guide to Free, Pro, Max & Team Plans

Bruce He — Sat, 04 Apr 2026 03:33:13 +0000

Originally published at my blog

Anthropic's Claude has become one of the most capable AI assistants available, but its pricing structure can be confusing. With five consumer tiers, two Max sub-tiers, a Team plan with mixed seat types, and a separate API, choosing the right plan takes real research.

This guide breaks down every Claude pricing option available in April 2026, including exact costs, usage limits, and practical recommendations for different user types.

Quick Pricing Comparison

Plan	Price	Usage vs Pro	Best For
Free	$0	~0.2x	Trying Claude, occasional questions
Pro	$20/mo	1x (baseline)	Daily individual use
Max 5x	$100/mo	5x	Power users, heavy Claude Code usage
Max 20x	$200/mo	20x	Professional developers, near-unlimited
Team	$25-30/seat/mo	1x-6.25x	Teams of 5-150
Enterprise	Custom	Custom	150+ seats, compliance needs
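
One way to read the table is price per Pro-equivalent unit of usage. The sketch below uses only the Pro and Max rows, since Team and Enterprise are listed as ranges or custom; price_per_pro_unit is a hypothetical helper name.

```python
# (monthly price in USD, usage multiple relative to Pro), copied from the table
PLANS = {
    "Pro": (20, 1),
    "Max 5x": (100, 5),
    "Max 20x": (200, 20),
}


def price_per_pro_unit(price: float, multiple: float) -> float:
    """Dollars paid per 1x of Pro-level usage."""
    return price / multiple


for name, (price, multiple) in PLANS.items():
    print(f"{name:8} ${price_per_pro_unit(price, multiple):.2f} per Pro-equivalent unit")
```

On these figures Max 5x costs the same per unit as Pro, so it buys headroom rather than a discount, while Max 20x halves the per-unit price.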


The full article covers:

Detailed breakdown of each plan's features and limitations
Claude Code pricing (included plans vs API costs)
API token pricing for Opus 4.6, Sonnet 4.6, and Haiku 4.5
Cost optimization strategies for developers
How to choose the right plan based on your usage

If you found this useful, check out my blog for more AI engineering guides.

Harness Engineering: Why the System Around Your AI Agent Matters More Than the Model

Bruce He — Sat, 04 Apr 2026 01:45:58 +0000

Originally published at heyuan110.com

In 2026, the AI engineering community discovered something counterintuitive: the model is the least important part of an AI agent. What actually determines whether an agent succeeds or fails in production is everything around the model — the tools it can access, the guardrails that keep it safe, the feedback loops that help it self-correct.

This "everything around the model" now has a name: the harness. And the discipline of building it is called harness engineering.

OpenAI's Codex team used harness engineering to ship over 1 million lines of production code written entirely by AI agents. LangChain jumped from #30 to #5 on TerminalBench 2.0 by changing only their harness. A Stanford HAI study found harness-level changes improved output quality by 28-47%, while prompt refinement improved quality by less than 3%.

This guide covers:

The three evolutions: Prompt → Context → Harness Engineering
Core formula: Agent = Model + Harness
Guides (feedforward) + Sensors (feedback) framework
Real cases: OpenAI Codex, LangChain, Stripe Minions
5-level practical implementation guide


Seedance 2.0 Deep Dive: ByteDance AI Video That Tops Sora and Veo

Bruce He — Sat, 04 Apr 2026 01:45:38 +0000

Originally published at heyuan110.com

In February 2026, ByteDance released Seedance 2.0. Within weeks, it hit #1 on the Artificial Analysis text-to-video leaderboard — beating Google Veo 3, OpenAI Sora 2, and Runway Gen-4.5 in blind human evaluation.

If you are reading this from outside China, you have probably heard the buzz but face a wall of confusion: What is Dreamina? What is VolcEngine? Can you even sign up without a Chinese phone number?

This guide is written specifically for international users. It covers the technical architecture in depth (why joint audio-video generation is a real breakthrough), gives an honest assessment of what works and what does not, provides a step-by-step access guide, and explains the IP controversy.

Key findings:

Joint audio-video generation produces the most natural lip sync of any model
Multi-reference input (up to 12 files) enables director-level control
2K max resolution is a limitation vs Kling 3.0's 4K@60fps
~$0.14 per 15-second clip — 5-10x cheaper than competitors
CapCut integration gives it the largest distribution platform of any AI video model
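
The pricing claim above implies some simple arithmetic, sketched here. It assumes cost scales linearly with clip length, which the list does not state, and the function names are hypothetical.

```python
CLIP_PRICE_USD = 0.14  # quoted price per clip
CLIP_SECONDS = 15      # quoted clip length


def cost_for_seconds(seconds: float) -> float:
    """Estimated cost, assuming price scales linearly with duration."""
    return CLIP_PRICE_USD / CLIP_SECONDS * seconds


def implied_competitor_range(low: float = 5, high: float = 10) -> tuple:
    """Per-clip competitor price implied by the '5-10x cheaper' claim."""
    return CLIP_PRICE_USD * low, CLIP_PRICE_USD * high


print(f"60 seconds: about ${cost_for_seconds(60):.2f}")
low, high = implied_competitor_range()
print(f"Implied competitor price per 15-second clip: ${low:.2f} to ${high:.2f}")
```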


Cursor Composer 2: The Kimi K2.5 Controversy and What It Means

Bruce He — Sat, 04 Apr 2026 01:45:17 +0000

Originally published at heyuan110.com

On March 19, Cursor shipped Composer 2 with a triumphant blog post. Three days later, a developer found kimi-k2p5-rl-0317-s515-fast in the API config. That single string unraveled a story about transparency, open-source ethics, and the global nature of AI infrastructure.

Key findings:

Composer 2 is built on Moonshot AI's Kimi K2.5 (Chinese open-source MoE model)
Cursor's "75% of compute was ours" defense doesn't hold up
CursorBench scores (61.3) reflect home-field advantage; the Terminal-Bench gap vs Claude is only 3.7 points
At $0.50/M input tokens, Composer 2 is 30x cheaper than Opus 4.6
Most productive devs use both: Cursor for 80% daily tasks, Claude Code for 20% complex work


MCP vs Skills vs Hooks in Claude Code: Which Extension Do You Need?

Bruce He — Sat, 04 Apr 2026 01:45:04 +0000

Originally published at heyuan110.com

Claude Code has three distinct extension mechanisms: MCP (Model Context Protocol), Skills, and Hooks. They look related on the surface, but they operate at fundamentally different layers:

Hooks (bottom layer): Lifecycle event automation — "what must always happen"
MCP (middle layer): External tool connections via open protocol — "what can be done"
Skills (top layer): Reusable workflows and domain knowledge — "how to do things well"

This guide covers:

Three-layer architecture diagram
Side-by-side comparison across 8 dimensions
Same task implemented three different ways
Decision framework: when to use which
Common mistakes and how to avoid them


OpenClaw Multi-Agent Configuration: Architecture and Production Patterns

Bruce He — Sat, 04 Apr 2026 01:44:28 +0000

Originally published at heyuan110.com

Your single OpenClaw agent worked great for two weeks. Then it started hallucinating project context into unrelated conversations, confusing coding tasks with writing tasks, and taking 15 seconds to respond because its memory index had grown to 200MB.

The problem is not the model. The problem is architectural: one agent cannot hold unlimited context domains without degradation. The solution is multiple specialized agents with isolated workspaces.

This guide covers:

Why multi-agent (the single-agent ceiling)
Agent creation and model routing configuration
Binding-based routing (most-specific-wins priority)
Agent-to-agent communication via sessions_send
Four production patterns: Supervisor, Router, Pipeline, Parallel
Cost optimization strategies
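
Most-specific-wins routing, as named in the list above, can be sketched generically. The Binding fields (channel, peer) and the route function are hypothetical and do not reflect OpenClaw's actual configuration schema; the point is only the priority rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Binding:
    agent: str
    channel: Optional[str] = None  # None acts as a wildcard
    peer: Optional[str] = None     # None acts as a wildcard

    def matches(self, channel: str, peer: str) -> bool:
        return self.channel in (None, channel) and self.peer in (None, peer)

    def specificity(self) -> int:
        # More pinned fields means a more specific binding.
        return sum(value is not None for value in (self.channel, self.peer))


def route(bindings: List[Binding], channel: str, peer: str,
          default: str = "main") -> str:
    """Return the agent of the most specific matching binding."""
    candidates = [b for b in bindings if b.matches(channel, peer)]
    if not candidates:
        return default
    return max(candidates, key=Binding.specificity).agent


bindings = [
    Binding("general", channel="slack"),
    Binding("coder", channel="slack", peer="dev-team"),
]
print(route(bindings, "slack", "dev-team"))
print(route(bindings, "slack", "random"))
print(route(bindings, "email", "anyone"))
```

When two bindings tie on specificity, max keeps the first one in the list, so declaration order acts as the tiebreaker in this sketch.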


How to Write CLAUDE.md Files That Actually Work (Harness Engineering #2)

Bruce He — Sat, 04 Apr 2026 01:43:06 +0000

Originally published at heyuan110.com

This is Part 2 of the Harness Engineering series. Most CLAUDE.md files are bad — not because people don't try, but because they optimize for the wrong thing.

ETH Zurich researchers tested 138 agentfiles across multiple AI coding agents. The results:

Human-written, concise (<60 lines): +4% success rate
LLM-generated, verbose (200+ lines): -3% success rate, +20% token cost

LLM-generated files made agents worse.
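
A quick length check along these lines is easy to script. This sketch counts non-blank lines, which is an assumption about how the 60-line figure is measured; check_agentfile is a hypothetical helper, not a tool from the study.

```python
from typing import Tuple


def check_agentfile(text: str, max_lines: int = 60) -> Tuple[bool, int]:
    """Return (is_concise, non_blank_line_count) for an agent file's text."""
    count = sum(1 for line in text.splitlines() if line.strip())
    return count < max_lines, count


sample = "# Project\n\nRun tests with make test.\nNever edit generated files.\n"
ok, count = check_agentfile(sample)
print(f"{count} non-blank lines, concise={ok}")
```

To run it on a real file, pass in the result of pathlib.Path("CLAUDE.md").read_text(). Line count is only a proxy, so a short file can still be a bad one.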

This guide covers:

The 60-line principle: what to include, what to leave out
Anti-pattern gallery (documentation dump, LLM manifesto, everything file)
Progressive disclosure with Skills
Templates for 3 project types (monorepo, API, frontend)
How to measure if your CLAUDE.md is working
