A story about attention dilution, architectural reasoning, and the counterintuitive fix that finally worked.
You've spent weeks refining your AI code-review skill. You've added explicit rules. You've rewritten the checklist. You've added mandatory language: "Execute ALL checklist categories regardless of how many High findings have already been identified."
The next week, a Medium-severity performance issue slips through again.
The model had found 4 High-severity concurrency bugs in the same function. It was warned. The rule was right there in its context. It did it anyway.
Here's the hard truth we learned after many rounds of iteration: you're not dealing with a prompting problem. You're dealing with an architecture problem. And no amount of prompt engineering will fix an architecture problem.
The Bug That Kept Getting Missed
Consider this Go function:
func getBatchUser(ctx context.Context, userKeys []*UserKey) ([]*User, error) {
    userList := make([]*User, 0)
    var wg sync.WaitGroup
    for i, u := range userKeys {
        if u == nil {
            continue
        }
        wg.Add(1)
        go func() {
            defer wg.Done()
            user, err := redis.GetGuest(ctx, u.Id)
            if err != nil {
                log.WarnContextf(ctx, "no found guest user: %v", u)
                continue
            }
            userList = append(userList, user)
        }()
    }
    return userList, nil
}
There are multiple issues here. Our code-review skill correctly found the four obvious High-severity ones — compile errors, data race, goroutine leak, loop variable capture. But the full issue count, once all rounds of validation completed, was 13. The single skill captured 8. That's a 62% detection rate.
The five missed findings included:
- No `defer recover()` in goroutine — an unhandled panic inside `go func()` terminates the entire process, not just the goroutine
- Unbounded goroutine spawning — goroutine count scales linearly with `len(userKeys)` with no semaphore, rate limit, or worker pool; a large batch exhausts memory
- Missing `wg.Wait()` — the function returns before any goroutine completes, making `userList` always empty and the return value meaningless
- Slice without pre-allocation — `make([]*User, 0)` with a known upper bound `len(userKeys)` causes repeated reallocation in the hot path
- Silent error discard — errors are logged but not propagated; the caller receives an empty list and a `nil` error, with no way to know which users failed
None of these were exotic edge cases. All were in the skill's explicit checklist. When we pointed out the most instructive miss — slice pre-allocation — the model acknowledged it immediately:
"After 4 High-severity concurrent defects consumed my attention, I was not careful enough walking through the Performance checklist and mistakenly categorized this as 'a minor issue that can be ignored' without formally reporting it."
The model knew the rule. It had the checklist. It still didn't apply it. This pattern — High-severity findings crowding out Medium-severity ones across multiple dimensions — reproduced consistently across test cases. It wasn't random variance. It was structural.
Why Prompts Can't Fix This
When an AI model handles 5 review dimensions in a single call, all that knowledge coexists in one context window:
[Single Agent's Context Window]
┌─────────────────────────────────────────┐
│ Security rules (SQL injection, ...) │
│ Concurrency rules (races, leaks...) │
│ Performance rules (pre-alloc, ...) │ ← squeezed out
│ Error-handling rules (wrap, nil...) │ ← squeezed out
│ Quality rules (naming, structure...) │ ← squeezed out
│ │
│ Findings found so far: │
│ ├── [High] compile error ←─────┐ │
│ ├── [High] data race ←─────┤ attention here
│ ├── [High] goroutine leak ←─────┤ │
│ └── [High] loop capture ←─────┘ │
│ │
│ Performance checklist: │
│ Slice Pre-allocation → ??? (skipped) │ ← insufficient attention
└─────────────────────────────────────────┘
We tried every reasonable prompt fix:
| Mitigation | Effect | Limitation |
|---|---|---|
| "Execute ALL checklist categories regardless of High findings" | Partially effective | The rule itself competes for attention in the same context |
| Memory note: "High findings must not cause skipping" | Helps next session | Does not fix multi-dimension competition in the current call |
| Stronger mandatory language + repetition | Limited improvement | LLM attention allocation is probabilistic; instructions can't override it |
Each fix reduced the miss rate somewhat. None eliminated it. The ceiling was about 67% for the model-execution class of misses — documented across multiple real cases. The remaining 33% persisted no matter how strongly we phrased the instruction.
This is not a prompting problem. The model's attention is finite and shared across everything in the context window. When High findings accumulate, they dominate attention at inference time. This is structural.
Why Multi-Agent Is the Right Direction
The evolution here mirrors what happened in software engineering when monolithic codebases grew too large to maintain:
| Software Evolution | AI Agent Evolution |
|---|---|
| Monolith codebase too large to maintain | Single agent context window accumulates too much, performance degrades |
| Single-point failure affects the whole system | One dimension's High findings contaminate the entire review |
| Cannot scale modules independently | Cannot choose optimal model per task type |
| Responsibility boundaries blurry | Agent role confusion degrades output quality |
Just as large monolithic applications eventually need microservices, a monolithic agent needs vertical specialization when the task is complex enough.
A Multi-Agent architecture means multiple AI agents collaborate under clear role assignments — each with its own context window, a dedicated toolset, and well-defined responsibilities. For Go code review, this maps to four concrete advantages:
| Advantage | Mechanism | What it means here |
|---|---|---|
| Focused context window | Each sub-agent runs in a fresh, clean context uncontaminated by other dimensions' findings | Concurrency finding 4 High issues does not affect Performance's sensitivity to `make([]*User, 0)` |
| Deep specialization | Each agent's prompt focuses on a single domain with a minimal toolset | Security agent sees only security defects; no need to juggle five dimensions at once |
| Multi-perspective quality assurance | Multiple agents evaluate independently, unaware of each other's findings | Cross-dimension cross-validation, not just serial checklists |
| Flexible model assignment | Lead uses a stronger model for triage and aggregation; workers use faster models for review | Triage + deduplication with Sonnet; workers with Haiku to control cost |
Anthropic's internal research provides quantitative support: in the BrowseComp benchmark, token usage alone explained 80% of performance variance across agents. The key factor wasn't model capability — it was how much "clean context" each agent had to work with. Context contamination degrades single-agent performance in a measurable, predictable way.
Choosing the Right Orchestration Pattern
Once you've decided to go Multi-Agent, the next question is: which orchestration pattern?
Anthropic defines five foundational patterns. We evaluated all five against the Go code-review scenario before settling on one:
| Pattern | Core Mechanism | Assessment for this scenario | Fit? |
|---|---|---|---|
| 1. Prompt Chaining | Linear step sequence; each step's output feeds the next | Security/concurrency/performance dimensions have no sequential dependencies — not a sequencing problem | ✗ |
| 2. Routing | Classify input, route to one specialized handler | A single review must cover multiple dimensions simultaneously, not pick one | ✗ |
| 3. Parallelization | Multiple parallel paths; subtasks fixed at design time | Close to what's needed, but fixed subtasks mean all branches always run — can't prune based on content | △ |
| 4. Orchestrator-Workers | Central orchestrator dynamically decomposes tasks, dispatches workers on demand | Best match — review dimensions are determined by code content at runtime | ✓ |
| 5. Evaluator-Optimizer | Generate → evaluate → refine iterative loop | Code review is a diagnostic task, not an iterative generation task | ✗ |
The key distinction is between Pattern 3 and Pattern 4. Both support parallelism. The difference is where subtasks come from:
Parallelization (Pattern 3):
Code → [Fixed dispatch: Security + Performance + Quality + Logic + ...] → Aggregate
Subtasks are fixed at design time; every review runs all N paths
Orchestrator-Workers (Pattern 4):
Code → [Lead Agent analyzes diff] → Dynamic decision → Dispatch K paths (K ≤ N) → Aggregate
Subtasks are decided at runtime based on code content
Which agents to dispatch depends on what the code actually contains:
- Code only renames variables → Quality + Logic (2 agents)
- Code introduces `go func` + `sync.WaitGroup` → also Concurrency + Error (4 agents)
- Code contains `make([]*T, 0)` + batch function names → also Performance (5 agents)
- Code has `_test.go` changes → also Test (6 agents)
This "content-driven dimension selection" cannot be known at design time. The orchestrator must decide dynamically at runtime — exactly the scenario Anthropic defines as the Orchestrator-Workers applicable case: "Cannot predict which subtasks will be needed in advance; the Orchestrator must decide dynamically based on input."
Forcing Pattern 3 means launching all 7 agents on every review. A 5-line variable rename incurs the same token cost as a full concurrency + security audit. Triage is the orchestrator's core value.
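The triage step can be sketched as a set of pattern-gated rules over the diff. The patterns, agent names, and always-on baseline below are assumptions for illustration, not the published go-review-lead Skill's exact triggers.

```go
package main

import (
	"fmt"
	"regexp"
)

// triage sketches content-driven dimension selection: start from an always-on
// baseline, then add agents only when the diff matches a dimension's trigger.
func triage(diff string) []string {
	agents := []string{"quality", "logic"} // always-on baseline
	rules := []struct {
		pattern *regexp.Regexp
		adds    []string
	}{
		{regexp.MustCompile(`go\s+func|sync\.(WaitGroup|Mutex)`), []string{"concurrency", "error"}},
		{regexp.MustCompile(`make\(\[\][^,)]+,\s*0\s*\)|Batch|Multi|GetAll`), []string{"performance"}},
		{regexp.MustCompile(`_test\.go`), []string{"test"}},
		{regexp.MustCompile(`(?i)\bsql\b|exec\(|query\(`), []string{"security"}},
	}
	seen := map[string]bool{"quality": true, "logic": true}
	for _, r := range rules {
		if !r.pattern.MatchString(diff) {
			continue // dimension not needed for this diff — agent never dispatched
		}
		for _, a := range r.adds {
			if !seen[a] {
				seen[a] = true
				agents = append(agents, a)
			}
		}
	}
	return agents
}

func main() {
	fmt.Println(triage("func renameVar() {}"))                   // [quality logic]
	fmt.Println(triage("go func() {}()\nvar wg sync.WaitGroup")) // [quality logic concurrency error]
}
```

A 5-line rename dispatches 2 agents instead of 7; the token cost scales with what the code actually contains.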
The Architecture: Skill-Agent Collaboration
Architecture Overview
PR Diff / Code Snippet
│
↓
[Main conversation + go-review-lead Skill]
Role: triage + dispatch + aggregation
Does NOT load vertical review Skills
Does NOT directly review code
│
Phases 1-4: Triage
grep + pattern matching → which dimensions?
│
┌────────┬────────┬───────┼───────┬────────┬────────┐
↓ ↓ ↓ ↓ ↓ ↓ ↓
[Security][Concurr][Perf] [Error] [Quality] [Test] [Logic]
Agent Agent Agent Agent Agent Agent Agent
│ │ │ │ │ │ │
Load Load Load Load Load Load Load
security concurr perf error quality test logic
Skill Skill Skill Skill Skill Skill Skill
│ │ │ │ │ │ │
Review Review Review Review Review Review Review
independently in each clean context
└────────┴────────┴───────┴───────┴────────┴────────┘
│
↓
Main conversation aggregates
Merge findings + deduplicate + sort by severity
│
↓
Final report
Three architecture options were considered. Two simpler ones failed:
| Architecture | Characteristics | Known problems | Recommended? |
|---|---|---|---|
| A: Single Skill | 1 agent, all review knowledge, one call | Attention dilution; High findings suppress other dimensions; proven misses | Basic scenarios only |
| B: Multi-Agent, no Skills | 7 agents, prompt-only, no Skills loaded | Clean context, but no domain review rules; relies on AI general knowledge | Not recommended |
| C: Multi-Agent + vertical Skills | Main conversation orchestrates via Skill; 7 workers each load one domain Skill | Slightly higher design cost | ✅ Recommended |
Core Design Principles
Principle 1: Each agent loads exactly one dimension's Skill.
A Performance Agent's context contains only performance-related knowledge and the code under review — no other dimensions' rules, no other agents' findings. This significantly raises the probability that the model focuses its attention on the Performance checklist.
Principle 2: The orchestrator does not review code.
After loading go-review-lead, the main conversation acts as a neutral coordinator — triage and aggregation only. If the orchestrator also reviewed code, its own findings would bias its aggregation of workers' results, recreating the same attention-competition problem as the heavy single skill.
Principle 3: The orchestration logic must be a Skill in the main conversation, not an agent definition.
Claude Code subagents cannot spawn other subagents. If go-review-lead were configured as an agent definition file, its parallel dispatch calls to the 7 vertical agents would be silently ignored — they'd degrade to serial execution or not run at all. The orchestration Skill runs in the main conversation, not in .claude/agents/.
Back to the Case: Why Misses Are Less Likely
| Agent | Context contains | What it finds in this case |
|---|---|---|
| Concurrency Agent | only concurrency rules + code | 4 High findings (races, leaks, loop capture, wg.Wait()) |
| Performance Agent | only performance rules + code | make([]*User, 0) pre-allocation miss — significantly less likely to be crowded out |
| Error Agent | only error-handling rules + code | silent error discard, continue inside goroutine |
| Quality Agent | only quality rules + code | unused variables, naming issues |
| Logic Agent | only logic rules + code | return contract violation |
| Lead (orchestrator) | only the 5 structured reports | merge, deduplicate, sort by severity |
The Performance Agent does not need to be aware of those 4 High concurrency bugs. Its context holds only the Performance checklist and the code. For checklist items like Slice Pre-allocation, this isolation substantially reduces the risk of attention being captured by severity-dominant findings from other dimensions. It does not make misses impossible — but it makes them significantly less likely and more systematic to diagnose when they do occur.
Skills and Agents: Not the Same Thing
One concept worth clarifying before you implement this, because it's easy to conflate them:
| | Skill | Agent |
|---|---|---|
| What it is | A knowledge package — workflow, checklist, reference rules | An execution unit — an LLM instance with its own context window |
| Answers | "What to do" and "how to do it" | "Who does it" and "where it runs" |
| Lives in | `SKILL.md` + `references/` + `scripts/` | Agent definition file (role + tools + which Skill to load) |
| Core value | Encodes domain expertise; makes AI exceed general capability | Provides execution isolation; each task runs in a clean context |
A Skill without an Agent is expertise that still runs in a polluted context. An Agent without a Skill is isolation without domain knowledge. Architecture C combines both.
At runtime, an agent loads its Skill on demand — it does not copy the Skill's content into its own definition:
Performance Agent (independent context window) starts
│
│ Reads its definition: "load go-performance-review Skill"
│ Calls Skill("go-performance-review")
│ ↓
│ SKILL.md checklist and rules load into current context
│
│ Context now contains:
│ ✓ Performance checklist (from Skill)
│ ✓ Code under review
│ ✗ Concurrency rules (not loaded)
│ ✗ Other agents' findings (isolated)
│
↓
Executes review → returns structured result
Agent definition files stay lightweight (role + tools + which Skill to load). Skill files stay independent and reusable across agents. No duplication.
The Counterintuitive Finding
With the architecture designed, we ran a controlled experiment:
| Configuration | Model | Findings captured | Detection rate |
|---|---|---|---|
| Single Agent | Opus 4 | 8/13 | 62% |
| Multi-Agent Orchestrator-Workers | Sonnet 4 Workers | 13/13 | 100% |
A fleet of mid-tier Sonnet agents outperformed a single top-tier Opus agent. Not because Sonnet is "smarter" — Opus genuinely outperforms Sonnet on a focused single task. The difference is task structure.
When Opus handles 5 dimensions simultaneously, attention dilution systematically degrades its per-dimension performance. When Sonnet handles only one dimension in a clean context, it operates near full focus with no cross-dimension competition. Sonnet × N focused agents can outperform Opus × 1 generalist agent on multi-dimensional tasks.
This changes the question you should be asking. The old question: "Which is the most powerful model I should use?" The better question: "Can I restructure my task so each agent only needs to excel at one thing?"
Three Experiments, Three Lessons
Before the final architecture stabilized, we ran three validation rounds on the same getBatchUser function. Each round taught us something unexpected.
Round 1: Single Skill Baseline
Result: 8/13 (62%) — 5 findings missed
The single go-code-reviewer skill found 4 High-severity concurrency bugs but missed the full set of Medium-severity findings: slice pre-allocation, unbounded goroutine spawning, no panic recovery in goroutines, missing wg.Wait(), and silent error discard. The model acknowledged the misses when prompted. Architecture refactoring was warranted.
Round 2: Multi-Agent v1 (No Grep Gating)
We split the single skill into 7 vertical agents — Security, Concurrency, Performance, Error, Quality, Test, Logic — each running in a clean independent context.
Result: Unstable — sometimes captured, sometimes not
Two new problems emerged:
Problem 1: Triage blind spot. The orchestrator's Phase 2 trigger for the Performance agent was looking for make calls with a capacity argument. make([]*User, 0) — the case without a capacity argument — was the very pattern we needed to catch. The trigger fired in reverse. The Performance agent was never dispatched.
Problem 2: Within-dimension attention dilution. Even with the Concurrency agent running in its own clean context, when that context contained 4 High-severity compile errors and data races, "unbounded goroutine creation" (Medium-severity) still got deprioritized. Isolated contexts solved cross-dimension dilution. They did not solve within-dimension dilution when severity disparity was high enough.
The actual report came back with:
- Skipped skills: go-performance-reviewer (no hot-path loops or DB patterns)
← triage blind spot caused the skip
Residual Risk:
Unbounded goroutine spawning: Not flagged as a finding since expected
batch size is unknown ← buried in Residual Risk, not formally reported
Architecture refactoring alone wasn't enough. We needed both isolated contexts and a mechanism that didn't depend on attention to walk the checklist.
Round 3: Multi-Agent + Grep-Gated
Two fixes:
1. Fix the triage: rewrite the Phase 2 trigger to detect zero-capacity `make` explicitly, and add a Phase 3 heuristic — function names containing `Batch`, `Multi`, or `GetAll` automatically trigger the Performance agent. `getBatchUser` hits directly.
2. Introduce the Grep-Gated protocol: for checklist items with clear syntactic features, run a mechanical `grep` scan before asking the model to reason about them.
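The corrected triggers can be sketched as two regexes. These are illustrative reconstructions of the fix described above; the published Skill's actual patterns may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Phase 2 fix: match make of a slice with an explicit zero length and no
	// capacity argument, e.g. make([]*User, 0) — the case the original trigger inverted.
	zeroCapMake = regexp.MustCompile(`make\(\[\][^,)]+,\s*0\s*\)`)

	// Phase 3 heuristic: batch-style function names signal a hot path.
	batchName = regexp.MustCompile(`func\s+\w*(Batch|Multi|GetAll)\w*\(`)
)

func needsPerformanceAgent(code string) bool {
	return zeroCapMake.MatchString(code) || batchName.MatchString(code)
}

func main() {
	snippet := `func getBatchUser(ctx context.Context, keys []*UserKey) ([]*User, error) {
	userList := make([]*User, 0)
	return userList, nil
}`
	fmt.Println(needsPerformanceAgent(snippet)) // prints "true": both triggers fire
}
```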
Result: 13/13 (100%) — stable across repeated runs
The Grep-Gated Execution Protocol
This is the mechanism that made Round 3 work, and it's worth explaining carefully.
The key insight: the model is not a human code reviewer. It has tools it can use.
Human reviewers scan with their eyes and rely on attention to find matches. We were asking the AI to do the same — use "attention" to walk through a checklist and search for matches. But grep is deterministic. It doesn't get tired. It doesn't have "attention."
The execution flow for each sub-agent:
1. Load the domain Skill (checklist + rules + grep patterns)
2. Write code to $TMPDIR/review_snippet.go
3. For all grep-gated checklist items → run grep with patterns from the Skill
4. grep HIT → model performs semantic confirmation (true positive vs false positive)
5. grep MISS → automatically mark NOT FOUND, skip semantic analysis
6. Items without grep patterns (pure semantic) → full model reasoning
7. Report only FOUND items
8. Audit line: "Grep pre-scan: X/Y items hit, Z confirmed"
Coverage across 7 skills, 86 checklist items:
| Skill | Total Items | Grep-able | Semantic Only |
|---|---|---|---|
| go-concurrency-review | 14 | 13 | 1 |
| go-performance-review | 12 | 10 | 2 |
| go-error-review | 12 | 12 | 0 |
| go-security-review | 16 | 14 | 2 |
| go-quality-review | 12 | 8 | 4 |
| go-test-review | 10 | 8 | 2 |
| go-logic-review | 10 | 0 | 10 |
| Total | 86 | 65 (75%) | 21 (25%) |
75% of checklist items are now mechanically pre-scanned. The model's attention is reserved for the remaining 25% of genuinely semantic items — and for confirming grep hits rather than searching for them.
The slice pre-allocation miss, which survived two rounds of architecture improvements, was caught in Round 3 via:
Grep hit: make([]*User, 0) at L10.
Function name getBatchUser signals batch hot path.
→ REV-009 [Medium] formally reported.
A grep pattern doesn't get its attention crowded out by High-severity findings. That's the point.
One design principle worth noting: the protocol uses a wide-net strategy — prefer false-positive grep hits over false-negative misses. A false positive costs one extra semantic confirmation. A false negative means the issue is permanently gone. Pattern design should err toward broader matches.
Cost vs. Quality
| Approach | Simple style PR | Complex concurrency PR | Full-scope refactor |
|---|---|---|---|
| All 7 agents (no triage) | ~$0.16 | ~$0.16 | ~$0.16 |
| Triage + on-demand dispatch | ~$0.02 | ~$0.07 | ~$0.10 |
| Original single skill | ~$0.03 | ~$0.03 (but misses) | ~$0.03 (but misses) |
On simple PRs, triage saves ~87% of cost versus running everything. On complex PRs, cost is comparable to the full fleet — but quality is significantly better than the single-skill approach. The triage cost itself (Level 1 file-type grep + Level 2 fast model diff scan) runs under $0.001 per call — negligible.
The Broader Principle
We learned something that has changed how we think about AI engineering decisions:
For multi-dimensional tasks, the limiting factor is not model capability — it's context organization.
Opus in a polluted context loses to Sonnet in an isolated one. More compute applied to the wrong architecture doesn't solve the problem; it makes it more expensive. You don't need to wait for the next-generation model to fix systematic misses — architecture refactoring works on the models you already have, and it's more controllable and more predictable than hoping a stronger model pays more attention.
The decision framework shifts: from "which model should I use?" to "how should I structure the task so each agent only needs to succeed at one thing?"
The Implementation Is Open Source
Everything described in this article is published and deployable. The directory layout:
skills/
├── go-review-lead/SKILL.md # orchestration logic — runs in main conversation
├── go-security-review/SKILL.md # SQL injection, XSS, key leakage, permissions
├── go-concurrency-review/SKILL.md # races, goroutine leaks, deadlocks, WaitGroup
│ └── references/go-concurrency-patterns.md
├── go-performance-review/SKILL.md # pre-allocation, N+1, indexes, memory
│ └── references/go-performance-patterns.md
├── go-error-review/SKILL.md # error wrapping, resource close, panic handling
├── go-quality-review/SKILL.md # naming, structure, lint rules
├── go-test-review/SKILL.md # coverage, assertion quality, test isolation
└── go-logic-review/SKILL.md # business logic, boundaries, nil propagation
.claude/agents/ # 7 vertical worker agents — drop in and use
├── go-security-reviewer.md
├── go-concurrency-reviewer.md
├── go-performance-reviewer.md
├── go-error-reviewer.md
├── go-quality-reviewer.md
├── go-test-reviewer.md
└── go-logic-reviewer.md
To deploy: copy the skills/ directories to ~/.claude/skills/ (user-level) or .claude/skills/ (project-level), copy the agent definition files to .claude/agents/, then invoke go-review-lead from the main conversation. The deployment guide with prerequisites and usage examples is at outputexample/go-review-lead/README.md.
The full methodology — skill design, quantitative A/B evaluation, golden test fixtures, zero-LLM regression tests, and the iteration framework that produced these results — is at:
github.com/johnqtcg/awesome-skills
The 29 production-ready skills and 42 paired evaluation reports (EN + ZH) are the examples; the methodology is the deliverable.
If you've hit the same wall — model keeps missing things despite explicit rules — the diagnosis is probably the same one we found. The context window is not infinitely attentive. Architecture is the lever that prompts can't reach.