A story about attention dilution, architectural reasoning, and the counterintuitive fix that finally worked.
You've spent weeks refining your AI code-review skill. You've added explicit rules. You've rewritten the checklist. You've added mandatory language: "Execute ALL checklist categories regardless of how many High findings have already been identified."
The next week, a Medium-severity performance issue slips through again.
The model had found 4 High-severity concurrency bugs in the same function. It was warned. The rule was right there in its context. It did it anyway.
Here's the hard truth we learned after many rounds of iteration: you're not dealing with a prompting problem. You're dealing with an architecture problem. And no amount of prompt engineering will fix an architecture problem.
The Bug That Kept Getting Missed
Consider this Go function:
func getBatchUser(ctx context.Context, userKeys []*UserKey) ([]*User, error) {
    userList := make([]*User, 0)
    var wg sync.WaitGroup
    for i, u := range userKeys {
        if u == nil {
            continue
        }
        wg.Add(1)
        go func() {
            defer wg.Done()
            user, err := redis.GetGuest(ctx, u.Id)
            if err != nil {
                log.WarnContextf(ctx, "no found guest user: %v", u)
                continue
            }
            userList = append(userList, user)
        }()
    }
    return userList, nil
}
There are multiple issues here. Our code-review skill correctly found the four obvious High-severity ones — compile errors, data race, goroutine leak, loop variable capture. But the full issue count, once all rounds of validation completed, was 13. The single skill captured 8. That's a 62% detection rate.
The five missed findings included:
- No `defer recover()` in goroutine — an unhandled panic inside `go func()` terminates the entire process, not just the goroutine
- Unbounded goroutine spawning — goroutine count scales linearly with `len(userKeys)` with no semaphore, rate limit, or worker pool; a large batch exhausts memory
- Missing `wg.Wait()` — the function returns before any goroutine completes, making `userList` always empty and the return value meaningless
- Slice without pre-allocation — `make([]*User, 0)` with a known upper bound `len(userKeys)` causes repeated reallocation in the hot path
- Silent error discard — errors are logged but not propagated; the caller receives an empty list and a `nil` error, with no way to know which users failed
None of these were exotic edge cases. All were in the skill's explicit checklist. When we pointed out the most instructive miss — slice pre-allocation — the model acknowledged it immediately:
"After 4 High-severity concurrent defects consumed my attention, I was not careful enough walking through the Performance checklist and mistakenly categorized this as 'a minor issue that can be ignored' without formally reporting it."
The model knew the rule. It had the checklist. It still didn't apply it. This pattern — High-severity findings crowding out Medium-severity ones across multiple dimensions — reproduced consistently across test cases. It wasn't random variance. It was structural.
Why Prompts Can't Fix This
When an AI model handles 5 review dimensions in a single call, all that knowledge coexists in one context window:
[Single Agent's Context Window]
┌─────────────────────────────────────────┐
│ Security rules (SQL injection, ...) │
│ Concurrency rules (races, leaks...) │
│ Performance rules (pre-alloc, ...) │ ← squeezed out
│ Error-handling rules (wrap, nil...) │ ← squeezed out
│ Quality rules (naming, structure...) │ ← squeezed out
│ │
│ Findings found so far: │
│ ├── [High] compile error ←─────┐ │
│ ├── [High] data race ←─────┤ attention here
│ ├── [High] goroutine leak ←─────┤ │
│ └── [High] loop capture ←─────┘ │
│ │
│ Performance checklist: │
│ Slice Pre-allocation → ??? (skipped) │ ← insufficient attention
└─────────────────────────────────────────┘
We tried every reasonable prompt fix:
| Mitigation | Effect | Limitation |
|---|---|---|
| "Execute ALL checklist categories regardless of High findings" | Partially effective | The rule itself competes for attention in the same context |
| Memory note: "High findings must not cause skipping" | Helps next session | Does not fix multi-dimension competition in the current call |
| Stronger mandatory language + repetition | Limited improvement | LLM attention allocation is probabilistic; instructions can't override it |
Each fix reduced the miss rate somewhat. None eliminated it. The ceiling was about 67% for the model-execution class of misses — documented across multiple real cases. The remaining 33% persisted no matter how strongly we phrased the instruction.
This is not a prompting problem. The model's attention is finite and shared across everything in the context window. When High findings accumulate, they dominate attention at inference time. This is structural.
Why Multi-Agent Is the Right Direction
The evolution here mirrors what happened in software engineering when monolithic codebases grew too large to maintain:
| Software Evolution | AI Agent Evolution |
|---|---|
| Monolith codebase too large to maintain | Single agent context window accumulates too much, performance degrades |
| Single-point failure affects the whole system | One dimension's High findings contaminate the entire review |
| Cannot scale modules independently | Cannot choose optimal model per task type |
| Responsibility boundaries blurry | Agent role confusion degrades output quality |
Just as large monolithic applications eventually need microservices, a monolithic agent needs vertical specialization when the task is complex enough.
A Multi-Agent architecture means multiple AI agents collaborate under clear role assignments — each with its own context window, a dedicated toolset, and well-defined responsibilities. For Go code review, this maps to four concrete advantages:
| Advantage | Mechanism | What it means here |
|---|---|---|
| Focused context window | Each sub-agent runs in a fresh, clean context uncontaminated by other dimensions' findings | Concurrency finding 4 High issues does not affect Performance's sensitivity to `make([]*User, 0)` |
| Deep specialization | Each agent's prompt focuses on a single domain with a minimal toolset | Security agent sees only security defects; no need to juggle five dimensions at once |
| Multi-perspective quality assurance | Multiple agents evaluate independently, unaware of each other's findings | Cross-dimension cross-validation, not just serial checklists |
| Flexible model assignment | Lead uses a stronger model for triage and aggregation; workers use faster models for review | Triage + deduplication with Sonnet; workers with Haiku to control cost |
Anthropic's internal research provides quantitative support: in the BrowseComp benchmark, token usage alone explained 80% of performance variance across agents. The key factor wasn't model capability — it was how much "clean context" each agent had to work with. Context contamination degrades single-agent performance in a measurable, predictable way.
Choosing the Right Orchestration Pattern
Once you've decided to go Multi-Agent, the next question is: which orchestration pattern?
Anthropic defines five foundational patterns. We evaluated all five against the Go code-review scenario before settling on one:
| Pattern | Core Mechanism | Assessment for this scenario | Fit? |
|---|---|---|---|
| 1. Prompt Chaining | Linear step sequence; each step's output feeds the next | Security/concurrency/performance dimensions have no sequential dependencies — not a sequencing problem | ✗ |
| 2. Routing | Classify input, route to one specialized handler | A single review must cover multiple dimensions simultaneously, not pick one | ✗ |
| 3. Parallelization | Multiple parallel paths; subtasks fixed at design time | Close to what's needed, but fixed subtasks mean all branches always run — can't prune based on content | △ |
| 4. Orchestrator-Workers | Central orchestrator dynamically decomposes tasks, dispatches workers on demand | Best match — review dimensions are determined by code content at runtime | ✓ |
| 5. Evaluator-Optimizer | Generate → evaluate → refine iterative loop | Code review is a diagnostic task, not an iterative generation task | ✗ |
The key distinction is between Pattern 3 and Pattern 4. Both support parallelism. The difference is where subtasks come from:
Parallelization (Pattern 3):
Code → [Fixed dispatch: Security + Performance + Quality + Logic + ...] → Aggregate
Subtasks are fixed at design time; every review runs all N paths
Orchestrator-Workers (Pattern 4):
Code → [Lead Agent analyzes diff] → Dynamic decision → Dispatch K paths (K ≤ N) → Aggregate
Subtasks are decided at runtime based on code content
Which agents to dispatch depends on what the code actually contains:
- Code only renames variables → Quality + Logic (2 agents)
- Code introduces `go func` + `sync.WaitGroup` → also Concurrency + Error (4 agents)
- Code contains `make([]*T, 0)` + batch function names → also Performance (5 agents)
- Code has `_test.go` changes → also Test (6 agents)
This "content-driven dimension selection" cannot be known at design time. The orchestrator must decide dynamically at runtime — exactly the scenario Anthropic defines as the Orchestrator-Workers applicable case: "Cannot predict which subtasks will be needed in advance; the Orchestrator must decide dynamically based on input."
Forcing Pattern 3 means launching all 7 agents on every review. A 5-line variable rename incurs the same token cost as a full concurrency + security audit. Triage is the orchestrator's core value.
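The triage step can be sketched as a set of pattern-gated rules over the diff. The patterns, agent names, and always-on baseline below are assumptions for illustration, not the published go-review-lead Skill's exact triggers.

```go
package main

import (
	"fmt"
	"regexp"
)

// triage sketches content-driven dimension selection: start from an always-on
// baseline, then add agents only when the diff matches a dimension's trigger.
func triage(diff string) []string {
	agents := []string{"quality", "logic"} // always-on baseline
	rules := []struct {
		pattern *regexp.Regexp
		adds    []string
	}{
		{regexp.MustCompile(`go\s+func|sync\.(WaitGroup|Mutex)`), []string{"concurrency", "error"}},
		{regexp.MustCompile(`make\(\[\][^,)]+,\s*0\s*\)|Batch|Multi|GetAll`), []string{"performance"}},
		{regexp.MustCompile(`_test\.go`), []string{"test"}},
		{regexp.MustCompile(`(?i)\bsql\b|exec\(|query\(`), []string{"security"}},
	}
	seen := map[string]bool{"quality": true, "logic": true}
	for _, r := range rules {
		if !r.pattern.MatchString(diff) {
			continue // dimension not needed for this diff — agent never dispatched
		}
		for _, a := range r.adds {
			if !seen[a] {
				seen[a] = true
				agents = append(agents, a)
			}
		}
	}
	return agents
}

func main() {
	fmt.Println(triage("func renameVar() {}"))                   // [quality logic]
	fmt.Println(triage("go func() {}()\nvar wg sync.WaitGroup")) // [quality logic concurrency error]
}
```

A 5-line rename dispatches 2 agents instead of 7; the token cost scales with what the code actually contains.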
The Architecture: Skill-Agent Collaboration
Architecture Overview
PR Diff / Code Snippet
│
↓
[Main conversation + go-review-lead Skill]
Role: triage + dispatch + aggregation
Does NOT load vertical review Skills
Does NOT directly review code
│
Phases 1-4: Triage
grep + pattern matching → which dimensions?
│
┌────────┬────────┬───────┼───────┬────────┬────────┐
↓ ↓ ↓ ↓ ↓ ↓ ↓
[Security][Concurr][Perf] [Error] [Quality] [Test] [Logic]
Agent Agent Agent Agent Agent Agent Agent
│ │ │ │ │ │ │
Load Load Load Load Load Load Load
security concurr perf error quality test logic
Skill Skill Skill Skill Skill Skill Skill
│ │ │ │ │ │ │
Review Review Review Review Review Review Review
independently in each clean context
└────────┴────────┴───────┴───────┴────────┴────────┘
│
↓
Main conversation aggregates
Merge findings + deduplicate + sort by severity
│
↓
Final report
Three architecture options were considered. Two simpler ones failed:
| Architecture | Characteristics | Known problems | Recommended? |
|---|---|---|---|
| A: Single Skill | 1 agent, all review knowledge, one call | Attention dilution; High findings suppress other dimensions; proven misses | Basic scenarios only |
| B: Multi-Agent, no Skills | 7 agents, prompt-only, no Skills loaded | Clean context, but no domain review rules; relies on AI general knowledge | Not recommended |
| C: Multi-Agent + vertical Skills | Main conversation orchestrates via Skill; 7 workers each load one domain Skill | Slightly higher design cost | ✅ Recommended |
Core Design Principles
Principle 1: Each agent loads exactly one dimension's Skill.
A Performance Agent's context contains only performance-related knowledge and the code under review — no other dimensions' rules, no other agents' findings. This significantly raises the probability that the model focuses its attention on the Performance checklist.
Principle 2: The orchestrator does not review code.
After loading go-review-lead, the main conversation acts as a neutral coordinator — triage and aggregation only. If the orchestrator also reviewed code, its own findings would bias its aggregation of workers' results, recreating the same attention-competition problem as the heavy single skill.
Principle 3: The orchestration logic must be a Skill in the main conversation, not an agent definition.
Claude Code subagents cannot spawn other subagents. If go-review-lead were configured as an agent definition file, its parallel dispatch calls to the 7 vertical agents would be silently ignored — they'd degrade to serial execution or not run at all. The orchestration Skill runs in the main conversation, not in .claude/agents/.
Back to the Case: Why Misses Are Less Likely
| Agent | Context contains | What it finds in this case |
|---|---|---|
| Concurrency Agent | only concurrency rules + code | 4 High findings (races, leaks, loop capture, wg.Wait()) |
| Performance Agent | only performance rules + code | make([]*User, 0) pre-allocation miss — significantly less likely to be crowded out |
| Error Agent | only error-handling rules + code | silent error discard, continue inside goroutine |
| Quality Agent | only quality rules + code | unused variables, naming issues |
| Logic Agent | only logic rules + code | return contract violation |
| Lead (orchestrator) | only the 5 structured reports | merge, deduplicate, sort by severity |
The Performance Agent does not need to be aware of those 4 High concurrency bugs. Its context holds only the Performance checklist and the code. For checklist items like Slice Pre-allocation, this isolation substantially reduces the risk of attention being captured by severity-dominant findings from other dimensions. It does not make misses impossible — but it makes them significantly less likely and more systematic to diagnose when they do occur.
Skills and Agents: Not the Same Thing
One concept worth clarifying before you implement this, because it's easy to conflate them:
| | Skill | Agent |
|---|---|---|
| What it is | A knowledge package — workflow, checklist, reference rules | An execution unit — an LLM instance with its own context window |
| Answers | "What to do" and "how to do it" | "Who does it" and "where it runs" |
| Lives in | `SKILL.md` + `references/` + `scripts/` | Agent definition file (role + tools + which Skill to load) |
| Core value | Encodes domain expertise; makes AI exceed general capability | Provides execution isolation; each task runs in a clean context |
A Skill without an Agent is expertise that still runs in a polluted context. An Agent without a Skill is isolation without domain knowledge. Architecture C combines both.
At runtime, an agent loads its Skill on demand — it does not copy the Skill's content into its own definition:
Performance Agent (independent context window) starts
│
│ Reads its definition: "load go-performance-review Skill"
│ Calls Skill("go-performance-review")
│ ↓
│ SKILL.md checklist and rules load into current context
│
│ Context now contains:
│ ✓ Performance checklist (from Skill)
│ ✓ Code under review
│ ✗ Concurrency rules (not loaded)
│ ✗ Other agents' findings (isolated)
│
↓
Executes review → returns structured result
Agent definition files stay lightweight (role + tools + which Skill to load). Skill files stay independent and reusable across agents. No duplication.
The Counterintuitive Finding
With the architecture designed, we ran a controlled experiment:
| Configuration | Model | Findings captured | Detection rate |
|---|---|---|---|
| Single Agent | Opus 4 | 8/13 | 62% |
| Multi-Agent Orchestrator-Workers | Sonnet 4 Workers | 13/13 | 100% |
A fleet of mid-tier Sonnet agents outperformed a single top-tier Opus agent. Not because Sonnet is "smarter" — Opus genuinely outperforms Sonnet on a focused single task. The difference is task structure.
When Opus handles 5 dimensions simultaneously, attention dilution systematically degrades its per-dimension performance. When Sonnet handles only one dimension in a clean context, it operates near full focus with no cross-dimension competition. Sonnet × N focused agents can outperform Opus × 1 generalist agent on multi-dimensional tasks.
This changes the question you should be asking. The old question: "Which is the most powerful model I should use?" The better question: "Can I restructure my task so each agent only needs to excel at one thing?"
Three Experiments, Three Lessons
Before the final architecture stabilized, we ran three validation rounds on the same getBatchUser function. Each round taught us something unexpected.
Round 1: Single Skill Baseline
Result: 8/13 (62%) — 5 findings missed
The single go-code-reviewer skill found 4 High-severity concurrency bugs but missed the full set of Medium-severity findings: slice pre-allocation, unbounded goroutine spawning, no panic recovery in goroutines, missing wg.Wait(), and silent error discard. The model acknowledged the misses when prompted. Architecture refactoring was warranted.
Round 2: Multi-Agent v1 (No Grep Gating)
We split the single skill into 7 vertical agents — Security, Concurrency, Performance, Error, Quality, Test, Logic — each running in a clean independent context.
Result: Unstable — sometimes captured, sometimes not
Two new problems emerged:
Problem 1: Triage blind spot. The orchestrator's Phase 2 trigger for the Performance agent was looking for make calls with a capacity argument. make([]*User, 0) — the case without a capacity argument — was the very pattern we needed to catch. The trigger fired in reverse. The Performance agent was never dispatched.
Problem 2: Within-dimension attention dilution. Even with the Concurrency agent running in its own clean context, when that context contained 4 High-severity compile errors and data races, "unbounded goroutine creation" (Medium-severity) still got deprioritized. Isolated contexts solved cross-dimension dilution. They did not solve within-dimension dilution when severity disparity was high enough.
The actual report came back with:
- Skipped skills: go-performance-reviewer (no hot-path loops or DB patterns)
← triage blind spot caused the skip
Residual Risk:
Unbounded goroutine spawning: Not flagged as a finding since expected
batch size is unknown ← buried in Residual Risk, not formally reported
Architecture refactoring alone wasn't enough. We needed both isolated contexts and a mechanism that didn't depend on attention to walk the checklist.
Round 3: Multi-Agent + Grep-Gated
Two fixes:
1. Fix the triage: rewrite the Phase 2 trigger to detect zero-capacity `make` explicitly, and add a Phase 3 heuristic — function names containing `Batch`, `Multi`, or `GetAll` automatically trigger the Performance agent. `getBatchUser` hits directly.
2. Introduce the Grep-Gated protocol: for checklist items with clear syntactic features, run a mechanical `grep` scan before asking the model to reason about them.
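The corrected triggers can be sketched as two regexes. These are illustrative reconstructions of the fix described above; the published Skill's actual patterns may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Phase 2 fix: match make of a slice with an explicit zero length and no
	// capacity argument, e.g. make([]*User, 0) — the case the original trigger inverted.
	zeroCapMake = regexp.MustCompile(`make\(\[\][^,)]+,\s*0\s*\)`)

	// Phase 3 heuristic: batch-style function names signal a hot path.
	batchName = regexp.MustCompile(`func\s+\w*(Batch|Multi|GetAll)\w*\(`)
)

func needsPerformanceAgent(code string) bool {
	return zeroCapMake.MatchString(code) || batchName.MatchString(code)
}

func main() {
	snippet := `func getBatchUser(ctx context.Context, keys []*UserKey) ([]*User, error) {
	userList := make([]*User, 0)
	return userList, nil
}`
	fmt.Println(needsPerformanceAgent(snippet)) // prints "true": both triggers fire
}
```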
Result: 13/13 (100%) — stable across repeated runs
The Grep-Gated Execution Protocol
This is the mechanism that made Round 3 work, and it's worth explaining carefully.
The key insight: the model is not a human code reviewer. It has tools it can use.
Human reviewers scan with their eyes and rely on attention to find matches. We were asking the AI to do the same — use "attention" to walk through a checklist and search for matches. But grep is deterministic. It doesn't get tired. It doesn't have "attention."
The execution flow for each sub-agent:
1. Load the domain Skill (checklist + rules + grep patterns)
2. Write code to $TMPDIR/review_snippet.go
3. For all grep-gated checklist items → run grep with patterns from the Skill
4. grep HIT → model performs semantic confirmation (true positive vs false positive)
5. grep MISS → automatically mark NOT FOUND, skip semantic analysis
6. Items without grep patterns (pure semantic) → full model reasoning
7. Report only FOUND items
8. Audit line: "Grep pre-scan: X/Y items hit, Z confirmed"
Coverage across 7 skills, 86 checklist items:
| Skill | Total Items | Grep-able | Semantic Only |
|---|---|---|---|
| go-concurrency-review | 14 | 13 | 1 |
| go-performance-review | 12 | 10 | 2 |
| go-error-review | 12 | 12 | 0 |
| go-security-review | 16 | 14 | 2 |
| go-quality-review | 12 | 8 | 4 |
| go-test-review | 10 | 8 | 2 |
| go-logic-review | 10 | 0 | 10 |
| Total | 86 | 65 (75%) | 21 (25%) |
75% of checklist items are now mechanically pre-scanned. The model's attention is reserved for the remaining 25% of genuinely semantic items — and for confirming grep hits rather than searching for them.
The slice pre-allocation miss, which survived two rounds of architecture improvements, was caught in Round 3 via:
Grep hit: make([]*User, 0) at L10.
Function name getBatchUser signals batch hot path.
→ REV-009 [Medium] formally reported.
A grep pattern doesn't get its attention crowded out by High-severity findings. That's the point.
One design principle worth noting: the protocol uses a wide-net strategy — prefer false-positive grep hits over false-negative misses. A false positive costs one extra semantic confirmation. A false negative means the issue is permanently gone. Pattern design should err toward broader matches.
Cost vs. Quality
| Approach | Simple style PR | Complex concurrency PR | Full-scope refactor |
|---|---|---|---|
| All 7 agents (no triage) | ~$0.16 | ~$0.16 | ~$0.16 |
| Triage + on-demand dispatch | ~$0.02 | ~$0.07 | ~$0.10 |
| Original single skill | ~$0.03 | ~$0.03 (but misses) | ~$0.03 (but misses) |
On simple PRs, triage saves ~87% of cost versus running everything. On complex PRs, cost is comparable to the full fleet — but quality is significantly better than the single-skill approach. The triage cost itself (Level 1 file-type grep + Level 2 fast model diff scan) runs under $0.001 per call — negligible.
The Broader Principle
We learned something that has changed how we think about AI engineering decisions:
For multi-dimensional tasks, the limiting factor is not model capability — it's context organization.
Opus in a polluted context loses to Sonnet in an isolated one. More compute applied to the wrong architecture doesn't solve the problem; it makes it more expensive. You don't need to wait for the next-generation model to fix systematic misses — architecture refactoring works on the models you already have, and it's more controllable and more predictable than hoping a stronger model pays more attention.
The decision framework shifts: from "which model should I use?" to "how should I structure the task so each agent only needs to succeed at one thing?"
The Implementation Is Open Source
Everything described in this article is published and deployable. The directory layout:
skills/
├── go-review-lead/SKILL.md # orchestration logic — runs in main conversation
├── go-security-review/SKILL.md # SQL injection, XSS, key leakage, permissions
├── go-concurrency-review/SKILL.md # races, goroutine leaks, deadlocks, WaitGroup
│ └── references/go-concurrency-patterns.md
├── go-performance-review/SKILL.md # pre-allocation, N+1, indexes, memory
│ └── references/go-performance-patterns.md
├── go-error-review/SKILL.md # error wrapping, resource close, panic handling
├── go-quality-review/SKILL.md # naming, structure, lint rules
├── go-test-review/SKILL.md # coverage, assertion quality, test isolation
└── go-logic-review/SKILL.md # business logic, boundaries, nil propagation
.claude/agents/ # 7 vertical worker agents — drop in and use
├── go-security-reviewer.md
├── go-concurrency-reviewer.md
├── go-performance-reviewer.md
├── go-error-reviewer.md
├── go-quality-reviewer.md
├── go-test-reviewer.md
└── go-logic-reviewer.md
To deploy: copy the skills/ directories to ~/.claude/skills/ (user-level) or .claude/skills/ (project-level), copy the agent definition files to .claude/agents/, then invoke go-review-lead from the main conversation. The deployment guide with prerequisites and usage examples is at outputexample/go-review-lead/README.md.
The full methodology — skill design, quantitative A/B evaluation, golden test fixtures, zero-LLM regression tests, and the iteration framework that produced these results — is at:
github.com/johnqtcg/awesome-skills
The 29 production-ready skills and 42 paired evaluation reports (EN + ZH) are the examples; the methodology is the deliverable.
If you've hit the same wall — model keeps missing things despite explicit rules — the diagnosis is probably the same one we found. The context window is not infinitely attentive. Architecture is the lever that prompts can't reach.