<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: jin li</title>
    <description>The latest articles on DEV Community by jin li (@ybzdqhl).</description>
    <link>https://dev.to/ybzdqhl</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3865050%2Fddac0ce2-1c19-4403-8d19-ecb432a5e445.png</url>
      <title>DEV Community: jin li</title>
      <link>https://dev.to/ybzdqhl</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ybzdqhl"/>
    <language>en</language>
    <item>
      <title>Architecture Over Model: How We Got 13/13 Bug Detection Without Upgrading to a Stronger AI</title>
      <dc:creator>jin li</dc:creator>
      <pubDate>Tue, 07 Apr 2026 06:31:46 +0000</pubDate>
      <link>https://dev.to/ybzdqhl/architecture-over-model-how-we-got-1313-bug-detection-without-upgrading-to-a-stronger-ai-4fb9</link>
      <guid>https://dev.to/ybzdqhl/architecture-over-model-how-we-got-1313-bug-detection-without-upgrading-to-a-stronger-ai-4fb9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A story about attention dilution, architectural reasoning, and the counterintuitive fix that finally worked.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;You've spent weeks refining your AI code-review skill. You've added explicit rules. You've rewritten the checklist. You've added mandatory language: &lt;em&gt;"Execute ALL checklist categories regardless of how many High findings have already been identified."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The next week, a Medium-severity performance issue slips through again.&lt;/p&gt;

&lt;p&gt;The model had found 4 High-severity concurrency bugs in the same function. It was warned. The rule was right there in its context. It skipped the rest of the checklist anyway.&lt;/p&gt;

&lt;p&gt;Here's the hard truth we learned after many rounds of iteration: &lt;strong&gt;you're not dealing with a prompting problem. You're dealing with an architecture problem.&lt;/strong&gt; And no amount of prompt engineering will fix an architecture problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bug That Kept Getting Missed
&lt;/h2&gt;

&lt;p&gt;Consider this Go function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;getBatchUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;userKeys&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;UserKey&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;userList&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="nb"&gt;make&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WaitGroup&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;userKeys&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;wg&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Done&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetGuest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WarnContextf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"no found guest user: %v"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;userList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;userList&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;userList&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are multiple issues here. Our code-review skill correctly found the four obvious High-severity ones — compile errors, data race, goroutine leak, loop variable capture. But the full issue count, once all rounds of validation completed, was 13. The single skill captured 8. That's a &lt;strong&gt;62% detection rate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The five missed findings included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;defer recover()&lt;/code&gt; in goroutine&lt;/strong&gt; — an unhandled panic inside &lt;code&gt;go func()&lt;/code&gt; terminates the entire process, not just the goroutine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unbounded goroutine spawning&lt;/strong&gt; — goroutine count scales linearly with &lt;code&gt;len(userKeys)&lt;/code&gt; with no semaphore, rate limit, or worker pool; a large batch exhausts memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Missing &lt;code&gt;wg.Wait()&lt;/code&gt;&lt;/strong&gt; — the function returns before any goroutine completes, making &lt;code&gt;userList&lt;/code&gt; always empty and the return value meaningless&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slice without pre-allocation&lt;/strong&gt; — &lt;code&gt;make([]*User, 0)&lt;/code&gt; with a known upper bound &lt;code&gt;len(userKeys)&lt;/code&gt; causes repeated reallocation in the hot path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silent error discard&lt;/strong&gt; — errors are logged but not propagated; the caller receives &lt;code&gt;nil, nil&lt;/code&gt; and has no way to know which users failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these were exotic edge cases. All were in the skill's explicit checklist. When we pointed out the most instructive miss — slice pre-allocation — the model acknowledged it immediately:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"After 4 High-severity concurrent defects consumed my attention, I was not careful enough walking through the Performance checklist and mistakenly categorized this as 'a minor issue that can be ignored' without formally reporting it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model &lt;em&gt;knew&lt;/em&gt; the rule. It had the checklist. It still didn't apply it. This pattern — High-severity findings crowding out Medium-severity ones across multiple dimensions — reproduced consistently across test cases. It wasn't random variance. It was structural.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Prompts Can't Fix This
&lt;/h2&gt;

&lt;p&gt;When an AI model handles 5 review dimensions in a single call, all that knowledge coexists in one context window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Single Agent's Context Window]
┌─────────────────────────────────────────┐
│ Security rules (SQL injection, ...)     │
│ Concurrency rules (races, leaks...)     │
│ Performance rules (pre-alloc, ...)      │  ← squeezed out
│ Error-handling rules (wrap, nil...)     │  ← squeezed out
│ Quality rules (naming, structure...)    │  ← squeezed out
│                                         │
│ Findings found so far:                  │
│ ├── [High] compile error    ←─────┐     │
│ ├── [High] data race        ←─────┤ attention here
│ ├── [High] goroutine leak   ←─────┤     │
│ └── [High] loop capture     ←─────┘     │
│                                         │
│ Performance checklist:                  │
│   Slice Pre-allocation → ??? (skipped)  │  ← insufficient attention
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We tried every reasonable prompt fix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mitigation&lt;/th&gt;
&lt;th&gt;Effect&lt;/th&gt;
&lt;th&gt;Limitation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"Execute ALL checklist categories regardless of High findings"&lt;/td&gt;
&lt;td&gt;Partially effective&lt;/td&gt;
&lt;td&gt;The rule itself competes for attention in the same context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory note: "High findings must not cause skipping"&lt;/td&gt;
&lt;td&gt;Helps next session&lt;/td&gt;
&lt;td&gt;Does not fix multi-dimension competition in the current call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stronger mandatory language + repetition&lt;/td&gt;
&lt;td&gt;Limited improvement&lt;/td&gt;
&lt;td&gt;LLM attention allocation is probabilistic; instructions can't override it&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each fix reduced the miss rate somewhat. None eliminated it. The ceiling was about 67% for the model-execution class of misses — documented across multiple real cases. The remaining 33% persisted no matter how strongly we phrased the instruction.&lt;/p&gt;

&lt;p&gt;This is not a prompting problem. The model's attention is finite and shared across everything in the context window. When High findings accumulate, they dominate attention at inference time. This is structural.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Multi-Agent Is the Right Direction
&lt;/h2&gt;

&lt;p&gt;The evolution here mirrors what happened in software engineering when monolithic codebases grew too large to maintain:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Software Evolution&lt;/th&gt;
&lt;th&gt;AI Agent Evolution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monolith codebase too large to maintain&lt;/td&gt;
&lt;td&gt;Single agent context window accumulates too much, performance degrades&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single-point failure affects the whole system&lt;/td&gt;
&lt;td&gt;One dimension's High findings contaminate the entire review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cannot scale modules independently&lt;/td&gt;
&lt;td&gt;Cannot choose optimal model per task type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Responsibility boundaries blurry&lt;/td&gt;
&lt;td&gt;Agent role confusion degrades output quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Just as large monolithic applications eventually need microservices, a monolithic agent needs vertical specialization when the task is complex enough.&lt;/p&gt;

&lt;p&gt;A Multi-Agent architecture means multiple AI agents collaborate under clear role assignments — &lt;strong&gt;each with its own context window, a dedicated toolset, and well-defined responsibilities&lt;/strong&gt;. For Go code review, this maps to four concrete advantages:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Advantage&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;What it means here&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Focused context window&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Each sub-agent runs in a fresh, clean context uncontaminated by other dimensions' findings&lt;/td&gt;
&lt;td&gt;Concurrency finding 4 High issues does not affect Performance's sensitivity to &lt;code&gt;make([]*User, 0)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deep specialization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Each agent's prompt focuses on a single domain with a minimal toolset&lt;/td&gt;
&lt;td&gt;Security agent sees only security defects; no need to juggle five dimensions at once&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Multi-perspective quality assurance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple agents evaluate independently, unaware of each other's findings&lt;/td&gt;
&lt;td&gt;Cross-dimension cross-validation, not just serial checklists&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexible model assignment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lead uses a stronger model for triage and aggregation; workers use faster models for review&lt;/td&gt;
&lt;td&gt;Triage + deduplication with Sonnet; workers with Haiku to control cost&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Anthropic's internal research provides quantitative support: in the BrowseComp benchmark, &lt;strong&gt;token usage alone explained 80% of performance variance&lt;/strong&gt; across agents. The key factor wasn't model capability — it was how much "clean context" each agent had to work with. Context contamination degrades single-agent performance in a measurable, predictable way.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing the Right Orchestration Pattern
&lt;/h2&gt;

&lt;p&gt;Once you've decided to go Multi-Agent, the next question is: which orchestration pattern?&lt;/p&gt;

&lt;p&gt;Anthropic defines five foundational patterns. We evaluated all five against the Go code-review scenario before settling on one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Core Mechanism&lt;/th&gt;
&lt;th&gt;Assessment for this scenario&lt;/th&gt;
&lt;th&gt;Fit?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1. Prompt Chaining&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Linear step sequence; each step's output feeds the next&lt;/td&gt;
&lt;td&gt;Security/concurrency/performance dimensions have no sequential dependencies — not a sequencing problem&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2. Routing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Classify input, route to one specialized handler&lt;/td&gt;
&lt;td&gt;A single review must cover multiple dimensions simultaneously, not pick one&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3. Parallelization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multiple parallel paths; subtasks &lt;strong&gt;fixed at design time&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Close to what's needed, but fixed subtasks mean all branches always run — can't prune based on content&lt;/td&gt;
&lt;td&gt;△&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4. Orchestrator-Workers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Central orchestrator &lt;strong&gt;dynamically&lt;/strong&gt; decomposes tasks, dispatches workers on demand&lt;/td&gt;
&lt;td&gt;Best match — review dimensions are determined by code content at runtime&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5. Evaluator-Optimizer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Generate → evaluate → refine iterative loop&lt;/td&gt;
&lt;td&gt;Code review is a diagnostic task, not an iterative generation task&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key distinction is between Pattern 3 and Pattern 4. Both support parallelism. The difference is where subtasks come from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Parallelization (Pattern 3):
  Code → [Fixed dispatch: Security + Performance + Quality + Logic + ...] → Aggregate
  Subtasks are fixed at design time; every review runs all N paths

Orchestrator-Workers (Pattern 4):
  Code → [Lead Agent analyzes diff] → Dynamic decision → Dispatch K paths (K ≤ N) → Aggregate
  Subtasks are decided at runtime based on code content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Which agents to dispatch depends on what the code actually contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code only renames variables → Quality + Logic (2 agents)&lt;/li&gt;
&lt;li&gt;Code introduces &lt;code&gt;go func&lt;/code&gt; + &lt;code&gt;sync.WaitGroup&lt;/code&gt; → also Concurrency + Error (4 agents)&lt;/li&gt;
&lt;li&gt;Code contains &lt;code&gt;make([]*T, 0)&lt;/code&gt; + batch function names → also Performance (5 agents)&lt;/li&gt;
&lt;li&gt;Code has &lt;code&gt;_test.go&lt;/code&gt; changes → also Test (6 agents)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This "content-driven dimension selection" cannot be known at design time. The orchestrator must decide dynamically at runtime — exactly the scenario Anthropic defines as the Orchestrator-Workers applicable case: &lt;em&gt;"Cannot predict which subtasks will be needed in advance; the Orchestrator must decide dynamically based on input."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Forcing Pattern 3 means launching all 7 agents on every review. A 5-line variable rename incurs the same token cost as a full concurrency + security audit. Triage is the orchestrator's core value.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Skill-Agent Collaboration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                      PR Diff / Code Snippet
                             │
                             ↓
              [Main conversation + go-review-lead Skill]
                   Role: triage + dispatch + aggregation
                   Does NOT load vertical review Skills
                   Does NOT directly review code
                             │
                     Phases 1-4: Triage
                     grep + pattern matching → which dimensions?
                             │
    ┌────────┬────────┬───────┼───────┬────────┬────────┐
    ↓        ↓        ↓       ↓       ↓        ↓        ↓
[Security][Concurr][Perf] [Error] [Quality] [Test] [Logic]
  Agent    Agent   Agent  Agent   Agent    Agent   Agent
    │        │       │      │       │        │       │
  Load     Load    Load   Load    Load     Load    Load
 security concurr  perf  error  quality   test   logic
  Skill    Skill   Skill  Skill   Skill    Skill   Skill
    │        │       │      │       │        │       │
 Review   Review  Review Review  Review   Review  Review
          independently in each clean context
    └────────┴────────┴───────┴───────┴────────┴────────┘
                             │
                             ↓
                  Main conversation aggregates
              Merge findings + deduplicate + sort by severity
                             │
                             ↓
                         Final report
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three architecture options were considered. Two simpler ones failed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Characteristics&lt;/th&gt;
&lt;th&gt;Known problems&lt;/th&gt;
&lt;th&gt;Recommended?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A: Single Skill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1 agent, all review knowledge, one call&lt;/td&gt;
&lt;td&gt;Attention dilution; High findings suppress other dimensions; proven misses&lt;/td&gt;
&lt;td&gt;Basic scenarios only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;B: Multi-Agent, no Skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7 agents, prompt-only, no Skills loaded&lt;/td&gt;
&lt;td&gt;Clean context, but no domain review rules; relies on AI general knowledge&lt;/td&gt;
&lt;td&gt;Not recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;C: Multi-Agent + vertical Skills&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Main conversation orchestrates via Skill; 7 workers each load one domain Skill&lt;/td&gt;
&lt;td&gt;Slightly higher design cost&lt;/td&gt;
&lt;td&gt;✅ Recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Core Design Principles
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Principle 1: Each agent loads exactly one dimension's Skill.&lt;/strong&gt;&lt;br&gt;
A Performance Agent's context contains only performance-related knowledge and the code under review — no other dimensions' rules, no other agents' findings. This significantly raises the probability that the model focuses its attention on the Performance checklist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Principle 2: The orchestrator does not review code.&lt;/strong&gt;&lt;br&gt;
After loading &lt;code&gt;go-review-lead&lt;/code&gt;, the main conversation acts as a neutral coordinator — triage and aggregation only. If the orchestrator also reviewed code, its own findings would bias its aggregation of workers' results, recreating the same attention-competition problem as the heavy single skill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Principle 3: The orchestration logic must be a Skill in the main conversation, not an agent definition.&lt;/strong&gt;&lt;br&gt;
Claude Code subagents cannot spawn other subagents. If &lt;code&gt;go-review-lead&lt;/code&gt; were configured as an agent definition file, its parallel dispatch calls to the 7 vertical agents would be silently ignored — they'd degrade to serial execution or not run at all. The orchestration Skill runs in the main conversation, not in &lt;code&gt;.claude/agents/&lt;/code&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Back to the Case: Why Misses Are Less Likely
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;th&gt;Context contains&lt;/th&gt;
&lt;th&gt;What it finds in this case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency Agent&lt;/td&gt;
&lt;td&gt;only concurrency rules + code&lt;/td&gt;
&lt;td&gt;4 High findings (races, leaks, loop capture, &lt;code&gt;wg.Wait()&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance Agent&lt;/td&gt;
&lt;td&gt;only performance rules + code&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;make([]*User, 0)&lt;/code&gt; pre-allocation miss — significantly less likely to be crowded out&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error Agent&lt;/td&gt;
&lt;td&gt;only error-handling rules + code&lt;/td&gt;
&lt;td&gt;silent error discard, &lt;code&gt;continue&lt;/code&gt; inside goroutine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality Agent&lt;/td&gt;
&lt;td&gt;only quality rules + code&lt;/td&gt;
&lt;td&gt;unused variables, naming issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logic Agent&lt;/td&gt;
&lt;td&gt;only logic rules + code&lt;/td&gt;
&lt;td&gt;return contract violation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lead (orchestrator)&lt;/td&gt;
&lt;td&gt;only the 5 structured reports&lt;/td&gt;
&lt;td&gt;merge, deduplicate, sort by severity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Performance Agent does not need to be aware of those 4 High concurrency bugs. Its context holds only the Performance checklist and the code. For checklist items like Slice Pre-allocation, this isolation substantially reduces the risk of attention being captured by severity-dominant findings from other dimensions. It does not make misses impossible — but it makes them significantly less likely and more systematic to diagnose when they do occur.&lt;/p&gt;


&lt;h2&gt;
  
  
  Skills and Agents: Not the Same Thing
&lt;/h2&gt;

&lt;p&gt;Two concepts are worth distinguishing before you implement this, because they are easy to conflate:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Agent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A knowledge package — workflow, checklist, reference rules&lt;/td&gt;
&lt;td&gt;An execution unit — an LLM instance with its own context window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Answers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"What to do" and "how to do it"&lt;/td&gt;
&lt;td&gt;"Who does it" and "where it runs"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lives in&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SKILL.md&lt;/code&gt; + &lt;code&gt;references/&lt;/code&gt; + &lt;code&gt;scripts/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Agent definition file (role + tools + which Skill to load)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Core value&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encodes domain expertise; makes AI exceed general capability&lt;/td&gt;
&lt;td&gt;Provides execution isolation; each task runs in a clean context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A Skill without an Agent is expertise that still runs in a polluted context. An Agent without a Skill is isolation without domain knowledge. Architecture C combines both.&lt;/p&gt;

&lt;p&gt;At runtime, an agent loads its Skill on demand — it does not copy the Skill's content into its own definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Performance Agent (independent context window) starts
  │
  │ Reads its definition: "load go-performance-review Skill"
  │ Calls Skill("go-performance-review")
  │                    ↓
  │         SKILL.md checklist and rules load into current context
  │
  │ Context now contains:
  │   ✓ Performance checklist (from Skill)
  │   ✓ Code under review
  │   ✗ Concurrency rules (not loaded)
  │   ✗ Other agents' findings (isolated)
  │
  ↓
  Executes review → returns structured result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agent definition files stay lightweight (role + tools + which Skill to load). Skill files stay independent and reusable across agents. No duplication.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Counterintuitive Finding
&lt;/h2&gt;

&lt;p&gt;With the architecture designed, we ran a controlled experiment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Findings captured&lt;/th&gt;
&lt;th&gt;Detection rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single Agent&lt;/td&gt;
&lt;td&gt;Opus 4&lt;/td&gt;
&lt;td&gt;8/13&lt;/td&gt;
&lt;td&gt;62%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Agent Orchestrator-Workers&lt;/td&gt;
&lt;td&gt;Sonnet 4 Workers&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13/13&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A fleet of mid-tier Sonnet agents outperformed a single top-tier Opus agent. Not because Sonnet is "smarter" — Opus genuinely outperforms Sonnet on a focused single task. The difference is task structure.&lt;/p&gt;

&lt;p&gt;When Opus handles 5 dimensions simultaneously, attention dilution systematically degrades its per-dimension performance. When Sonnet handles only &lt;em&gt;one&lt;/em&gt; dimension in a clean context, it operates near full focus with no cross-dimension competition. &lt;strong&gt;Sonnet × N focused agents can outperform Opus × 1 generalist agent&lt;/strong&gt; on multi-dimensional tasks.&lt;/p&gt;

&lt;p&gt;This changes the question you should be asking. The old question: &lt;em&gt;"Which is the most powerful model I should use?"&lt;/em&gt; The better question: &lt;em&gt;"Can I restructure my task so each agent only needs to excel at one thing?"&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Experiments, Three Lessons
&lt;/h2&gt;

&lt;p&gt;Before the final architecture stabilized, we ran three validation rounds on the same &lt;code&gt;getBatchUser&lt;/code&gt; function. Each round taught us something unexpected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 1: Single Skill Baseline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Result: 8/13 (62%) — 5 findings missed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The single &lt;code&gt;go-code-reviewer&lt;/code&gt; skill found 4 High-severity concurrency bugs but missed the full set of Medium-severity findings: slice pre-allocation, unbounded goroutine spawning, no panic recovery in goroutines, missing &lt;code&gt;wg.Wait()&lt;/code&gt;, and silent error discard. The model acknowledged the misses when prompted. Architecture refactoring was warranted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 2: Multi-Agent v1 (Without Grep Gating)
&lt;/h3&gt;

&lt;p&gt;We split the single skill into 7 vertical agents — Security, Concurrency, Performance, Error, Quality, Test, Logic — each running in a clean independent context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result: Unstable — sometimes captured, sometimes not&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two new problems emerged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1: Triage blind spot.&lt;/strong&gt; The orchestrator's Phase 2 trigger for the Performance agent was looking for &lt;code&gt;make&lt;/code&gt; calls &lt;em&gt;with&lt;/em&gt; a capacity argument. &lt;code&gt;make([]*User, 0)&lt;/code&gt; — the case &lt;em&gt;without&lt;/em&gt; a capacity argument — was the very pattern we needed to catch. The trigger fired in reverse. The Performance agent was never dispatched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2: Within-dimension attention dilution.&lt;/strong&gt; Even with the Concurrency agent running in its own clean context, when that context contained 4 High-severity compile errors and data races, "unbounded goroutine creation" (Medium-severity) still got deprioritized. Isolated contexts solved &lt;em&gt;cross-dimension&lt;/em&gt; dilution. They did &lt;em&gt;not&lt;/em&gt; solve &lt;em&gt;within-dimension&lt;/em&gt; dilution when severity disparity was high enough.&lt;/p&gt;

&lt;p&gt;The actual report came back with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Skipped skills: go-performance-reviewer (no hot-path loops or DB patterns)
                              ← triage blind spot caused the skip

Residual Risk:
  Unbounded goroutine spawning: Not flagged as a finding since expected
  batch size is unknown          ← buried in Residual Risk, not formally reported
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Architecture refactoring alone wasn't enough. We needed both isolated contexts &lt;em&gt;and&lt;/em&gt; a mechanism that didn't depend on attention to walk the checklist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Round 3: Multi-Agent + Grep Gating
&lt;/h3&gt;

&lt;p&gt;Two fixes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fix the triage&lt;/strong&gt;: rewrite the Phase 2 trigger to detect zero-capacity &lt;code&gt;make&lt;/code&gt; explicitly, and add a Phase 3 heuristic — function names containing &lt;code&gt;Batch&lt;/code&gt;, &lt;code&gt;Multi&lt;/code&gt;, or &lt;code&gt;GetAll&lt;/code&gt; automatically trigger the Performance agent. &lt;code&gt;getBatchUser&lt;/code&gt; matches this heuristic immediately.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Introduce the Grep-Gated protocol&lt;/strong&gt;: for checklist items with clear syntactic features, run a mechanical &lt;code&gt;grep&lt;/code&gt; scan &lt;em&gt;before&lt;/em&gt; asking the model to reason about them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
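&lt;p&gt;The first fix can be sketched in a few lines of Python. This is a hedged approximation: the regex patterns and names here are our own invention for illustration, not the repo's actual triage code.&lt;/p&gt;

```python
import re

# Phase 2: zero-capacity make, i.e. the capacity argument is ABSENT.
# (Round 2's bug was triggering on the presence of a capacity instead.)
ZERO_CAP_MAKE = re.compile(r"make\(\[\][^,)]*,\s*0\s*\)")  # make([]T, 0) but not make([]T, 0, n)
# Phase 3: batch-style function names always warrant a performance pass.
BATCH_NAME = re.compile(r"func\s+\w*(Batch|Multi|GetAll)\w*\s*\(")

def needs_performance_agent(source: str) -> bool:
    return bool(ZERO_CAP_MAKE.search(source) or BATCH_NAME.search(source))

print(needs_performance_agent("func getBatchUser(ids []int64) {"))     # True: name heuristic
print(needs_performance_agent("users := make([]*User, 0)"))            # True: zero-capacity make
print(needs_performance_agent("users := make([]*User, 0, len(ids))"))  # False: capacity given
```

&lt;p&gt;Note the third case: the pattern refuses to fire when a capacity argument is present, which is precisely the inversion of the Round 2 blind spot.&lt;/p&gt;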

&lt;p&gt;&lt;strong&gt;Result: 13/13 (100%) — stable across repeated runs&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Grep-Gated Execution Protocol
&lt;/h2&gt;

&lt;p&gt;This is the mechanism that made Round 3 work, and it's worth explaining carefully.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;the model is not a human code reviewer. It has tools it can use.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Human reviewers scan with their eyes and rely on attention to find matches. We were asking the AI to do the same — use "attention" to walk through a checklist and search for matches. But &lt;code&gt;grep&lt;/code&gt; is deterministic. It doesn't get tired. It doesn't have "attention."&lt;/p&gt;

&lt;p&gt;The execution flow for each sub-agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Load the domain Skill (checklist + rules + grep patterns)
2. Write code to $TMPDIR/review_snippet.go
3. For all grep-gated checklist items → run grep with patterns from the Skill
4. grep HIT  → model performs semantic confirmation (true positive vs false positive)
5. grep MISS → automatically mark NOT FOUND, skip semantic analysis
6. Items without grep patterns (pure semantic) → full model reasoning
7. Report only FOUND items
8. Audit line: "Grep pre-scan: X/Y items hit, Z confirmed"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
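&lt;p&gt;Steps 3 to 6 of that flow can be sketched in Python (the checklist-item structure and IDs here are assumptions for illustration, not the published skill format):&lt;/p&gt;

```python
import re

# Minimal sketch of the gate: items with a grep pattern are scanned
# mechanically; only hits reach the model, and pattern-less items always do.
def run_checklist(items, source, semantic_confirm):
    report, hits = [], 0
    for item in items:
        pattern = item.get("pattern")
        if pattern is not None:
            if not re.search(pattern, source):
                continue                    # grep MISS: mark NOT FOUND, no model call
            hits += 1                       # grep HIT: goes to semantic confirmation
        if semantic_confirm(item, source):  # pure-semantic items get full reasoning
            report.append(item["id"])
    return report, hits

items = [
    {"id": "REV-009", "pattern": r"make\(\[\][^,)]*,\s*0\s*\)"},  # slice pre-allocation
    {"id": "REV-014", "pattern": r"\bpanic\("},                   # bare panic
    {"id": "REV-021", "pattern": None},                           # semantic-only logic check
]
source = "users := make([]*User, 0)"
# Stand-in for the model: confirm every candidate it is shown.
report, hits = run_checklist(items, source, lambda item, src: True)
print(report, hits)  # → ['REV-009', 'REV-021'] 1
```

&lt;p&gt;The model is consulted exactly twice: once to confirm the grep hit, once for the pattern-less item. The miss on the &lt;code&gt;panic&lt;/code&gt; pattern costs zero attention.&lt;/p&gt;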



&lt;p&gt;Coverage across 7 skills, 86 checklist items:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Skill&lt;/th&gt;
&lt;th&gt;Total Items&lt;/th&gt;
&lt;th&gt;Grep-able&lt;/th&gt;
&lt;th&gt;Semantic Only&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;go-concurrency-review&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;go-performance-review&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;go-error-review&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;go-security-review&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;go-quality-review&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;go-test-review&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;go-logic-review&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;86&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;65 (75%)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21 (25%)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;75% of checklist items are now mechanically pre-scanned.&lt;/strong&gt; The model's attention is reserved for the remaining 25% of genuinely semantic items — and for confirming grep hits rather than searching for them.&lt;/p&gt;

&lt;p&gt;The slice pre-allocation miss, which survived two rounds of architecture improvements, was caught in Round 3 via:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Grep hit: make([]*User, 0) at L10.
Function name getBatchUser signals batch hot path.
→ REV-009 [Medium] formally reported.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A grep pattern doesn't get its attention crowded out by High-severity findings. That's the point.&lt;/p&gt;

&lt;p&gt;One design principle worth noting: the protocol uses a &lt;strong&gt;wide-net&lt;/strong&gt; strategy — prefer false-positive grep hits over false-negative misses. A false positive costs one extra semantic confirmation. A false negative means the issue is permanently gone. Pattern design should err toward broader matches.&lt;/p&gt;
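&lt;p&gt;A small Python illustration of that trade-off (both patterns are ours, invented for this example): the narrow pattern anchors on an exact type and spacing, so a harmless formatting variant slips past it, while the wide one catches the variant at the cost of also hitting a comment that the semantic-confirmation step then rejects:&lt;/p&gt;

```python
import re

narrow = re.compile(r"make\(\[\]\*User, 0\)")        # exact type and spacing only
wide   = re.compile(r"make\(\[\][^,)]*,\s*0\s*\)")   # tolerant of type and spacing

cases = [
    "users := make([]*User, 0)",             # true positive: both hit
    "users := make([]*User,0)",              # formatting variant: narrow misses it
    "// never write make([]*User, 0) here",  # comment: a hit for semantic review to reject
]
for c in cases:
    print(bool(narrow.search(c)), bool(wide.search(c)))
```

&lt;p&gt;The comment hit costs one confirmation call; the variant the narrow pattern misses would have been gone for good.&lt;/p&gt;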




&lt;h2&gt;
  
  
  Cost vs. Quality
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Simple style PR&lt;/th&gt;
&lt;th&gt;Complex concurrency PR&lt;/th&gt;
&lt;th&gt;Full-scope refactor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;All 7 agents (no triage)&lt;/td&gt;
&lt;td&gt;~$0.16&lt;/td&gt;
&lt;td&gt;~$0.16&lt;/td&gt;
&lt;td&gt;~$0.16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Triage + on-demand dispatch&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$0.02&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$0.07&lt;/td&gt;
&lt;td&gt;~$0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Original single skill&lt;/td&gt;
&lt;td&gt;~$0.03&lt;/td&gt;
&lt;td&gt;~$0.03 (but misses)&lt;/td&gt;
&lt;td&gt;~$0.03 (but misses)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On simple PRs, triage saves ~87% of cost versus running everything. On complex PRs, cost is comparable to the full fleet — but quality is significantly better than the single-skill approach. The triage cost itself (Level 1 file-type grep + Level 2 fast model diff scan) runs under $0.001 per call — negligible.&lt;/p&gt;
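&lt;p&gt;For the skeptical, a quick check of the arithmetic using the numbers in the table above:&lt;/p&gt;

```python
# Per-PR costs from the cost table above.
full_fleet = 0.16       # all 7 agents, no triage
triaged_simple = 0.02   # triage + on-demand dispatch, simple style PR
triaged_complex = 0.07  # triage + on-demand dispatch, complex concurrency PR

savings_simple = 1 - triaged_simple / full_fleet
savings_complex = 1 - triaged_complex / full_fleet
print(f"simple PR:  {savings_simple:.1%} saved")   # 87.5% saved
print(f"complex PR: {savings_complex:.1%} saved")
```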




&lt;h2&gt;
  
  
  The Broader Principle
&lt;/h2&gt;

&lt;p&gt;We learned something that has changed how we think about AI engineering decisions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For multi-dimensional tasks, the limiting factor is not model capability — it's context organization.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Opus in a polluted context loses to Sonnet in an isolated one. More compute applied to the wrong architecture doesn't solve the problem; it makes it more expensive. You don't need to wait for the next-generation model to fix systematic misses — architecture refactoring works on the models you already have, and it's more controllable and more predictable than hoping a stronger model pays more attention.&lt;/p&gt;

&lt;p&gt;The decision framework shifts: from "which model should I use?" to "how should I structure the task so each agent only needs to succeed at one thing?"&lt;/p&gt;




&lt;h2&gt;
  
  
  The Implementation Is Open Source
&lt;/h2&gt;

&lt;p&gt;Everything described in this article is published and deployable. The directory layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;skills/&lt;/span&gt;
&lt;span class="s"&gt;├── go-review-lead/SKILL.md&lt;/span&gt;           &lt;span class="c1"&gt;# orchestration logic — runs in main conversation&lt;/span&gt;
&lt;span class="s"&gt;├── go-security-review/SKILL.md&lt;/span&gt;       &lt;span class="c1"&gt;# SQL injection, XSS, key leakage, permissions&lt;/span&gt;
&lt;span class="s"&gt;├── go-concurrency-review/SKILL.md&lt;/span&gt;    &lt;span class="c1"&gt;# races, goroutine leaks, deadlocks, WaitGroup&lt;/span&gt;
&lt;span class="s"&gt;│   └── references/go-concurrency-patterns.md&lt;/span&gt;
&lt;span class="s"&gt;├── go-performance-review/SKILL.md&lt;/span&gt;    &lt;span class="c1"&gt;# pre-allocation, N+1, indexes, memory&lt;/span&gt;
&lt;span class="s"&gt;│   └── references/go-performance-patterns.md&lt;/span&gt;
&lt;span class="s"&gt;├── go-error-review/SKILL.md&lt;/span&gt;          &lt;span class="c1"&gt;# error wrapping, resource close, panic handling&lt;/span&gt;
&lt;span class="s"&gt;├── go-quality-review/SKILL.md&lt;/span&gt;        &lt;span class="c1"&gt;# naming, structure, lint rules&lt;/span&gt;
&lt;span class="s"&gt;├── go-test-review/SKILL.md&lt;/span&gt;           &lt;span class="c1"&gt;# coverage, assertion quality, test isolation&lt;/span&gt;
&lt;span class="s"&gt;└── go-logic-review/SKILL.md&lt;/span&gt;          &lt;span class="c1"&gt;# business logic, boundaries, nil propagation&lt;/span&gt;

&lt;span class="s"&gt;.claude/agents/&lt;/span&gt;                       &lt;span class="c1"&gt;# 7 vertical worker agents — drop in and use&lt;/span&gt;
&lt;span class="s"&gt;├── go-security-reviewer.md&lt;/span&gt;
&lt;span class="s"&gt;├── go-concurrency-reviewer.md&lt;/span&gt;
&lt;span class="s"&gt;├── go-performance-reviewer.md&lt;/span&gt;
&lt;span class="s"&gt;├── go-error-reviewer.md&lt;/span&gt;
&lt;span class="s"&gt;├── go-quality-reviewer.md&lt;/span&gt;
&lt;span class="s"&gt;├── go-test-reviewer.md&lt;/span&gt;
&lt;span class="s"&gt;└── go-logic-reviewer.md&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To deploy: copy the &lt;code&gt;skills/&lt;/code&gt; directories to &lt;code&gt;~/.claude/skills/&lt;/code&gt; (user-level) or &lt;code&gt;.claude/skills/&lt;/code&gt; (project-level), copy the agent definition files to &lt;code&gt;.claude/agents/&lt;/code&gt;, then invoke &lt;code&gt;go-review-lead&lt;/code&gt; from the main conversation. The deployment guide with prerequisites and usage examples is at &lt;code&gt;outputexample/go-review-lead/README.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The full methodology — skill design, quantitative A/B evaluation, golden test fixtures, zero-LLM regression tests, and the iteration framework that produced these results — is at:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/johnqtcg/awesome-skills" rel="noopener noreferrer"&gt;github.com/johnqtcg/awesome-skills&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The 29 production-ready skills and 42 paired evaluation reports (EN + ZH) are the examples; the methodology is the deliverable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you've hit the same wall — model keeps missing things despite explicit rules — the diagnosis is probably the same one we found. The context window is not infinitely attentive. Architecture is the lever that prompts can't reach.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>codequality</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
