How I Built a Multi-Agent Code Review Pipeline
I was mass-approving PRs at 11pm on a Thursday. Not reading them. Just scrolling, checking if tests passed, hitting approve. We've all been there. That's when I decided to build something that would at least catch the obvious stuff before a human ever looked at it.
TL;DR
| What | How | Why it matters |
|---|---|---|
| Agent 1: Style checker | Runs on PR open via GitHub Actions | Catches formatting, naming, lint issues before review |
| Agent 2: Logic reviewer | Reads diff + context files | Flags potential bugs, missing edge cases |
| Agent 3: Security scanner | Pattern matches against OWASP top 10 | Catches SQL injection, XSS, hardcoded secrets |
| Orchestrator | Coordinates agents, posts summary comment | One clean summary instead of three noisy bots |
The problem with single-agent review
Most people who try AI code review start with one big prompt. Something like "review this PR for quality, security, and style." It works okay for small diffs. For anything over 200 lines, the output gets generic. You get the same five comments every time: "consider adding error handling," "this could be more descriptive," "consider edge cases." Not useful.
I realized I was making the same mistake with AI that bad managers make with people. You don't ask one person to check style, logic, and security at the same time. You split the work.
Single agent approach:
PR Diff ──► [ One Big Agent ] ──► Generic comments
"consider error handling"
"this could be more descriptive"
"consider edge cases"
(same 5 comments every time)
Multi-agent approach:
┌─► [ Style Agent ] ──► "Line 42: nested if, use early return"
│
PR Diff ──┼─► [ Logic Agent ] ──► "Line 87: null ref if user.org is undefined"
│
└─► [ Security Agent] ──► "Line 23: CRITICAL - SQL string concat"
│
▼
[ Orchestrator ] ──► Single PR comment, deduplicated
The architecture
Three specialized agents, one orchestrator. Each agent gets a narrow job and specific instructions about what to flag and what to ignore.
┌──────────────────────────────────────────────────────────┐
│ GitHub Actions │
│ │
│ PR Opened / Synchronized │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ Get Diff │ │
│ └────┬─────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Orchestrator (Node.js) │ │
│ │ │ │
│ │ Promise.all([ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌──────────────┐ │ │
│ │ │ Style │ │ Logic │ │ Security │ │ │
│ │ │ Haiku │ │ Sonnet │ │ Sonnet │ │ │
│ │ │ $0.002 │ │ $0.04 │ │ $0.03 │ │ │
│ │ └────┬────┘ └────┬────┘ └──────┬───────┘ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌────────────────────────────────────────┐ │ │
│ │ │ Deduplicate + Format + Post Comment │ │ │
│ │ └────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Exit 1 if CRITICAL/HIGH findings (blocks merge) │
└──────────────────────────────────────────────────────────┘
The GitHub Actions workflow:
```yaml
# .github/workflows/ai-review.yml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > pr_diff.patch
          echo "diff_file=pr_diff.patch" >> $GITHUB_OUTPUT

      - name: Run review agents
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          node scripts/run-review.js \
            --diff ${{ steps.diff.outputs.diff_file }} \
            --pr ${{ github.event.pull_request.number }}
```
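The workflow assumes `scripts/run-review.js` accepts those two flags. The entry point isn't shown in this post, so here's a minimal sketch of one way to parse them (written in TypeScript like the rest of the snippets; everything beyond the `--diff`/`--pr` flag names is my assumption):

```typescript
// Minimal CLI entry sketch. Assumes flags arrive as "--name value"
// pairs, matching how the workflow invokes the script.
function parseArgs(argv: string[]): { diff: string; pr: number } {
  const args: Record<string, string> = {};
  for (let i = 0; i < argv.length; i += 2) {
    args[argv[i].replace(/^--/, "")] = argv[i + 1];
  }
  if (!args.diff || !args.pr) {
    throw new Error("Usage: run-review --diff <file> --pr <number>");
  }
  return { diff: args.diff, pr: Number(args.pr) };
}

// In run-review.js you'd then call the orchestrator:
//   const { diff, pr } = parseArgs(process.argv.slice(2));
//   runReview(diff, pr);
```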
The style checker (cheapest agent, runs on Haiku)
This one is boring on purpose. It reads the diff and checks against your team's conventions. Not linting (your CI already does that). More like "we use early returns, not nested ifs" or "we name boolean variables with is/has/should prefixes."
```typescript
import Anthropic from "@anthropic-ai/sdk";

// One shared client; reads ANTHROPIC_API_KEY from the environment.
const anthropic = new Anthropic();

const styleAgent = {
  model: "claude-haiku-4-5-20251001",
  system: `You review code diffs for style consistency.
Rules:
- Early returns over nested conditionals
- Boolean vars start with is/has/should/can
- Max function length: 40 lines
- No default exports
Only flag violations. Do not suggest improvements
beyond the rules listed. If the diff follows all rules,
respond with "No style issues found."`,

  reviewDiff: async (diff: string) => {
    const response = await anthropic.messages.create({
      model: styleAgent.model,
      max_tokens: 1024,
      system: styleAgent.system,
      messages: [{ role: "user", content: `Review this diff:\n${diff}` }],
    });
    return parseFindings(response);
  },
};
```
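`parseFindings` never appears in the post, so here's a minimal sketch of one way it could work. Both the output format (`file:line [SEVERITY] message`, one finding per line) and the string input are my assumptions — the agents above pass the full API response, so in practice you'd first pull the text out of `response.content` before calling this:

```typescript
interface Finding {
  file: string;
  line: number;
  severity: "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
  message: string;
}

// Parses agent output of the (assumed) form:
//   src/api/users.ts:23 [HIGH] SQL injection risk: ...
// Lines that don't match (prose, "No style issues found") are ignored.
function parseFindings(text: string): Finding[] {
  const pattern = /^(.+?):(\d+)\s+\[(CRITICAL|HIGH|MEDIUM|LOW)\]\s+(.*)$/;
  const findings: Finding[] = [];
  for (const line of text.split("\n")) {
    const m = line.trim().match(pattern);
    if (m) {
      findings.push({
        file: m[1],
        line: Number(m[2]),
        severity: m[3] as Finding["severity"],
        message: m[4],
      });
    }
  }
  return findings;
}
```

Whatever format you pick, put it in every agent's system prompt so all three produce parseable output.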
I use Haiku for this one. It's fast, cheap, and style checking doesn't need deep reasoning. About $0.002 per review.
The logic reviewer (this is the one that earns its keep)
It reads the diff plus surrounding context (the full files being modified) and looks for actual bugs. This is where Haiku won't cut it.
What the logic agent sees:
┌─────────────────────────────────┐
│ Context: Full files being │
│ modified (so it understands │
│ the surrounding code) │
│ │
│ ┌─────────────────────────────┐ │
│ │ Diff: Only the changed │ │
│ │ lines (what it's reviewing) │ │
│ └─────────────────────────────┘ │
│ │
│ Focus areas: │
│ ○ Off-by-one errors │
│ ○ Null/undefined not handled │
│ ○ Race conditions in async │
│ ○ Missing error propagation │
│ ○ Unexpected state mutations │
└─────────────────────────────────┘
```typescript
import fs from "node:fs/promises";

const logicAgent = {
  model: "claude-sonnet-4-6-20250514",
  system: `You review code for logical correctness.
Focus on:
- Off-by-one errors
- Null/undefined not handled
- Race conditions in async code
- Missing error propagation
- State mutations in unexpected places
Ignore: style, formatting, naming.
For each issue, cite the exact line and explain
what could go wrong with a concrete example.`,

  reviewWithContext: async (diff: string, contextFiles: string[]) => {
    // Read every modified file in full so the model sees surrounding code
    const fileContents = await Promise.all(
      contextFiles.map((f) => fs.readFile(f, "utf-8"))
    );
    const context = contextFiles
      .map((f, i) => `--- ${f} ---\n${fileContents[i]}`)
      .join("\n\n");

    const response = await anthropic.messages.create({
      model: logicAgent.model,
      max_tokens: 2048,
      system: logicAgent.system,
      messages: [
        {
          role: "user",
          content: `Context files:\n${context}\n\nDiff to review:\n${diff}`,
        },
      ],
    });
    return parseFindings(response);
  },
};
```
Sonnet for this one. It catches things that surprise me. Last week it flagged a race condition in a WebSocket handler where two messages could arrive between an async read and write. I'd have missed that.
The security scanner
Honestly this one is the simplest. Pattern matching against common vulnerabilities. Not a replacement for a real security audit, but it catches the dumb stuff that slips through at 11pm.
```typescript
const securityAgent = {
  model: "claude-sonnet-4-6-20250514",
  system: `You scan code diffs for security vulnerabilities.
Check for:
- SQL injection (string concatenation in queries)
- XSS (unescaped user input in HTML)
- Hardcoded secrets, API keys, passwords
- Path traversal (user input in file paths)
- Insecure deserialization
- Missing auth checks on new endpoints
Severity levels: CRITICAL, HIGH, MEDIUM, LOW.
Only report findings with a severity level.
No general advice.`,

  // Same shape as the style agent; the orchestrator calls
  // reviewDiff on each agent it fans out to.
  reviewDiff: async (diff: string) => {
    const response = await anthropic.messages.create({
      model: securityAgent.model,
      max_tokens: 2048,
      system: securityAgent.system,
      messages: [{ role: "user", content: `Review this diff:\n${diff}` }],
    });
    return parseFindings(response);
  },
};
```
The severity levels matter. The orchestrator uses them to decide whether to block the merge. CRITICAL and HIGH = merge blocked. MEDIUM and LOW = warning in the comment but merge allowed.
Severity routing:
Finding
│
├── CRITICAL ──► Block merge + notify Slack
├── HIGH ──► Block merge
├── MEDIUM ──► Warning in PR comment
└── LOW ──► Info note in PR comment
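That routing table fits in one small function. A sketch, where only the severity-to-action mapping comes from the article and the `RoutedAction` shape is mine:

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

interface RoutedAction {
  blockMerge: boolean;
  notifySlack: boolean;
  commentLevel: "finding" | "warning" | "info";
}

// Maps a finding's severity to what the orchestrator does with it:
// CRITICAL blocks the merge and pings Slack, HIGH blocks the merge,
// MEDIUM and LOW only annotate the PR comment.
function routeSeverity(severity: Severity): RoutedAction {
  switch (severity) {
    case "CRITICAL":
      return { blockMerge: true, notifySlack: true, commentLevel: "finding" };
    case "HIGH":
      return { blockMerge: true, notifySlack: false, commentLevel: "finding" };
    case "MEDIUM":
      return { blockMerge: false, notifySlack: false, commentLevel: "warning" };
    case "LOW":
      return { blockMerge: false, notifySlack: false, commentLevel: "info" };
  }
}
```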
The orchestrator
The orchestrator runs all three agents concurrently, deduplicates findings, and posts a single comment on the PR.
```typescript
async function runReview(diffFile: string, prNumber: number) {
  const diff = await fs.readFile(diffFile, "utf-8");
  const changedFiles = parseDiffFiles(diff);

  // Run all agents in parallel
  const [styleResults, logicResults, securityResults] = await Promise.all([
    styleAgent.reviewDiff(diff),
    logicAgent.reviewWithContext(diff, changedFiles),
    securityAgent.reviewDiff(diff),
  ]);

  const allFindings = [
    ...styleResults.map((f) => ({ ...f, category: "Style" })),
    ...logicResults.map((f) => ({ ...f, category: "Logic" })),
    ...securityResults.map((f) => ({ ...f, category: "Security" })),
  ];

  // Deduplicate findings on the same line
  const deduped = deduplicateByLine(allFindings);

  // Post as single PR comment
  const comment = formatReviewComment(deduped);
  await postPRComment(prNumber, comment);

  // Fail the check if any CRITICAL or HIGH severity
  const hasBlockers = deduped.some((f) =>
    ["CRITICAL", "HIGH"].includes(f.severity)
  );
  if (hasBlockers) process.exit(1);
}
```
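`deduplicateByLine` isn't shown in the post either. A minimal sketch, assuming findings carry `file` and `line` fields; the tie-break (when two agents flag the same line, the more severe finding wins) is my choice, not the article's:

```typescript
interface Finding {
  file: string;
  line: number;
  severity: "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
  category: string;
  message: string;
}

// Lower rank = more severe.
const SEVERITY_RANK = { CRITICAL: 0, HIGH: 1, MEDIUM: 2, LOW: 3 } as const;

// Keeps one finding per file:line location. When the style agent and
// the security agent both flag line 23, the security finding survives
// because it carries the higher severity.
function deduplicateByLine(findings: Finding[]): Finding[] {
  const byLocation = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLocation.get(key);
    if (!existing || SEVERITY_RANK[f.severity] < SEVERITY_RANK[existing.severity]) {
      byLocation.set(key, f);
    }
  }
  return [...byLocation.values()];
}
```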
Here's what the PR comment actually looks like:
┌──────────────────────────────────────────────────────────┐
│ 🔍 AI Code Review Summary │
│ │
│ Findings: 4 total (1 HIGH, 2 MEDIUM, 1 LOW) │
│ Status: ❌ BLOCKED (has HIGH severity findings) │
│ │
├──────────────────────────────────────────────────────────┤
│ │
│ [HIGH] Security: SQL injection risk │
│ src/api/users.ts:23 │
│ String concatenation in query. Use parameterized query. │
│ │
│ [MEDIUM] Logic: Possible null reference │
│ src/services/org.ts:87 │
│ user.org accessed without null check. If user has no │
│ org, this throws at runtime. │
│ │
│ [MEDIUM] Style: Nested conditional │
│ src/handlers/auth.ts:42 │
│ 3 levels of nesting. Use early return pattern. │
│ │
│ [LOW] Style: Boolean naming │
│ src/handlers/auth.ts:55 │
│ Variable "active" should use is/has/should prefix. │
│ │
└──────────────────────────────────────────────────────────┘
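`formatReviewComment` is the last helper the post leaves out. A sketch that produces a markdown comment in roughly that shape; the header wording and status line follow the example above, while the `Finding` shape (repeated here so the sketch stands alone) is assumed:

```typescript
interface Finding {
  file: string;
  line: number;
  severity: "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
  category: string;
  message: string;
}

// Builds the single PR comment: a summary header plus one entry
// per finding, most relevant fields first.
function formatReviewComment(findings: Finding[]): string {
  const counts: Record<string, number> = {};
  for (const f of findings) {
    counts[f.severity] = (counts[f.severity] ?? 0) + 1;
  }
  const summary = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]
    .filter((s) => counts[s])
    .map((s) => `${counts[s]} ${s}`)
    .join(", ");
  const blocked = findings.some(
    (f) => f.severity === "CRITICAL" || f.severity === "HIGH"
  );

  const lines = [
    "## 🔍 AI Code Review Summary",
    "",
    `**Findings:** ${findings.length} total (${summary})`,
    `**Status:** ${blocked ? "❌ BLOCKED (has HIGH severity findings)" : "✅ OK to merge"}`,
    "",
  ];
  for (const f of findings) {
    lines.push(`**[${f.severity}] ${f.category}:** ${f.message}`);
    lines.push(`\`${f.file}:${f.line}\``);
    lines.push("");
  }
  return lines.join("\n");
}
```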
What I learned after running this for two months
The false positive rate was brutal at first. Around 40% of findings were noise. Two things fixed it.
Fix 1: Negative examples in prompts
I added a section to each agent's system prompt with examples of things NOT to flag.
Before: "Flag null/undefined not handled"
Agent flags: const name = user.name; ← TypeScript strict mode
already guarantees this is defined
After: "Flag null/undefined not handled.
Do NOT flag when TypeScript strict mode
guarantees the value (non-optional types)."
Agent correctly ignores it.
That one change did more than anything else I tried. False positives dropped by roughly half.
Fix 2: Feedback loop
Week 1: Week 8:
┌────────────────┐ ┌────────────────┐
│ Findings: 47 │ │ Findings: 31 │
│ Useful: 28 │ │ Useful: 27 │
│ Noise: 19 │ │ Noise: 4 │
│ │ │ │
│ FP rate: 40% │ │ FP rate: 12% │
└────────────────┘ └────────────────┘
How: When a reviewer dismisses a finding, I log it.
Every 2 weeks I review dismissed findings and update
the prompts. The system learns from what your team
actually cares about.
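One low-tech way to run that loop is an append-only log: record a JSON line whenever a reviewer dismisses a finding, then read the file back during the biweekly prompt review. The file path, record shape, and JSONL format here are all my choices, not from the original setup:

```typescript
import { appendFileSync, readFileSync, existsSync } from "node:fs";

interface DismissalRecord {
  findingId: string;
  category: string;
  message: string;
  dismissedAt: string;
}

// Hypothetical location for the feedback log.
const LOG_PATH = "review-feedback/dismissed.jsonl";

// Append one JSON line per dismissed finding.
function logDismissal(record: DismissalRecord, path: string = LOG_PATH): void {
  appendFileSync(path, JSON.stringify(record) + "\n");
}

// Read the log back for the biweekly prompt-tuning pass.
function loadDismissals(path: string = LOG_PATH): DismissalRecord[] {
  if (!existsSync(path)) return [];
  return readFileSync(path, "utf-8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));
}
```

Grouping the loaded records by category tells you which agent's prompt needs the next batch of negative examples.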
After tuning, the agents catch 2-3 real issues per week that would have made it to production. Mostly null handling edge cases and missing auth middleware on new routes.
Cost breakdown
| Agent | Model | Cost per review | Reviews/month | Monthly |
|---|---|---|---|---|
| Style | Haiku 4.5 | $0.002 | 120 | $0.24 |
| Logic | Sonnet 4.6 | $0.04 | 120 | $4.80 |
| Security | Sonnet 4.6 | $0.03 | 120 | $3.60 |
| **Total** | | $0.072 | 120 | $8.64 |
$8.64/month vs. one bug in production
The last auth bypass we caught would have been
a security incident. Our incident response process
costs roughly $2,000 in engineer-hours per event.
ROI: ~230x on the first prevented incident alone.
What I'd do differently if I started over
I wasted the first two weeks trying to get one agent to do everything. Don't bother. Start with the three-agent split from day one. The other thing I'd change: I'd add negative examples to the prompts immediately instead of waiting for the false positive rate to get bad enough to annoy me.
If you take one thing from this article:
"Review this PR" ← bad (generic output)
"Check this diff for ← good (specific, actionable)
race conditions in
async code. Cite the
exact line. Explain
what breaks."
The full source code is on my GitHub. Link in the comments.
I'm building profClaw (AI agent engine) and AskVerdict (multi-model AI verdicts) at Glincker. More at thegdsks.com