I was mass-approving PRs at 11pm on a Thursday. Not reading them. Just scrolling, checking if tests passed, hitting approve. We've all been there. That's when I decided to build something that would at least catch the obvious stuff before a human ever looked at it.
## TL;DR
| What | How | Why it matters |
|---|---|---|
| Agent 1: Style checker | Runs on PR open via GitHub Actions | Catches formatting, naming, lint issues before review |
| Agent 2: Logic reviewer | Reads diff + context files | Flags potential bugs, missing edge cases |
| Agent 3: Security scanner | Pattern matches against OWASP top 10 | Catches SQL injection, XSS, hardcoded secrets |
| Orchestrator | Coordinates agents, posts summary comment | One clean summary instead of three noisy bots |
## The problem with single-agent review
Most people who try AI code review start with one big prompt. Something like "review this PR for quality, security, and style." It works okay for small diffs. For anything over 200 lines, the output gets generic. You get the same five comments every time: "consider adding error handling," "this could be more descriptive," "consider edge cases." Not useful.
The fix is the same as with human teams. You don't ask one person to check style, logic, and security simultaneously. You split the work.
## The architecture
Three specialized agents, one orchestrator. Each agent gets a narrow job and specific instructions about what to flag and what to ignore.
```yaml
# .github/workflows/ai-review.yml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > pr_diff.patch
          echo "diff_file=pr_diff.patch" >> $GITHUB_OUTPUT

      - name: Run review agents
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          node scripts/run-review.js \
            --diff ${{ steps.diff.outputs.diff_file }} \
            --pr ${{ github.event.pull_request.number }}
```
## Agent 1: The style checker
This one is simple. It reads the diff and checks against your team's conventions. Not linting (your CI already does that). More like "we use early returns, not nested ifs" or "we name boolean variables with is/has/should prefixes."
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the env

const styleAgent = {
  model: "claude-haiku-4-5-20251001",
  system: `You review code diffs for style consistency.

Rules:
- Early returns over nested conditionals
- Boolean vars start with is/has/should/can
- Max function length: 40 lines
- No default exports

Only flag violations. Do not suggest improvements
beyond the rules listed. If the diff follows all rules,
respond with "No style issues found."`,

  reviewDiff: async (diff: string) => {
    const response = await anthropic.messages.create({
      model: styleAgent.model,
      max_tokens: 1024,
      system: styleAgent.system,
      messages: [{ role: "user", content: `Review this diff:\n${diff}` }],
    });
    return parseFindings(response);
  },
};
```
I use Haiku for this one. It's fast, cheap, and style checking doesn't need deep reasoning. About $0.002 per review.
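Every agent funnels its reply through a `parseFindings` helper that I haven't shown yet. Here's a minimal sketch; it assumes the agents are prompted to emit one finding per line in the form `L<line>: [SEVERITY] message` (that line format, the `Finding` shape, and passing in the reply text such as `response.content[0].text` are my conventions, not anything the SDK requires):

```typescript
// Hypothetical parser for agent replies. Assumes one finding per line,
// formatted as "L<line>: [SEVERITY] message" with the severity optional.
interface Finding {
  line: number;
  message: string;
  severity?: string;
}

function parseFindings(text: string): Finding[] {
  const findings: Finding[] = [];
  for (const raw of text.split("\n")) {
    // Matches e.g. "L42: [HIGH] query built by string concatenation"
    const m = raw.match(/^L(\d+):\s*(?:\[(CRITICAL|HIGH|MEDIUM|LOW)\]\s*)?(.+)$/);
    if (!m) continue; // skips prose lines like "No style issues found."
    findings.push({ line: Number(m[1]), severity: m[2], message: m[3].trim() });
  }
  return findings;
}
```

The strict line format matters more than it looks: a free-form reply is impossible to deduplicate or group later, so each agent's system prompt should also pin down this output shape.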
## Agent 2: The logic reviewer
This is the one that actually needs to think. It reads the diff plus surrounding context (the full files being modified) and looks for logic bugs.
```typescript
import { promises as fs } from "node:fs";

const logicAgent = {
  model: "claude-sonnet-4-6-20250514",
  system: `You review code for logical correctness.

Focus on:
- Off-by-one errors
- Null/undefined not handled
- Race conditions in async code
- Missing error propagation
- State mutations in unexpected places

Ignore: style, formatting, naming.
For each issue, cite the exact line and explain
what could go wrong with a concrete example.`,

  reviewWithContext: async (diff: string, contextFiles: string[]) => {
    const fileContents = await Promise.all(
      contextFiles.map((f) => fs.readFile(f, "utf-8"))
    );
    const context = contextFiles
      .map((f, i) => `--- ${f} ---\n${fileContents[i]}`)
      .join("\n\n");

    const response = await anthropic.messages.create({
      model: logicAgent.model,
      max_tokens: 2048,
      system: logicAgent.system,
      messages: [
        {
          role: "user",
          content: `Context files:\n${context}\n\nDiff to review:\n${diff}`,
        },
      ],
    });
    return parseFindings(response);
  },
};
```
Sonnet for this one. It catches things that surprise me. Last week it flagged a race condition in a WebSocket handler where two messages could arrive between an async read and write. I'd have missed that.
## Agent 3: The security scanner
Pattern matching against common vulnerabilities. Not a replacement for a real security audit, but it catches the dumb stuff.
```typescript
const securityAgent = {
  model: "claude-sonnet-4-6-20250514",
  system: `You scan code diffs for security vulnerabilities.

Check for:
- SQL injection (string concatenation in queries)
- XSS (unescaped user input in HTML)
- Hardcoded secrets, API keys, passwords
- Path traversal (user input in file paths)
- Insecure deserialization
- Missing auth checks on new endpoints

Severity levels: CRITICAL, HIGH, MEDIUM, LOW.
Only report findings with a severity level.
No general advice.`,

  // The orchestrator below calls this, so the agent needs it too
  reviewDiff: async (diff: string) => {
    const response = await anthropic.messages.create({
      model: securityAgent.model,
      max_tokens: 1024,
      system: securityAgent.system,
      messages: [{ role: "user", content: `Scan this diff:\n${diff}` }],
    });
    return parseFindings(response);
  },
};
```
## The orchestrator
The orchestrator runs all three agents concurrently, deduplicates findings, and posts a single comment on the PR.
```typescript
async function runReview(diffFile: string, prNumber: number) {
  const diff = await fs.readFile(diffFile, "utf-8");
  const changedFiles = parseDiffFiles(diff);

  // Run all agents in parallel
  const [styleResults, logicResults, securityResults] = await Promise.all([
    styleAgent.reviewDiff(diff),
    logicAgent.reviewWithContext(diff, changedFiles),
    securityAgent.reviewDiff(diff),
  ]);

  const allFindings = [
    ...styleResults.map((f) => ({ ...f, category: "Style" })),
    ...logicResults.map((f) => ({ ...f, category: "Logic" })),
    ...securityResults.map((f) => ({ ...f, category: "Security" })),
  ];

  // Deduplicate findings on the same line
  const deduped = deduplicateByLine(allFindings);

  // Post as a single PR comment
  const comment = formatReviewComment(deduped);
  await postPRComment(prNumber, comment);

  // Fail the check if any finding is CRITICAL or HIGH severity
  const hasBlockers = deduped.some((f) =>
    ["CRITICAL", "HIGH"].includes(f.severity)
  );
  if (hasBlockers) process.exit(1);
}
```
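The orchestrator leans on `deduplicateByLine` and `formatReviewComment`, which I haven't shown. One plausible sketch follows; the `Finding` shape, the rule of preferring severity-tagged (security) findings when agents overlap, and the comment layout are my assumptions:

```typescript
interface Finding {
  file: string;
  line: number;
  message: string;
  category: string;
  severity?: string;
}

// Keep one finding per file:line. When two agents flag the same line,
// prefer the one that carries a severity (i.e. the security finding).
function deduplicateByLine(findings: Finding[]): Finding[] {
  const byLine = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLine.get(key);
    if (!existing || (f.severity && !existing.severity)) {
      byLine.set(key, f);
    }
  }
  return [...byLine.values()];
}

// Render one markdown comment, grouped by category, so the PR gets a
// single summary instead of three separate bot comments.
function formatReviewComment(findings: Finding[]): string {
  if (findings.length === 0) return "AI review: no issues found.";
  const sections: string[] = ["## AI review findings"];
  for (const category of ["Security", "Logic", "Style"]) {
    const group = findings.filter((f) => f.category === category);
    if (group.length === 0) continue;
    sections.push(`### ${category}`);
    for (const f of group) {
      const tag = f.severity ? ` **[${f.severity}]**` : "";
      sections.push(`- \`${f.file}:${f.line}\`${tag} ${f.message}`);
    }
  }
  return sections.join("\n");
}
```

Security comes first in the comment on purpose: if a reviewer only reads the first section, it should be the one that can block the merge.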
## What I learned after running this for two months
The false positive rate was brutal at first. Around 40% of findings were noise. Two things fixed it:
1. **Negative examples.** I added a section to each agent's system prompt with examples of things NOT to flag. "Do not flag missing null checks on values that TypeScript's strict mode already guarantees." That cut false positives in half.
2. **Feedback loop.** When a reviewer dismisses an AI finding, I log it. Every two weeks I review the dismissed findings and update the prompts. The system gets better because it learns from what your team actually cares about.
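The post doesn't show what that feedback loop looks like in code, so here's a minimal sketch. It assumes dismissed findings are appended to a JSONL file; the file path, the `Dismissal` fields, and the category tally are all illustrative:

```typescript
import { appendFileSync, readFileSync } from "node:fs";

interface Dismissal {
  category: string; // "Style" | "Logic" | "Security"
  message: string;
  dismissedAt: string; // ISO date
}

// Append one dismissed finding per line: cheap to write from the bot,
// easy to grep later.
function logDismissal(path: string, d: Dismissal): void {
  appendFileSync(path, JSON.stringify(d) + "\n");
}

// Tally dismissals by category; run this every two weeks to see which
// agent's prompt needs tightening.
// Usage: tallyDismissals(readFileSync("dismissals.jsonl", "utf-8"))
function tallyDismissals(jsonl: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    const d: Dismissal = JSON.parse(line);
    counts[d.category] = (counts[d.category] ?? 0) + 1;
  }
  return counts;
}
```

A flat file is enough here because the log is only read by a human every two weeks; reach for a database only if you start automating the prompt updates.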
After tuning, false positives dropped to about 12%. The agents now catch 2-3 real issues per week that would have made it to production. Mostly null handling edge cases and missing auth middleware on new routes.
## Cost
| Agent | Model | Avg cost per review | Reviews/month | Monthly cost |
|---|---|---|---|---|
| Style | Haiku 4.5 | $0.002 | 120 | $0.24 |
| Logic | Sonnet 4.6 | $0.04 | 120 | $4.80 |
| Security | Sonnet 4.6 | $0.03 | 120 | $3.60 |
| **Total** | | $0.072 | 120 | ~$8.64 |
That's cheaper than one missed bug in production.
## The bottom line
Split your AI review into specialized agents instead of one big prompt. Give each agent a narrow job, concrete rules, and examples of what not to flag. Run them in parallel. Post one clean summary. Tune the prompts every two weeks based on dismissed findings.
The full source code is on my GitHub. Link in the comments.
I build developer tools at Glincker where we think a lot about how AI fits into real engineering workflows.