GDS K S

How I Built a Multi-Agent Code Review Pipeline

I was mass-approving PRs at 11pm on a Thursday. Not reading them. Just scrolling, checking if tests passed, hitting approve. We've all been there. That's when I decided to build something that would at least catch the obvious stuff before a human ever looked at it.

TL;DR

| What | How | Why it matters |
|---|---|---|
| Agent 1: Style checker | Runs on PR open via GitHub Actions | Catches formatting, naming, lint issues before review |
| Agent 2: Logic reviewer | Reads diff + context files | Flags potential bugs, missing edge cases |
| Agent 3: Security scanner | Pattern matches against OWASP Top 10 | Catches SQL injection, XSS, hardcoded secrets |
| Orchestrator | Coordinates agents, posts summary comment | One clean summary instead of three noisy bots |

The problem with single-agent review

Most people who try AI code review start with one big prompt. Something like "review this PR for quality, security, and style." It works okay for small diffs. For anything over 200 lines, the output gets generic. You get the same five comments every time: "consider adding error handling," "this could be more descriptive," "consider edge cases." Not useful.

The fix is the same as with human teams. You don't ask one person to check style, logic, and security simultaneously. You split the work.

The architecture

Three specialized agents, one orchestrator. Each agent gets a narrow job and specific instructions about what to flag and what to ignore.

```yaml
# .github/workflows/ai-review.yml
name: AI Code Review
on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > pr_diff.patch
          echo "diff_file=pr_diff.patch" >> $GITHUB_OUTPUT

      - name: Run review agents
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          node scripts/run-review.js \
            --diff ${{ steps.diff.outputs.diff_file }} \
            --pr ${{ github.event.pull_request.number }}
```

Agent 1: The style checker

This one is simple. It reads the diff and checks against your team's conventions. Not linting (your CI already does that). More like "we use early returns, not nested ifs" or "we name boolean variables with is/has/should prefixes."

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment
const anthropic = new Anthropic();

const styleAgent = {
  model: "claude-haiku-4-5-20251001",
  system: `You review code diffs for style consistency.
Rules:
- Early returns over nested conditionals
- Boolean vars start with is/has/should/can
- Max function length: 40 lines
- No default exports

Only flag violations. Do not suggest improvements
beyond the rules listed. If the diff follows all rules,
respond with "No style issues found."`,

  reviewDiff: async (diff: string) => {
    const response = await anthropic.messages.create({
      model: styleAgent.model,
      max_tokens: 1024,
      system: styleAgent.system,
      messages: [{ role: "user", content: `Review this diff:\n${diff}` }],
    });
    return parseFindings(response);
  },
};
```

I use Haiku for this one. It's fast, cheap, and style checking doesn't need deep reasoning. About $0.002 per review.

Agent 2: The logic reviewer

This is the one that actually needs to think. It reads the diff plus surrounding context (the full files being modified) and looks for logic bugs.

```typescript
import { promises as fs } from "node:fs";

const logicAgent = {
  model: "claude-sonnet-4-6-20250514",
  system: `You review code for logical correctness.
Focus on:
- Off-by-one errors
- Null/undefined not handled
- Race conditions in async code
- Missing error propagation
- State mutations in unexpected places

Ignore: style, formatting, naming.
For each issue, cite the exact line and explain
what could go wrong with a concrete example.`,

  reviewWithContext: async (diff: string, contextFiles: string[]) => {
    const fileContents = await Promise.all(
      contextFiles.map((f) => fs.readFile(f, "utf-8"))
    );

    const context = contextFiles
      .map((f, i) => `--- ${f} ---\n${fileContents[i]}`)
      .join("\n\n");

    const response = await anthropic.messages.create({
      model: logicAgent.model,
      max_tokens: 2048,
      system: logicAgent.system,
      messages: [
        {
          role: "user",
          content: `Context files:\n${context}\n\nDiff to review:\n${diff}`,
        },
      ],
    });
    return parseFindings(response);
  },
};
```

Sonnet for this one. It catches things that surprise me. Last week it flagged a race condition in a WebSocket handler where two messages could arrive between an async read and write. I'd have missed that.

Agent 3: The security scanner

Pattern matching against common vulnerabilities. Not a replacement for a real security audit, but it catches the dumb stuff.

```typescript
const securityAgent = {
  model: "claude-sonnet-4-6-20250514",
  system: `You scan code diffs for security vulnerabilities.
Check for:
- SQL injection (string concatenation in queries)
- XSS (unescaped user input in HTML)
- Hardcoded secrets, API keys, passwords
- Path traversal (user input in file paths)
- Insecure deserialization
- Missing auth checks on new endpoints

Severity levels: CRITICAL, HIGH, MEDIUM, LOW.
Only report findings with a severity level.
No general advice.`,

  // Same shape as styleAgent.reviewDiff; the orchestrator calls this below
  reviewDiff: async (diff: string) => {
    const response = await anthropic.messages.create({
      model: securityAgent.model,
      max_tokens: 2048,
      system: securityAgent.system,
      messages: [{ role: "user", content: `Review this diff:\n${diff}` }],
    });
    return parseFindings(response);
  },
};
```
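For reference, `parseFindings` shows up in every agent but I haven't shown it. Here's roughly what mine does, assuming each agent emits one finding per line in a `file:line [SEVERITY] message` format. The exact format and the `Finding` shape are up to you:

```typescript
// One parsed finding. The severity field is optional because the
// style and logic agents don't emit severity levels.
interface Finding {
  file: string;
  line: number;
  severity?: string;
  message: string;
}

// Assumes the model emits one finding per line, e.g.
// "src/db.ts:42 [HIGH] SQL built by string concatenation".
// Lines that don't match (like "No style issues found.") are dropped.
function parseFindings(response: {
  content: { type: string; text?: string }[];
}): Finding[] {
  const text = response.content
    .filter((block) => block.type === "text")
    .map((block) => block.text ?? "")
    .join("\n");

  const findings: Finding[] = [];
  for (const raw of text.split("\n")) {
    const match = raw.match(/^(.+?):(\d+)\s+(?:\[(\w+)\]\s+)?(.+)$/);
    if (match) {
      findings.push({
        file: match[1],
        line: Number(match[2]),
        severity: match[3],
        message: match[4],
      });
    }
  }
  return findings;
}
```

A structured output schema would be more robust than line parsing, but this has been good enough in practice.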

The orchestrator

The orchestrator runs all three agents concurrently, deduplicates findings, and posts a single comment on the PR.

```typescript
async function runReview(diffFile: string, prNumber: number) {
  const diff = await fs.readFile(diffFile, "utf-8");
  const changedFiles = parseDiffFiles(diff);

  // Run all agents in parallel
  const [styleResults, logicResults, securityResults] = await Promise.all([
    styleAgent.reviewDiff(diff),
    logicAgent.reviewWithContext(diff, changedFiles),
    securityAgent.reviewDiff(diff),
  ]);

  const allFindings = [
    ...styleResults.map((f) => ({ ...f, category: "Style" })),
    ...logicResults.map((f) => ({ ...f, category: "Logic" })),
    ...securityResults.map((f) => ({ ...f, category: "Security" })),
  ];

  // Deduplicate findings on the same line
  const deduped = deduplicateByLine(allFindings);

  // Post as single PR comment
  const comment = formatReviewComment(deduped);
  await postPRComment(prNumber, comment);

  // Fail the check if any CRITICAL or HIGH severity
  const hasBlockers = deduped.some((f) =>
    ["CRITICAL", "HIGH"].includes(f.severity)
  );
  if (hasBlockers) process.exit(1);
}
```
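`deduplicateByLine` is the quietly important part. When two agents flag the same line, you want one finding, not two, and you want the scarier one to win. One way to sketch it, keeping a single finding per `file:line` and preferring the highest severity (the `Finding` shape and rank table here are my conventions, not a library's):

```typescript
interface Finding {
  file: string;
  line: number;
  severity?: string;
  category?: string;
  message: string;
}

// Findings without a severity (style, logic) rank below all severity levels,
// so a Security CRITICAL is never drowned out by a Style nit on the same line.
const SEVERITY_RANK: Record<string, number> = {
  CRITICAL: 4,
  HIGH: 3,
  MEDIUM: 2,
  LOW: 1,
};

function deduplicateByLine(findings: Finding[]): Finding[] {
  const byLocation = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLocation.get(key);
    const rank = SEVERITY_RANK[f.severity ?? ""] ?? 0;
    const existingRank = existing
      ? SEVERITY_RANK[existing.severity ?? ""] ?? 0
      : -1;
    if (!existing || rank > existingRank) {
      byLocation.set(key, f);
    }
  }
  return [...byLocation.values()];
}
```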

What I learned after running this for two months

The false positive rate was brutal at first. Around 40% of findings were noise. Two things fixed it:

1. Negative examples. I added a section to each agent's system prompt with examples of things NOT to flag. "Do not flag missing null checks on values that TypeScript's strict mode already guarantees." That cut false positives in half.

2. Feedback loop. When a reviewer dismisses an AI finding, I log it. Every two weeks I review the dismissed findings and update the prompts. The system gets better because it learns from what your team actually cares about.
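The logging half of that loop can be as simple as appending each dismissed finding to a JSONL file that I read during the biweekly tuning pass. The `DismissedFinding` shape here is just one way to structure it:

```typescript
import { appendFileSync } from "node:fs";

// One dismissed finding, captured when a reviewer resolves an AI
// comment without acting on it. Fields are illustrative.
interface DismissedFinding {
  category: string; // "Style" | "Logic" | "Security"
  file: string;
  line: number;
  message: string;
  dismissedBy: string;
  dismissedAt: string; // ISO timestamp
}

// Append-only JSONL: one JSON object per line, trivial to grep and
// aggregate when updating the agent prompts.
function logDismissal(
  finding: DismissedFinding,
  logPath = "dismissed-findings.jsonl"
) {
  appendFileSync(logPath, JSON.stringify(finding) + "\n");
}
```

Grouping the log by `category` and `message` pattern makes it obvious which prompt rule is generating the noise.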

After tuning, false positives dropped to about 12%. The agents now catch 2-3 real issues per week that would have made it to production. Mostly null handling edge cases and missing auth middleware on new routes.

Cost

| Agent | Model | Avg cost per review | Reviews/month |
|---|---|---|---|
| Style | Haiku 4.5 | $0.002 | 120 |
| Logic | Sonnet 4.6 | $0.04 | 120 |
| Security | Sonnet 4.6 | $0.03 | 120 |
| **Total** | | | **~$8.64/month** |

That's cheaper than one missed bug in production.

The bottom line

Split your AI review into specialized agents instead of one big prompt. Give each agent a narrow job, concrete rules, and examples of what not to flag. Run them in parallel. Post one clean summary. Tune the prompts every two weeks based on dismissed findings.

The full source code is on my GitHub. Link in the comments.


I build developer tools at Glincker where we think a lot about how AI fits into real engineering workflows.

