How I Built a Multi-Agent Code Review Pipeline
I was mass-approving PRs at 11pm on a Thursday. Not reading them. Just scrolling, checking if tests passed, hitting approve. We've all been there. That's when I decided to build something that would at least catch the obvious stuff before a human ever looked at it.
TL;DR
| What | How | Why it matters |
|---|---|---|
| Agent 1: Style checker | Runs on PR open via GitHub Actions | Catches formatting, naming, lint issues before review |
| Agent 2: Logic reviewer | Reads diff + context files | Flags potential bugs, missing edge cases |
| Agent 3: Security scanner | Pattern matches against OWASP top 10 | Catches SQL injection, XSS, hardcoded secrets |
| Orchestrator | Coordinates agents, posts summary comment | One clean summary instead of three noisy bots |
The problem with single-agent review
Most people who try AI code review start with one big prompt. Something like "review this PR for quality, security, and style." It works okay for small diffs. For anything over 200 lines, the output gets generic. You get the same five comments every time: "consider adding error handling," "this could be more descriptive," "consider edge cases." Not useful.
I realized I was making the same mistake with AI that bad managers make with people. You don't ask one person to check style, logic, and security at the same time. You split the work.
Single agent approach:
PR Diff ──► [ One Big Agent ] ──► Generic comments
"consider error handling"
"this could be more descriptive"
"consider edge cases"
(same 5 comments every time)
Multi-agent approach:
┌─► [ Style Agent ] ──► "Line 42: nested if, use early return"
│
PR Diff ──┼─► [ Logic Agent ] ──► "Line 87: null ref if user.org is undefined"
│
└─► [ Security Agent] ──► "Line 23: CRITICAL - SQL string concat"
│
▼
[ Orchestrator ] ──► Single PR comment, deduplicated
The architecture
Three specialized agents, one orchestrator. Each agent gets a narrow job and specific instructions about what to flag and what to ignore.
┌──────────────────────────────────────────────────────────┐
│ GitHub Actions │
│ │
│ PR Opened / Synchronized │
│ │ │
│ ▼ │
│ ┌──────────┐ │
│ │ Get Diff │ │
│ └────┬─────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Orchestrator (Node.js) │ │
│ │ │ │
│ │ Promise.all([ │ │
│ │ ┌─────────┐ ┌─────────┐ ┌──────────────┐ │ │
│ │ │ Style │ │ Logic │ │ Security │ │ │
│ │ │ Haiku │ │ Sonnet │ │ Sonnet │ │ │
│ │ │ $0.002 │ │ $0.04 │ │ $0.03 │ │ │
│ │ └────┬────┘ └────┬────┘ └──────┬───────┘ │ │
│ │ │ │ │ │ │
│ │ ▼ ▼ ▼ │ │
│ │ ┌────────────────────────────────────────┐ │ │
│ │ │ Deduplicate + Format + Post Comment │ │ │
│ │ └────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Exit 1 if CRITICAL/HIGH findings (blocks merge) │
└──────────────────────────────────────────────────────────┘
The GitHub Actions workflow:
```yaml
# .github/workflows/ai-review.yml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > pr_diff.patch
          echo "diff_file=pr_diff.patch" >> $GITHUB_OUTPUT

      - name: Run review agents
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          node scripts/run-review.js \
            --diff ${{ steps.diff.outputs.diff_file }} \
            --pr ${{ github.event.pull_request.number }}
```
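The workflow assumes `scripts/run-review.js` accepts those two flags. The entry point isn't shown in this post, so here's a minimal sketch of one way to parse them (written in TypeScript like the rest of the snippets; everything beyond the `--diff`/`--pr` flag names is my assumption):

```typescript
// Minimal CLI entry sketch. Assumes flags arrive as "--name value"
// pairs, matching how the workflow invokes the script.
function parseArgs(argv: string[]): { diff: string; pr: number } {
  const args: Record<string, string> = {};
  for (let i = 0; i < argv.length; i += 2) {
    args[argv[i].replace(/^--/, "")] = argv[i + 1];
  }
  if (!args.diff || !args.pr) {
    throw new Error("Usage: run-review --diff <file> --pr <number>");
  }
  return { diff: args.diff, pr: Number(args.pr) };
}

// In run-review.js you'd then call the orchestrator:
//   const { diff, pr } = parseArgs(process.argv.slice(2));
//   runReview(diff, pr);
```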
The style checker (cheapest agent, runs on Haiku)
This one is boring on purpose. It reads the diff and checks against your team's conventions. Not linting (your CI already does that). More like "we use early returns, not nested ifs" or "we name boolean variables with is/has/should prefixes."
```typescript
import Anthropic from "@anthropic-ai/sdk";

// One shared client; reads ANTHROPIC_API_KEY from the environment.
const anthropic = new Anthropic();

const styleAgent = {
  model: "claude-haiku-4-5-20251001",
  system: `You review code diffs for style consistency.
Rules:
- Early returns over nested conditionals
- Boolean vars start with is/has/should/can
- Max function length: 40 lines
- No default exports
Only flag violations. Do not suggest improvements
beyond the rules listed. If the diff follows all rules,
respond with "No style issues found."`,

  reviewDiff: async (diff: string) => {
    const response = await anthropic.messages.create({
      model: styleAgent.model,
      max_tokens: 1024,
      system: styleAgent.system,
      messages: [{ role: "user", content: `Review this diff:\n${diff}` }],
    });
    return parseFindings(response);
  },
};
```
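`parseFindings` never appears in the post, so here's a minimal sketch of one way it could work. Both the output format (`file:line [SEVERITY] message`, one finding per line) and the string input are my assumptions — the agents above pass the full API response, so in practice you'd first pull the text out of `response.content` before calling this:

```typescript
interface Finding {
  file: string;
  line: number;
  severity: "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
  message: string;
}

// Parses agent output of the (assumed) form:
//   src/api/users.ts:23 [HIGH] SQL injection risk: ...
// Lines that don't match (prose, "No style issues found") are ignored.
function parseFindings(text: string): Finding[] {
  const pattern = /^(.+?):(\d+)\s+\[(CRITICAL|HIGH|MEDIUM|LOW)\]\s+(.*)$/;
  const findings: Finding[] = [];
  for (const line of text.split("\n")) {
    const m = line.trim().match(pattern);
    if (m) {
      findings.push({
        file: m[1],
        line: Number(m[2]),
        severity: m[3] as Finding["severity"],
        message: m[4],
      });
    }
  }
  return findings;
}
```

Whatever format you pick, put it in every agent's system prompt so all three produce parseable output.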
I use Haiku for this one. It's fast, cheap, and style checking doesn't need deep reasoning. About $0.002 per review.
The logic reviewer (this is the one that earns its keep)
It reads the diff plus surrounding context (the full files being modified) and looks for actual bugs. This is where Haiku won't cut it.
What the logic agent sees:
┌─────────────────────────────────┐
│ Context: Full files being │
│ modified (so it understands │
│ the surrounding code) │
│ │
│ ┌─────────────────────────────┐ │
│ │ Diff: Only the changed │ │
│ │ lines (what it's reviewing) │ │
│ └─────────────────────────────┘ │
│ │
│ Focus areas: │
│ ○ Off-by-one errors │
│ ○ Null/undefined not handled │
│ ○ Race conditions in async │
│ ○ Missing error propagation │
│ ○ Unexpected state mutations │
└─────────────────────────────────┘
```typescript
import fs from "node:fs/promises";

const logicAgent = {
  model: "claude-sonnet-4-6-20250514",
  system: `You review code for logical correctness.
Focus on:
- Off-by-one errors
- Null/undefined not handled
- Race conditions in async code
- Missing error propagation
- State mutations in unexpected places
Ignore: style, formatting, naming.
For each issue, cite the exact line and explain
what could go wrong with a concrete example.`,

  reviewWithContext: async (diff: string, contextFiles: string[]) => {
    // Read every modified file in full so the model sees surrounding code
    const fileContents = await Promise.all(
      contextFiles.map((f) => fs.readFile(f, "utf-8"))
    );
    const context = contextFiles
      .map((f, i) => `--- ${f} ---\n${fileContents[i]}`)
      .join("\n\n");

    const response = await anthropic.messages.create({
      model: logicAgent.model,
      max_tokens: 2048,
      system: logicAgent.system,
      messages: [
        {
          role: "user",
          content: `Context files:\n${context}\n\nDiff to review:\n${diff}`,
        },
      ],
    });
    return parseFindings(response);
  },
};
```
Sonnet for this one. It catches things that surprise me. Last week it flagged a race condition in a WebSocket handler where two messages could arrive between an async read and write. I'd have missed that.
The security scanner
Honestly this one is the simplest. Pattern matching against common vulnerabilities. Not a replacement for a real security audit, but it catches the dumb stuff that slips through at 11pm.
```typescript
const securityAgent = {
  model: "claude-sonnet-4-6-20250514",
  system: `You scan code diffs for security vulnerabilities.
Check for:
- SQL injection (string concatenation in queries)
- XSS (unescaped user input in HTML)
- Hardcoded secrets, API keys, passwords
- Path traversal (user input in file paths)
- Insecure deserialization
- Missing auth checks on new endpoints
Severity levels: CRITICAL, HIGH, MEDIUM, LOW.
Only report findings with a severity level.
No general advice.`,

  // Same shape as the style agent; the orchestrator calls
  // reviewDiff on each agent it fans out to.
  reviewDiff: async (diff: string) => {
    const response = await anthropic.messages.create({
      model: securityAgent.model,
      max_tokens: 2048,
      system: securityAgent.system,
      messages: [{ role: "user", content: `Review this diff:\n${diff}` }],
    });
    return parseFindings(response);
  },
};
```
The severity levels matter. The orchestrator uses them to decide whether to block the merge. CRITICAL and HIGH = merge blocked. MEDIUM and LOW = warning in the comment but merge allowed.
Severity routing:
Finding
│
├── CRITICAL ──► Block merge + notify Slack
├── HIGH ──► Block merge
├── MEDIUM ──► Warning in PR comment
└── LOW ──► Info note in PR comment
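That routing table fits in one small function. A sketch, where only the severity-to-action mapping comes from the article and the `RoutedAction` shape is mine:

```typescript
type Severity = "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";

interface RoutedAction {
  blockMerge: boolean;
  notifySlack: boolean;
  commentLevel: "finding" | "warning" | "info";
}

// Maps a finding's severity to what the orchestrator does with it:
// CRITICAL blocks the merge and pings Slack, HIGH blocks the merge,
// MEDIUM and LOW only annotate the PR comment.
function routeSeverity(severity: Severity): RoutedAction {
  switch (severity) {
    case "CRITICAL":
      return { blockMerge: true, notifySlack: true, commentLevel: "finding" };
    case "HIGH":
      return { blockMerge: true, notifySlack: false, commentLevel: "finding" };
    case "MEDIUM":
      return { blockMerge: false, notifySlack: false, commentLevel: "warning" };
    case "LOW":
      return { blockMerge: false, notifySlack: false, commentLevel: "info" };
  }
}
```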
The orchestrator
The orchestrator runs all three agents concurrently, deduplicates findings, and posts a single comment on the PR.
```typescript
async function runReview(diffFile: string, prNumber: number) {
  const diff = await fs.readFile(diffFile, "utf-8");
  const changedFiles = parseDiffFiles(diff);

  // Run all agents in parallel
  const [styleResults, logicResults, securityResults] = await Promise.all([
    styleAgent.reviewDiff(diff),
    logicAgent.reviewWithContext(diff, changedFiles),
    securityAgent.reviewDiff(diff),
  ]);

  const allFindings = [
    ...styleResults.map((f) => ({ ...f, category: "Style" })),
    ...logicResults.map((f) => ({ ...f, category: "Logic" })),
    ...securityResults.map((f) => ({ ...f, category: "Security" })),
  ];

  // Deduplicate findings on the same line
  const deduped = deduplicateByLine(allFindings);

  // Post as single PR comment
  const comment = formatReviewComment(deduped);
  await postPRComment(prNumber, comment);

  // Fail the check if any CRITICAL or HIGH severity
  const hasBlockers = deduped.some((f) =>
    ["CRITICAL", "HIGH"].includes(f.severity)
  );
  if (hasBlockers) process.exit(1);
}
```
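`deduplicateByLine` isn't shown in the post either. A minimal sketch, assuming findings carry `file` and `line` fields; the tie-break (when two agents flag the same line, the more severe finding wins) is my choice, not the article's:

```typescript
interface Finding {
  file: string;
  line: number;
  severity: "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
  category: string;
  message: string;
}

// Lower rank = more severe.
const SEVERITY_RANK = { CRITICAL: 0, HIGH: 1, MEDIUM: 2, LOW: 3 } as const;

// Keeps one finding per file:line location. When the style agent and
// the security agent both flag line 23, the security finding survives
// because it carries the higher severity.
function deduplicateByLine(findings: Finding[]): Finding[] {
  const byLocation = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLocation.get(key);
    if (!existing || SEVERITY_RANK[f.severity] < SEVERITY_RANK[existing.severity]) {
      byLocation.set(key, f);
    }
  }
  return [...byLocation.values()];
}
```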
Here's what the PR comment actually looks like:
┌──────────────────────────────────────────────────────────┐
│ 🔍 AI Code Review Summary │
│ │
│ Findings: 4 total (1 HIGH, 2 MEDIUM, 1 LOW) │
│ Status: ❌ BLOCKED (has HIGH severity findings) │
│ │
├──────────────────────────────────────────────────────────┤
│ │
│ [HIGH] Security: SQL injection risk │
│ src/api/users.ts:23 │
│ String concatenation in query. Use parameterized query. │
│ │
│ [MEDIUM] Logic: Possible null reference │
│ src/services/org.ts:87 │
│ user.org accessed without null check. If user has no │
│ org, this throws at runtime. │
│ │
│ [MEDIUM] Style: Nested conditional │
│ src/handlers/auth.ts:42 │
│ 3 levels of nesting. Use early return pattern. │
│ │
│ [LOW] Style: Boolean naming │
│ src/handlers/auth.ts:55 │
│ Variable "active" should use is/has/should prefix. │
│ │
└──────────────────────────────────────────────────────────┘
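`formatReviewComment` is the last helper the post leaves out. A sketch that produces a markdown comment in roughly that shape; the header wording and status line follow the example above, while the `Finding` shape (repeated here so the sketch stands alone) is assumed:

```typescript
interface Finding {
  file: string;
  line: number;
  severity: "CRITICAL" | "HIGH" | "MEDIUM" | "LOW";
  category: string;
  message: string;
}

// Builds the single PR comment: a summary header plus one entry
// per finding, most relevant fields first.
function formatReviewComment(findings: Finding[]): string {
  const counts: Record<string, number> = {};
  for (const f of findings) {
    counts[f.severity] = (counts[f.severity] ?? 0) + 1;
  }
  const summary = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]
    .filter((s) => counts[s])
    .map((s) => `${counts[s]} ${s}`)
    .join(", ");
  const blocked = findings.some(
    (f) => f.severity === "CRITICAL" || f.severity === "HIGH"
  );

  const lines = [
    "## 🔍 AI Code Review Summary",
    "",
    `**Findings:** ${findings.length} total (${summary})`,
    `**Status:** ${blocked ? "❌ BLOCKED (has HIGH severity findings)" : "✅ OK to merge"}`,
    "",
  ];
  for (const f of findings) {
    lines.push(`**[${f.severity}] ${f.category}:** ${f.message}`);
    lines.push(`\`${f.file}:${f.line}\``);
    lines.push("");
  }
  return lines.join("\n");
}
```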
What I learned after running this for two months
The false positive rate was brutal at first. Around 40% of findings were noise. Two things fixed it.
Fix 1: Negative examples in prompts
I added a section to each agent's system prompt with examples of things NOT to flag.
Before: "Flag null/undefined not handled"
Agent flags: const name = user.name; ← TypeScript strict mode
already guarantees this is defined
After: "Flag null/undefined not handled.
Do NOT flag when TypeScript strict mode
guarantees the value (non-optional types)."
Agent correctly ignores it.
That one change did more than anything else I tried. False positives dropped by roughly half.
Fix 2: Feedback loop
Week 1: Week 8:
┌────────────────┐ ┌────────────────┐
│ Findings: 47 │ │ Findings: 31 │
│ Useful: 28 │ │ Useful: 27 │
│ Noise: 19 │ │ Noise: 4 │
│ │ │ │
│ FP rate: 40% │ │ FP rate: 12% │
└────────────────┘ └────────────────┘
How: When a reviewer dismisses a finding, I log it.
Every 2 weeks I review dismissed findings and update
the prompts. The system learns from what your team
actually cares about.
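One low-tech way to run that loop is an append-only log: record a JSON line whenever a reviewer dismisses a finding, then read the file back during the biweekly prompt review. The file path, record shape, and JSONL format here are all my choices, not from the original setup:

```typescript
import { appendFileSync, readFileSync, existsSync } from "node:fs";

interface DismissalRecord {
  findingId: string;
  category: string;
  message: string;
  dismissedAt: string;
}

// Hypothetical location for the feedback log.
const LOG_PATH = "review-feedback/dismissed.jsonl";

// Append one JSON line per dismissed finding.
function logDismissal(record: DismissalRecord, path: string = LOG_PATH): void {
  appendFileSync(path, JSON.stringify(record) + "\n");
}

// Read the log back for the biweekly prompt-tuning pass.
function loadDismissals(path: string = LOG_PATH): DismissalRecord[] {
  if (!existsSync(path)) return [];
  return readFileSync(path, "utf-8")
    .split("\n")
    .filter(Boolean)
    .map((line) => JSON.parse(line));
}
```

Grouping the loaded records by category tells you which agent's prompt needs the next batch of negative examples.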
After tuning, the agents catch 2-3 real issues per week that would have made it to production. Mostly null handling edge cases and missing auth middleware on new routes.
Cost breakdown
| Agent | Model | Cost per review | Reviews/month | Monthly |
|---|---|---|---|---|
| Style | Haiku 4.5 | $0.002 | 120 | $0.24 |
| Logic | Sonnet 4.6 | $0.04 | 120 | $4.80 |
| Security | Sonnet 4.6 | $0.03 | 120 | $3.60 |
| **Total** | | $0.072 | 120 | $8.64 |
$8.64/month vs. one bug in production
The last auth bypass we caught would have been
a security incident. Our incident response process
costs roughly $2,000 in engineer-hours per event.
ROI: ~230x on the first prevented incident alone.
What I'd do differently if I started over
I wasted the first two weeks trying to get one agent to do everything. Don't bother. Start with the three-agent split from day one. The other thing I'd change: I'd add negative examples to the prompts immediately instead of waiting for the false positive rate to get bad enough to annoy me.
If you take one thing from this article:
"Review this PR" ← bad (generic output)
"Check this diff for ← good (specific, actionable)
race conditions in
async code. Cite the
exact line. Explain
what breaks."
The full source code is on my GitHub. Link in the comments.
I'm building profClaw (AI agent engine) and AskVerdict (multi-model AI verdicts) at Glincker. More at thegdsks.com