I was mass-approving PRs at 11pm on a Thursday. Not reading them. Just scrolling, checking if tests passed, hitting approve. We've all been there. That's when I decided to build something that would at least catch the obvious stuff before a human ever looked at it.
## TL;DR
| What | How | Why it matters |
|---|---|---|
| Agent 1: Style checker | Runs on PR open via GitHub Actions | Catches formatting, naming, lint issues before review |
| Agent 2: Logic reviewer | Reads diff + context files | Flags potential bugs, missing edge cases |
| Agent 3: Security scanner | Pattern matches against OWASP top 10 | Catches SQL injection, XSS, hardcoded secrets |
| Orchestrator | Coordinates agents, posts summary comment | One clean summary instead of three noisy bots |
## The problem with single-agent review
Most people who try AI code review start with one big prompt. Something like "review this PR for quality, security, and style." It works okay for small diffs. For anything over 200 lines, the output gets generic. You get the same five comments every time: "consider adding error handling," "this could be more descriptive," "consider edge cases." Not useful.
The fix is the same as with human teams. You don't ask one person to check style, logic, and security simultaneously. You split the work.
## The architecture
Three specialized agents, one orchestrator. Each agent gets a narrow job and specific instructions about what to flag and what to ignore.
```yaml
# .github/workflows/ai-review.yml
name: AI Code Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  ai-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Get PR diff
        id: diff
        run: |
          git diff origin/${{ github.base_ref }}...HEAD > pr_diff.patch
          echo "diff_file=pr_diff.patch" >> $GITHUB_OUTPUT

      - name: Run review agents
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          node scripts/run-review.js \
            --diff ${{ steps.diff.outputs.diff_file }} \
            --pr ${{ github.event.pull_request.number }}
```
## Agent 1: The style checker
This one is simple. It reads the diff and checks against your team's conventions. Not linting (your CI already does that). More like "we use early returns, not nested ifs" or "we name boolean variables with is/has/should prefixes."
```typescript
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the env

const styleAgent = {
  model: "claude-haiku-4-5-20251001",
  system: `You review code diffs for style consistency.

Rules:
- Early returns over nested conditionals
- Boolean vars start with is/has/should/can
- Max function length: 40 lines
- No default exports

Only flag violations. Do not suggest improvements
beyond the rules listed. If the diff follows all rules,
respond with "No style issues found."`,

  reviewDiff: async (diff: string) => {
    const response = await anthropic.messages.create({
      model: styleAgent.model,
      max_tokens: 1024,
      system: styleAgent.system,
      messages: [{ role: "user", content: `Review this diff:\n${diff}` }],
    });
    return parseFindings(response);
  },
};
```
I use Haiku for this one. It's fast, cheap, and style checking doesn't need deep reasoning. About $0.002 per review.
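Every agent funnels its reply through a `parseFindings` helper that I haven't shown yet. Here's a minimal sketch; it assumes the agents are prompted to emit one finding per line in the form `L<line>: [SEVERITY] message` (that line format, the `Finding` shape, and passing in the reply text such as `response.content[0].text` are my conventions, not anything the SDK requires):

```typescript
// Hypothetical parser for agent replies. Assumes one finding per line,
// formatted as "L<line>: [SEVERITY] message" with the severity optional.
interface Finding {
  line: number;
  message: string;
  severity?: string;
}

function parseFindings(text: string): Finding[] {
  const findings: Finding[] = [];
  for (const raw of text.split("\n")) {
    // Matches e.g. "L42: [HIGH] query built by string concatenation"
    const m = raw.match(/^L(\d+):\s*(?:\[(CRITICAL|HIGH|MEDIUM|LOW)\]\s*)?(.+)$/);
    if (!m) continue; // skips prose lines like "No style issues found."
    findings.push({ line: Number(m[1]), severity: m[2], message: m[3].trim() });
  }
  return findings;
}
```

The strict line format matters more than it looks: a free-form reply is impossible to deduplicate or group later, so each agent's system prompt should also pin down this output shape.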
## Agent 2: The logic reviewer
This is the one that actually needs to think. It reads the diff plus surrounding context (the full files being modified) and looks for logic bugs.
```typescript
import { promises as fs } from "node:fs";

const logicAgent = {
  model: "claude-sonnet-4-6-20250514",
  system: `You review code for logical correctness.

Focus on:
- Off-by-one errors
- Null/undefined not handled
- Race conditions in async code
- Missing error propagation
- State mutations in unexpected places

Ignore: style, formatting, naming.
For each issue, cite the exact line and explain
what could go wrong with a concrete example.`,

  reviewWithContext: async (diff: string, contextFiles: string[]) => {
    const fileContents = await Promise.all(
      contextFiles.map((f) => fs.readFile(f, "utf-8"))
    );
    const context = contextFiles
      .map((f, i) => `--- ${f} ---\n${fileContents[i]}`)
      .join("\n\n");

    const response = await anthropic.messages.create({
      model: logicAgent.model,
      max_tokens: 2048,
      system: logicAgent.system,
      messages: [
        {
          role: "user",
          content: `Context files:\n${context}\n\nDiff to review:\n${diff}`,
        },
      ],
    });
    return parseFindings(response);
  },
};
```
Sonnet for this one. It catches things that surprise me. Last week it flagged a race condition in a WebSocket handler where two messages could arrive between an async read and write. I'd have missed that.
## Agent 3: The security scanner
Pattern matching against common vulnerabilities. Not a replacement for a real security audit, but it catches the dumb stuff.
```typescript
const securityAgent = {
  model: "claude-sonnet-4-6-20250514",
  system: `You scan code diffs for security vulnerabilities.

Check for:
- SQL injection (string concatenation in queries)
- XSS (unescaped user input in HTML)
- Hardcoded secrets, API keys, passwords
- Path traversal (user input in file paths)
- Insecure deserialization
- Missing auth checks on new endpoints

Severity levels: CRITICAL, HIGH, MEDIUM, LOW.
Only report findings with a severity level.
No general advice.`,

  // The orchestrator below calls this, so the agent needs it too
  reviewDiff: async (diff: string) => {
    const response = await anthropic.messages.create({
      model: securityAgent.model,
      max_tokens: 1024,
      system: securityAgent.system,
      messages: [{ role: "user", content: `Scan this diff:\n${diff}` }],
    });
    return parseFindings(response);
  },
};
```
## The orchestrator
The orchestrator runs all three agents concurrently, deduplicates findings, and posts a single comment on the PR.
```typescript
async function runReview(diffFile: string, prNumber: number) {
  const diff = await fs.readFile(diffFile, "utf-8");
  const changedFiles = parseDiffFiles(diff);

  // Run all agents in parallel
  const [styleResults, logicResults, securityResults] = await Promise.all([
    styleAgent.reviewDiff(diff),
    logicAgent.reviewWithContext(diff, changedFiles),
    securityAgent.reviewDiff(diff),
  ]);

  const allFindings = [
    ...styleResults.map((f) => ({ ...f, category: "Style" })),
    ...logicResults.map((f) => ({ ...f, category: "Logic" })),
    ...securityResults.map((f) => ({ ...f, category: "Security" })),
  ];

  // Deduplicate findings on the same line
  const deduped = deduplicateByLine(allFindings);

  // Post as a single PR comment
  const comment = formatReviewComment(deduped);
  await postPRComment(prNumber, comment);

  // Fail the check if any finding is CRITICAL or HIGH severity
  const hasBlockers = deduped.some((f) =>
    ["CRITICAL", "HIGH"].includes(f.severity)
  );
  if (hasBlockers) process.exit(1);
}
```
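The orchestrator leans on `deduplicateByLine` and `formatReviewComment`, which I haven't shown. One plausible sketch follows; the `Finding` shape, the rule of preferring severity-tagged (security) findings when agents overlap, and the comment layout are my assumptions:

```typescript
interface Finding {
  file: string;
  line: number;
  message: string;
  category: string;
  severity?: string;
}

// Keep one finding per file:line. When two agents flag the same line,
// prefer the one that carries a severity (i.e. the security finding).
function deduplicateByLine(findings: Finding[]): Finding[] {
  const byLine = new Map<string, Finding>();
  for (const f of findings) {
    const key = `${f.file}:${f.line}`;
    const existing = byLine.get(key);
    if (!existing || (f.severity && !existing.severity)) {
      byLine.set(key, f);
    }
  }
  return [...byLine.values()];
}

// Render one markdown comment, grouped by category, so the PR gets a
// single summary instead of three separate bot comments.
function formatReviewComment(findings: Finding[]): string {
  if (findings.length === 0) return "AI review: no issues found.";
  const sections: string[] = ["## AI review findings"];
  for (const category of ["Security", "Logic", "Style"]) {
    const group = findings.filter((f) => f.category === category);
    if (group.length === 0) continue;
    sections.push(`### ${category}`);
    for (const f of group) {
      const tag = f.severity ? ` **[${f.severity}]**` : "";
      sections.push(`- \`${f.file}:${f.line}\`${tag} ${f.message}`);
    }
  }
  return sections.join("\n");
}
```

Security comes first in the comment on purpose: if a reviewer only reads the first section, it should be the one that can block the merge.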
## What I learned after running this for two months
The false positive rate was brutal at first. Around 40% of findings were noise. Two things fixed it:
1. **Negative examples.** I added a section to each agent's system prompt with examples of things NOT to flag. "Do not flag missing null checks on values that TypeScript's strict mode already guarantees." That cut false positives in half.
2. **Feedback loop.** When a reviewer dismisses an AI finding, I log it. Every two weeks I review the dismissed findings and update the prompts. The system gets better because it learns from what your team actually cares about.
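The post doesn't show what that feedback loop looks like in code, so here's a minimal sketch. It assumes dismissed findings are appended to a JSONL file; the file path, the `Dismissal` fields, and the category tally are all illustrative:

```typescript
import { appendFileSync, readFileSync } from "node:fs";

interface Dismissal {
  category: string; // "Style" | "Logic" | "Security"
  message: string;
  dismissedAt: string; // ISO date
}

// Append one dismissed finding per line: cheap to write from the bot,
// easy to grep later.
function logDismissal(path: string, d: Dismissal): void {
  appendFileSync(path, JSON.stringify(d) + "\n");
}

// Tally dismissals by category; run this every two weeks to see which
// agent's prompt needs tightening.
// Usage: tallyDismissals(readFileSync("dismissals.jsonl", "utf-8"))
function tallyDismissals(jsonl: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    const d: Dismissal = JSON.parse(line);
    counts[d.category] = (counts[d.category] ?? 0) + 1;
  }
  return counts;
}
```

A flat file is enough here because the log is only read by a human every two weeks; reach for a database only if you start automating the prompt updates.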
After tuning, false positives dropped to about 12%. The agents now catch 2-3 real issues per week that would have made it to production. Mostly null handling edge cases and missing auth middleware on new routes.
## Cost
| Agent | Model | Avg cost per review | Reviews/month | Monthly cost |
|---|---|---|---|---|
| Style | Haiku 4.5 | $0.002 | 120 | $0.24 |
| Logic | Sonnet 4.6 | $0.04 | 120 | $4.80 |
| Security | Sonnet 4.6 | $0.03 | 120 | $3.60 |
| **Total** | | $0.072 | 120 | ~$8.64 |
That's cheaper than one missed bug in production.
## The bottom line
Split your AI review into specialized agents instead of one big prompt. Give each agent a narrow job, concrete rules, and examples of what not to flag. Run them in parallel. Post one clean summary. Tune the prompts every two weeks based on dismissed findings.
The full source code is on my GitHub. Link in the comments.
I build developer tools at Glincker where we think a lot about how AI fits into real engineering workflows.