Bolting an LLM onto your pull requests is a weekend project. Building AI code review that your engineers don't disable within two weeks is the actual problem. The failure mode isn't missing bugs — it's crying wolf. Post twenty nitpicks and three hallucinations on someone's PR and they'll mute the bot forever. This is the pipeline we built on Mattrx to earn — and keep — that trust.
Mattrx is our multi-tenant marketing-analytics SaaS: ~95k lines of C#, 11 engineers, and enough pull requests that senior-reviewer time was the bottleneck. We tried the naive thing first — pipe the changed file into a model, post the output — and watched the team stop reading it in nine days.
TL;DR
| Dimension | Human-only / naive AI (before) | AI review pipeline (after) |
|---|---|---|
| Coverage | selective / whole-file dump | every PR, diff-focused |
| First-review latency | ~6 hours (wait for a human) | ~3 minutes (AI first pass) |
| Context | none / a naked file | diff + call sites + conventions |
| Reviewers | one mega-prompt | specialized dimensions, in parallel |
| False positives | ~35% (so it gets ignored) | ~6% (adversarially verified) |
| Merge control | human, or nothing | severity gate; human always decides |
| Governance | none | gateway: audit, cost, secret redaction |
- ~90 PRs/week across 11 engineers; the pipeline reviews 100%.
- First-pass review latency 6h → 3 min.
- False-positive rate ~35% → ~6% — the single number that decides whether the bot lives or dies.
- Escaped defects to production down ~40%; senior-reviewer time down ~30%.
- ~$0.05 per PR (cheap model for style, frontier only for correctness).
The one mental shift: AI code review is not about finding issues — models find plenty. It's about not crying wolf. The product is trust, and trust is a false-positive-rate problem. Verify before you comment; let the AI propose and the human dispose.
The naive approach — and why it collapses
// BEFORE: dump the whole changed file into one prompt, post whatever comes back.
foreach (var file in pr.ChangedFiles)
{
    var text = await File.ReadAllTextAsync(file.Path, ct);
    var review = await model.CompleteAsync($"Review this code and list problems:\n{text}", ct);
    await github.PostCommentAsync(pr, review); // a wall of unstructured, often-wrong text
}
It reviews the whole file, not the change. It has no project context, so it flags your conventions as bugs. No severity — a missing null-check and a stylistic preference arrive with equal weight. And no verification, so every hallucination goes straight to the developer. The result is a ~35% false-positive rate and a team that learns, correctly, to ignore the bot.
1. Context assembly — review the change, not the file
Build a review context: the diff (only what changed), the call sites of the symbols the change touches, and the project conventions for those files.
public async Task<ReviewContext> BuildAsync(PullRequest pr, CancellationToken ct)
{
    var diff = await git.GetDiffAsync(pr.BaseSha, pr.HeadSha, ct); // the change, nothing else
    var ctx = new ReviewContext { Diff = diff };
    foreach (var file in diff.ChangedFiles)
    {
        ctx.AddCallSites(await symbols.FindReferencesAsync(file.TouchedSymbols, ct)); // bugs hide at call sites
        ctx.AddConventions(conventions.ForPath(file.Path)); // your rules
    }
    return ctx; // diff + call sites + conventions, never a naked file
}
Most false positives are the model not knowing the rules of your codebase. Feed it the conventions and the call sites and it stops flagging your patterns and starts catching the bug two callers away.
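To make the conventions lookup concrete, here is a minimal sketch that assumes conventions are plain-text rules keyed by path prefix. Only the `ForPath` name comes from the snippet above; `ConventionStore`, its `Add` method, and the prefix-matching scheme are hypothetical and not the Mattrx implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: conventions stored as (path prefix, rule) pairs.
public sealed class ConventionStore
{
    private readonly List<(string Prefix, string Rule)> _rules = new();

    // An empty prefix means "applies to the whole repository".
    public ConventionStore Add(string pathPrefix, string rule)
    {
        _rules.Add((pathPrefix, rule));
        return this;
    }

    // Returns every rule whose prefix matches the file path, shortest prefix first,
    // so repo-wide rules come before folder-specific ones.
    public IReadOnlyList<string> ForPath(string path)
    {
        var normalized = path.Replace('\\', '/'); // tolerate Windows-style separators
        return _rules
            .Where(r => normalized.StartsWith(r.Prefix, StringComparison.OrdinalIgnoreCase))
            .OrderBy(r => r.Prefix.Length)
            .Select(r => r.Rule)
            .ToList();
    }
}
```

The point of the design is that the model only sees the rules relevant to the files in the diff, which keeps the prompt small and the guidance specific.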
2. Multi-dimensional reviewers, not one mega-prompt
Specialized reviewers — correctness, security, performance, tests — each with a narrow remit, run in parallel and return typed, structured findings:
public sealed record ReviewFinding(
    string Dimension,   // "correctness" | "security" | "performance" | "tests"
    string File, int Line,
    Severity Severity,  // Blocker | High | Medium | Low | Nit
    string Summary,     // one sentence
    string Rationale,   // why it's a defect, grounded in the diff
    string? SuggestedFix);
A "security reviewer" told to hunt injection and secret leakage outperforms a generalist told to "find problems," and its output is a typed record you can gate on — not a paragraph you have to parse.
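The parallel fan-out could look roughly like the sketch below. It assumes each reviewer is a delegate that takes the assembled context as a string and returns findings; `Finding` is a trimmed stand-in for `ReviewFinding`, and `Reviewer` and `ReviewFanOut` are hypothetical names. Swallowing a failed reviewer is one possible policy, not necessarily the one used on Mattrx.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Trimmed stand-in for the article's ReviewFinding, to keep the sketch self-contained.
public sealed record Finding(string Dimension, string File, int Line, string Summary);

// Hypothetical shape of one specialized reviewer (correctness, security, ...).
public delegate Task<IReadOnlyList<Finding>> Reviewer(string context, CancellationToken ct);

public static class ReviewFanOut
{
    // Starts every reviewer, awaits them together, and merges the findings.
    public static async Task<IReadOnlyList<Finding>> RunAsync(
        IEnumerable<Reviewer> reviewers, string context, CancellationToken ct)
    {
        var tasks = reviewers.Select(r => SafeReviewAsync(r, context, ct)).ToList();
        var results = await Task.WhenAll(tasks);
        return results.SelectMany(r => r).ToList();
    }

    // A reviewer that throws contributes no findings instead of failing the whole review.
    private static async Task<IReadOnlyList<Finding>> SafeReviewAsync(
        Reviewer reviewer, string context, CancellationToken ct)
    {
        try
        {
            return await reviewer(context, ct);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            return Array.Empty<Finding>();
        }
    }
}
```

With this shape, review latency is bounded by the slowest reviewer rather than the sum of all of them.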
3. Adversarial verification — the feature that earns trust
Before any finding is posted, a separate model is prompted to refute it. When the verifier is uncertain, it defaults to "not real" and the finding is dropped.
public async Task<bool> IsRealAsync(ReviewFinding f, ReviewContext ctx, CancellationToken ct)
{
    var verdict = await gateway.EvaluateAsync(new EvalRequest
    {
        Feature = "code-review-verify",
        Prompt =
            $"A reviewer claims: \"{f.Summary}\". Using the diff and the call sites, decide " +
            "whether this is a REAL defect that would bite in production. Actively try to " +
            "refute it. If it depends on facts not present in the context, treat it as NOT real.",
        Context = ctx.ForFinding(f),
    }, ct);
    return verdict.IsReal && verdict.Confidence >= 0.90; // post only if a skeptic couldn't refute it
}
This asymmetry is the whole game. Precision matters far more than recall for an AI reviewer, because the cost of a false positive is the tool itself getting muted. A skeptical second pass is the cheapest precision you'll ever buy — it's what took us to ~6% FP and kept the bot alive.
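A minimal sketch of how that skeptical pass might sit between the reviewers and the comment poster. `Verdict` mirrors the two fields used above (`IsReal`, `Confidence`); `VerificationFilter` and its method names are hypothetical. The sketch assumes that a verifier error should count as "not verified", which matches the precision-first stance but is an assumption about policy.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Stand-in for the verifier's response; field names follow the snippet above.
public sealed record Verdict(bool IsReal, double Confidence);

public static class VerificationFilter
{
    // Verifies all findings concurrently and keeps only those the skeptic could not refute.
    public static async Task<IReadOnlyList<T>> KeepVerifiedAsync<T>(
        IEnumerable<T> findings,
        Func<T, CancellationToken, Task<Verdict>> verify,
        double minConfidence,
        CancellationToken ct)
    {
        var checks = findings.Select(f => CheckAsync(f, verify, minConfidence, ct));
        var results = await Task.WhenAll(checks);
        return results.Where(r => r.Keep).Select(r => r.Finding).ToList();
    }

    private static async Task<(T Finding, bool Keep)> CheckAsync<T>(
        T finding,
        Func<T, CancellationToken, Task<Verdict>> verify,
        double minConfidence,
        CancellationToken ct)
    {
        try
        {
            var verdict = await verify(finding, ct);
            return (finding, verdict.IsReal && verdict.Confidence >= minConfidence);
        }
        catch (Exception ex) when (ex is not OperationCanceledException)
        {
            // Assumption: if the verifier fails, stay silent rather than post an unverified claim.
            return (finding, false);
        }
    }
}
```

Every failure path resolves to "do not post", so the only way a comment reaches a developer is a confident positive verdict.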
4. Severity gating — a human on the button
The AI proposes; the human disposes. Only blocker/high findings request changes; everything else is a non-blocking comment, and a human can always override.
public MergeAdvice Gate(IReadOnlyList<ReviewFinding> findings)
{
    var blocking = findings.Where(f => f.Severity is Severity.Blocker or Severity.High).ToList();
    return blocking.Count == 0
        ? MergeAdvice.Comment(findings)                    // post comments, do not block
        : MergeAdvice.RequestChanges(blocking, findings);  // request changes; human may override
}
An AI that can unilaterally block merges will, the first time it's confidently wrong, get switched off — taking its real value with it. Advisory-by-default with human override is what makes it safe to leave on.
5. Governance — run it through the gateway
Every review call goes through the same governed AI gateway: per-repo token budgets, model routing (cheap model for style, frontier for correctness), secret redaction before code leaves the boundary, and an append-only audit log. Code is one of your most sensitive assets. If your AI reviewer isn't redacting secrets, capping spend, and logging what it saw, you've traded a review bottleneck for a data-governance incident.
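As an illustration of the redaction step only, here is a small regex-based sketch. `SecretRedactor` and its three patterns are hypothetical and deliberately incomplete; a real gateway would more likely lean on a dedicated secret scanner with a much larger rule set. The AWS access key ID prefix (`AKIA` plus 16 characters) is the publicly documented format; the other two patterns are simple heuristics.

```csharp
using System.Text.RegularExpressions;

// Hypothetical sketch: scrub obvious secrets from text before it is sent to a model.
public static class SecretRedactor
{
    private static readonly (Regex Pattern, string Replacement)[] Rules =
    {
        // AWS access key IDs.
        (new Regex(@"AKIA[0-9A-Z]{16}", RegexOptions.Compiled),
            "[REDACTED_AWS_KEY_ID]"),
        // PEM-encoded private key blocks, including the key material between the markers.
        (new Regex(@"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----",
            RegexOptions.Compiled),
            "[REDACTED_PRIVATE_KEY]"),
        // Quoted string values assigned to names ending in password/secret/apiKey/token.
        (new Regex(@"(?i)(password|secret|api[_-]?key|token)\b(\s*[:=]\s*)""[^""]+""",
            RegexOptions.Compiled),
            "$1$2\"[REDACTED]\""),
    };

    // Applies every rule in order and returns the scrubbed text.
    public static string Redact(string text)
    {
        foreach (var (pattern, replacement) in Rules)
        {
            text = pattern.Replace(text, replacement);
        }
        return text;
    }
}
```

Running this inside the gateway, rather than in each reviewer, means no call path can skip it.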
6. The feedback loop
Developers thumbs-up/down every comment; dimensions with poor precision get stricter verification thresholds, and conventions that keep getting mis-flagged get added to the context. That loop is why precision stays high after launch instead of drifting.
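One way the threshold adjustment could work, as a sketch: track votes per dimension and switch a dimension to a stricter verifier confidence once its measured precision falls below a target. `FeedbackTracker` is hypothetical; the 0.90 base threshold echoes the verifier snippet above, while the target precision, strict threshold, and minimum vote count are illustrative assumptions.

```csharp
using System.Collections.Generic;

// Hypothetical sketch: per-dimension vote tracking that drives verification strictness.
public sealed class FeedbackTracker
{
    private readonly Dictionary<string, (int Up, int Down)> _votes = new();

    public void Record(string dimension, bool thumbsUp)
    {
        _votes.TryGetValue(dimension, out var v); // v is (0, 0) when the dimension is new
        _votes[dimension] = thumbsUp ? (v.Up + 1, v.Down) : (v.Up, v.Down + 1);
    }

    // Share of rated comments marked useful; null until enough votes have accumulated.
    public double? Precision(string dimension, int minVotes = 20)
    {
        if (!_votes.TryGetValue(dimension, out var v) || v.Up + v.Down < minVotes)
        {
            return null;
        }
        return (double)v.Up / (v.Up + v.Down);
    }

    // Dimensions with measured precision below the target get the stricter threshold.
    public double VerificationThreshold(
        string dimension,
        double baseThreshold = 0.90,
        double targetPrecision = 0.90,
        double strictThreshold = 0.97)
    {
        var precision = Precision(dimension);
        return precision is double p && p < targetPrecision ? strictThreshold : baseThreshold;
    }
}
```

The minimum vote count keeps a couple of early thumbs-down from tightening a dimension before there is enough signal.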
The honest stuff: when NOT to build this
- Small team / low PR volume. If a human reviews everything within the hour, the overhead isn't worth it.
- You haven't measured false positives. Ship a noisy bot and you train your team to ignore it permanently. Pilot first, measure the false-positive rate, and roll out only once it is under ~10%.
- You'd let the AI block merges alone. Don't. AI proposes, humans dispose.
- Proprietary/regulated code that can't leave your boundary. Self-host or redact aggressively.
- You think it replaces reviewers. It's an assistant — architecture and design stay human.
- You're using it for style. A linter does style deterministically, instantly, and for free. Aim the AI at logic and security.
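The "measure before you roll out" check from the list above can be as simple as the sketch below, assuming each pilot comment has been hand-labeled as a false positive or not. `PilotGate` is a hypothetical name, the 10% ceiling comes from the list, and the minimum sample size is an illustrative assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch: decide whether a pilot is clean enough for a wider rollout.
public static class PilotGate
{
    // Share of posted comments that reviewers labeled as wrong or not actionable.
    public static double FalsePositiveRate(IReadOnlyCollection<bool> wasFalsePositive)
    {
        if (wasFalsePositive.Count == 0)
        {
            throw new ArgumentException("No labeled comments to measure.", nameof(wasFalsePositive));
        }
        return (double)wasFalsePositive.Count(isFalsePositive => isFalsePositive) / wasFalsePositive.Count;
    }

    // Requires enough labeled comments and a rate under the ceiling before rollout.
    public static bool ReadyToRollOut(
        IReadOnlyCollection<bool> wasFalsePositive, double maxRate = 0.10, int minSamples = 100)
    {
        return wasFalsePositive.Count >= minSamples
            && FalsePositiveRate(wasFalsePositive) < maxRate;
    }
}
```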
The model to carry forward
An AI reviewer's job is to delete the noise so humans review what matters. The models can find issues all day; the engineering is in not crying wolf. Optimize for precision over recall, verify before you comment, and keep the human on the merge button. Get the false-positive rate low enough and the tool becomes something your team relies on; get it wrong and they'll mute it in nine days — we timed it.
Originally published on PrepStack. Rolling out AI code review and fighting the false-positive problem? Reach me at randhir.jassal[at]gmail.com.