DEV Community: Lewis

State of AI Code Review | April 2026 Recap

Lewis — Sat, 02 May 2026 17:19:57 +0000

Welcome to the first monthly roundup of the AI code review space.

April was the month that agentic code review tools stopped being a "nice to have."

A wave of high-profile software breaches hit security teams in rapid succession: Vercel, Vimeo, ADT, Medtronic, Autovista, TrueConf, were all hit, plus a fresh npm supply-chain attack on April 30.

Customer data, source code, internal credentials, and live infrastructure all spilled out in droves.

So what's driving the uptick?

New data this month points to one contributing culprit: the firehose of AI-written code flooding codebases faster than anyone can review it.

ProjectDiscovery's 2026 AI Coding Impact Report, published April 22, surveyed 200 cybersecurity practitioners.

100% reported increased engineering delivery in the past year, with 49% attributing most or all of it to AI-assisted coding.

Two-thirds spend more than half their time manually validating findings rather than fixing the underlying issues.

A separate Sherlock Forensics audit of 50 AI-built applications found 92% had critical-severity vulnerabilities.

The bottleneck has moved downstream, and manual verification doesn't scale to agentic-volume PR throughput.

That tailwind is what the major AI code review tools spent April competing into. CodeRabbit, Cursor Bugbot, and Macroscope all shipped or extended autonomous-fix capabilities. The product category is shifting from "find the bug" to "fix it without me asking."

But the same throughput that creates the demand also created the month's pricing fights.

Greptile's March price change ($30/seat for 50 reviews, then $1/review) sustained backlash through April, with named founders posting bills jumping from $30 to $500+ as their AI agents push 500+ PRs/month.

Below, a breakdown of the moves made by the six big code review players.

MACROSCOPE

1/ Agent.

Like other dedicated AI code reviewers, Macroscope identifies bugs in the background. You open a PR, it analyzes the diff against your wider codebase, and leaves review comments on GitHub with any issues it detects.

But because Macroscope already understands your codebase, it can do more than review PRs. It can use the codebase to answer questions about how your product works, ship fixes, or give leaders weekly status updates on what their team shipped and how the product is evolving.

Macroscope's "Agent" allows you to tap into these capabilities on demand — from Slack, GitHub, or via API/webhooks.

Agent also connects to other tools around your codebase — Slack, Jira, Sentry, PostHog, BigQuery, LaunchDarkly, and any MCP server — so you've got a codebase-aware agent that can ingest data and take actions across your entire stack.

This unlocks use cases like:

End-to-end fixes — "This Sentry error is spiking — find the cause, open a PR to fix it, and file a Jira ticket for Eliza to QA."
Cross-stack investigations — "Why did signups drop yesterday? Check Sentry, PostHog, and recent deploys."
Reporting for leaders — "Summarize what the platform team shipped this week and post it in #eng-updates every Friday."
Onboarding & docs — "Walk me through how billing works in this repo. Write an internal doc in Notion for new hires, and publish a customer-facing version to Mintlify."

2/ Agent pricing.

Previously bundled, Agent has its own usage-based pricing.

You pay the raw LLM cost + 5% markup.
1,000 free Agent credits per workspace (~$10 of usage), every month. Most light users won't pay anything extra.
Transparent. Admins can see per-run costs and export the full usage log as CSV from Settings → Billing.
Live for new signups now, rolled out to existing customers on April 27.

GREPTILE

1/ Brand refresh.

Greptile updated their visual identity this month — new logo, new illustration style.

2/ Fix in Devin.

Greptile review comments now have a Fix in Devin button. Click it, and the issue (file paths, line numbers, suggested code) gets handed to a Devin session that opens a PR with the fix.

CODERABBIT

1/ Agent for Slack (Early Access).

CodeRabbit joins Macroscope in extending their agent to Slack. You can investigate code, generate plans, and open PRs from any thread.

Connects to Jira, Linear, Notion, Sentry, Datadog, PagerDuty, Figma, Google Drive, AWS, GCP, and any MCP server.

Billed separately from CodeRabbit's review plans (in agent minutes, not seats).

2/ PR Usage-Based Add-On.

A pay-as-you-go toggle so teams that hit their plan's PR limit can keep shipping. Only over-limit reviews get charged.

3/ Autofix on GitLab.

Autofix (which commits AI-generated fixes to your branch or a stacked PR) is now available on GitLab, not just GitHub.

4/ CLI improvements.

v0.4.0 added browser sign-in and full — agent flag support for Codex, Cursor, Claude Code, OpenCode, etc. v0.4.1 added coderabbit stats.

CURSOR BUGBOT

1/ Learned rules.

Bugbot now tries to learn from PR feedback — downvotes on its comments, replies explaining what was wrong, and issues human reviewers flag that it missed.

It also tries to promote useful rules automatically and disable ones that stop working.

2/ MCP support.

Bugbot can now access MCP servers for additional context during reviews. Teams and Enterprise plans only.

3/ Fix All.

A new action that applies multiple Bugbot fixes in one go, instead of one-by-one.

CLAUDE CODE CODE REVIEW

1/ /ultrareview.

A new slash command that runs a deeper code review than the standard /review in the cloud.

It launches a fleet of agents in parallel that each verify findings before surfacing them. /ultrareview is designed to be better at handling complex codebases.

You trigger it manually (it doesn't run automatically), and it works on either your current branch diff (pre-PR) or a GitHub PR.

Reviews run in the background and take 5–20 minutes.

It's billed separately from regular Claude Code usage — $5–$20 per review depending on diff size. For context, this is still an order of magnitude more expensive than the dedicated usage-based code review tools on this list.

Pro and Max users get three free runs to try it (expiring May 5).

GITHUB COPILOT CODE REVIEW

No new features this month — just billing and admin changes.

1/ Billing change incoming (June 1).

Each Copilot code review on a private repo will start consuming GitHub Actions minutes on top of the AI Credits it already costs.

Public repos unaffected.

2/ Usage metrics.

Admins can now see active vs. passive Code Review users in the Copilot usage metrics, and those counts are now aggregated in the usage metrics API.

The State Of AI Code Review in April 2026

April in a nutshell: Detection is becoming table stakes, autonomous fixing is the new battleground, and per-seat pricing is starting to crack under agentic PR volume.

See you at the start of June with May's moves.

Coding agents broke code review. Two Claude Code skills help me fight back.

Lewis — Mon, 27 Apr 2026 20:10:40 +0000

Last week, I had an idea for a new app.

Forty eight hours later, Opus 4.7 and I submitted its v1 to the App Store.

Of course, I didn’t write a single line of code myself—a practice increasingly adopted by the world’s most respected engineers:

There's a tradeoff to shipping at the speed of English like this though. More bugs make it into production.

Earlier this year, Concordia researchers tracked 200,000 code units across 201 projects and found AI code gets bug-fixed at a higher rate than human code.

Amazon recently pulled engineers into a “deep dive” on a string of AWS outages, citing “novel GenAI usage for which best practices and safeguards are not yet fully established.”

Then a few weeks ago we saw Axios, Mercor, Railway, the Argentinian government, and more, suffer security breaches all within the span of a few days—how many were caused by rubber-stamped AI code?

Coding agents shift the bottleneck from writing code to reviewing it, an increasingly overwhelming task given the speed at which code can be generated.

So how are engineers responding to the PR-firehose? A growing number are simply dropping the ball on review. A new NAIST study of 1,664 agentic pull requests found 75% pass through with zero revisions requested. Three out of four AI-generated PRs ship without a single human change (despite them containing more bugs!).

Like those cited above, I too have been guilty of sacrificing thoroughness for speed. And like them, I too have paid a price.

I found that while I can get new app updates into the hands of users faster, time is wasted when they’re giving me feedback on bugs, and features that don’t work as they’re meant to.

I was losing much of the time I’d saved and burning precious feedback cycles.

Then two Claude Code skills leveled the playing field.

How to catch real bugs in in AI-generated code in six keystrokes

These two skills work in tandem with Macroscope, the AI code review tool.

They are designed to give Claude seamless integration with a dedicated code reviewer, so you can get frontier-quality bug detection without leaving the Claude Code interface, and without breaking the bank on Claude’s Code Review product.

Macroscope was the first major code review tool to ditch seat-based pricing in favour of usage-based—$0.05 per KB reviewed, and most reviews come in under a dollar. Claude Code reviews meanwhile clock in an order of magnitude higher at $15–25 a piece, for what I find to be worse performance.

I’ll start by showing you what the workflow looks like, then I’ll show you how to set it up.

Every time you submit a new PR to GitHub, Macroscope automatically reviews it. It almost always comes back with a few real bugs.

Once I get an alert that Macroscope has finished its review, I run the first skill /review-pr-comments in Claude Code.

Claude reads Macroscope's comments in GitHub, reviews the actual code each one is pointing at, and gives me a verdict on which are believed to be valid or invalid, with the evidence laid out.

From there, you can quiz Claude for further details. Or if you disagree with a verdict, just say so, and the next skill respects your override.

When I’m happy, I run the second skill /resolve-pr-comments.

Claude rejects the invalid GitHub comments with a brief explanation and resolves those threads.

Then it fixes the valid issues in your codebase one at a time.

Each fix gets a Fixed. reply in GitHub and a resolved thread.

Claude + Macroscope hitting it off ❤

When everything’s done, Claude re-reads all the changes together to make sure the fixes don’t conflict, runs the tests one final time, and reports back.

I commit, push, and Macroscope reviews the new commit. I repeat the loop until the checks pass.

The outcome

✅ Fewer bugs make it to production.

✅ Less time wasted with users.

✅ Less time asking Claude to track down the cause of issues days after you wrote the code.

And all in six keystrokes.

Why two code review skills are better than one

The first review skill doesn’t touch your PR.

That’s because an agent that judges and acts in the same step is more likely to begin implementing without due consideration.

The split forces a moment of judgment between analysis and action.

Setting up Claude Code with Macroscope

You’ll need three things before you start: Claude Code, a GitHub account, and the GitHub CLI.

(If you don’t have the CLI, just ask Claude to install it)

Step 1: Install Macroscope on your repo

Head to macroscope.com and click Sign up.

Authorize the GitHub OAuth flow.

You’ll land on a screen telling you Macroscope isn’t installed on any of your orgs yet. Click Install Macroscope.

Pick the repo (or repos) you want Macroscope to review. I went with “Only select repositories” and chose the one I’m shipping.

Then just go through the Stripe checkout to claim your free trial and you’re done.

Macroscope sets up your workspace, drops $100 of free credit into your account, and you land on the dashboard.

From this point on, every PR you push to that repo gets reviewed automatically.

Step 2: Push a PR and wait for the review

Code up a storm in Claude Code as normal and submit a new PR when you're ready.

Macroscope will review it and either pass it or leave comments. When the review is done, you’ll get a notification.

That’s your cue to bring in the skills.

Step 3: Install the two skills in Claude Code

Copy the first skill below, paste it into Claude Code, and ask Claude to install it as a skill.

Do the same for the second.

Then start a new Claude Code session so the new skills load.

Skill 1: review-pr-comments

---
name: review-pr-comments
description: Investigates unresolved PR review comments on the current branch's PR and presents findings for user review. Read-only — never modifies the PR, never resolves threads, never changes code.
---

# review-pr-comments

Investigate every unresolved review comment on the current branch's PR, classify each as **Valid** or **Invalid** with evidence, and present findings for the user to review. This skill is strictly read-only: it never posts, resolves, edits, or pushes anything.

## Guiding principle

**Accuracy matters more than speed.** A wrong dismissal ships a bug; a wrong acceptance wastes developer time. When in doubt, mark **Valid**.

- **Valid** is the default verdict.
- **Invalid** requires concrete, citable evidence — specific file paths and line numbers proving the comment wrong.
- If you cannot conclusively prove a comment wrong, it stays **Valid**.

## Procedure

Follow these steps in order on every invocation. Output nothing to the user until the final summary in step 7.

### Step 1 — Detect the project stack (runtime, every invocation)

Before touching the PR, build a mental model of this repo. Do not assume anything from prior sessions. Findings live only in working context — never write them into this skill file.

1. **Identify primary language(s)** by listing top-level files and checking for manifests:
- `package.json` (JavaScript/TypeScript)
- `Cargo.toml` (Rust)
- `pyproject.toml`, `setup.py`, `requirements.txt` (Python)
- `go.mod` (Go)
- `Package.swift`, `*.xcodeproj`, `*.xcworkspace` (Swift)
- `Gemfile` (Ruby)
- `pom.xml`, `build.gradle`, `build.gradle.kts` (Java/Kotlin)
- `composer.json` (PHP)
- `mix.exs` (Elixir)
- `pubspec.yaml` (Dart/Flutter)
- `*.csproj`, `*.sln` (C#/.NET)
- Any other manifest you recognize — don't stop at this list.
2. **Identify the test framework and test-file naming conventions**:
- Read test-related scripts in the manifest (e.g. `scripts`, `[tool.pytest]`, `[dev-dependencies]`).
- Look for test directories (`test/`, `tests/`, `__tests__/`, `spec/`) and note the file-name pattern (`*_test.go`, `*.test.ts`, `test_*.py`, `*Spec.swift`, etc.).
- Open one or two existing test files to confirm the actual framework in use.
3. **Locate dependency manifests and their local caches**. Examples:
- `node_modules/`, `vendor/`, `.venv/`, `target/`, `Pods/`, `.gradle/`, `~/.cargo/`.
Knowing where third-party code lives lets you read library source when a comment depends on external behavior.
4. **Skim a handful of source files** (3–6) from the main source tree to pick up conventions: naming, error handling, import style, module layout. You will reference these conventions when investigating comments.

Keep all of the above in working memory for this run only.

### Step 2 — Fetch PR context

1. Confirm the current branch: `git rev-parse --abbrev-ref HEAD`.
2. Find the PR for the current branch: `gh pr view --json number,url,headRefName,baseRefName,title,author,state`.
- If no PR exists for this branch, stop and tell the user.
3. List commits on the PR so you can later check whether a fix has already landed: `gh pr view --json commits`.
4. Fetch **all review threads** with resolution state via the GraphQL API. Example:

```bash
gh api graphql -f query='
query($owner:String!, $repo:String!, $num:Int!) {
repository(owner:$owner, name:$repo) {
pullRequest(number:$num) {
reviewThreads(first:100) {
nodes {
id
isResolved
isOutdated
path
line
originalLine
comments(first:50) {
nodes {
id
author { login }
body
createdAt
diffHunk
commit { oid }
originalCommit { oid }
}
}
}
}
}
}
}' -F owner= -F repo= -F num=
```

5. **Filter to threads where `isResolved` is `false`.** Ignore resolved threads entirely.
6. If pagination is needed (more than 100 threads or 50 comments in a thread), page through it — do not silently truncate.

### Step 3 — Investigate each unresolved comment

For each unresolved thread, do not rely on the `diffHunk` alone. Read real code.

1. **Read the full file** the comment points at, at the current tip of the PR branch — not just the hunk. The reviewer's claim may depend on code elsewhere in the file.
2. **Trace code paths the claim depends on.** If the comment is about a function, follow the callers, the callees, and any types involved until you can confirm or rule out the claim. Cross-file reads are expected.
3. **Check whether a later commit has already addressed it.** Compare the commit the comment was written against (`originalCommit.oid`) with the PR's latest commit. If the file at the latest commit no longer has the issue, note that as evidence.
4. **Use the stack knowledge from Step 1** when the comment references language behavior, framework idioms, or test patterns. Verify claims against the actual framework in use, not a generic assumption.
5. If the comment references external library behavior, read the library source from the local cache identified in Step 1 rather than guessing.

### Step 4 — Consider both directions before deciding

For every comment, explicitly hold both hypotheses before concluding:

- **What evidence supports the comment being correct?** List it.
- **What evidence supports the comment being incorrect?** List it.

Do not stop at the first piece of evidence you find in either direction. A single matching or mismatching line is not enough — look for confirming and disconfirming evidence on both sides.

Only after weighing both sides do you choose a verdict.

### Step 5 — Classify

- **Valid** (default): the comment appears correct, or you cannot prove it wrong with concrete evidence.
- **Invalid**: you have concrete, citable evidence — specific file paths and line numbers — that the comment is wrong, already fixed on this PR, or based on a misreading of the code.

If the evidence is ambiguous, the verdict is **Valid**. Do not downgrade a valid concern to invalid because it seems minor.

### Step 6 — Re-read before finalizing

For each verdict, do one final pass:

1. Re-read the original comment verbatim.
2. Re-read the code you cited.
3. Confirm you investigated **the exact claim the reviewer made**, not a neighboring or paraphrased version. If you find you investigated the wrong thing, redo Step 3 for that comment.

### Step 7 — Present findings

Output only the final summary. Do not narrate the investigation.

Use a single numbering sequence across both sections: list all **Valid** items first, then continue the same count into **Invalid**. This way an override like "actually #3 is valid" unambiguously identifies one item.

Format:

```markdown
## Valid (investigate further / likely to act on)

1. **:** — @
>
Evidence:

2. ...

## Invalid (recommend dismissing)

3. **:** — @
>
Evidence:

4. ...
```

End with exactly one question: **"Would you like to proceed, or adjust any verdicts?"**

Do not suggest fixes, do not draft replies, do not take any action beyond presenting the list.

## Hard constraints

- **Read-only.** Never run `gh pr review`, `gh pr comment`, `gh api ... -X POST/PATCH/DELETE`, `git commit`, `git push`, or any mutating command.
- **Never resolve threads.** Do not call `resolveReviewThread` or any equivalent.
- **Never modify files** in the working tree.
- **Never write stack-detection findings back into this skill file.** They are per-run context only.
- **Do not output intermediate progress.** The only thing the user sees is the final summary in Step 7.

Skill 2: resolve-pr-comments

---
name: resolve-pr-comments
description: Acts on a prior triage produced by review-pr-comments — rejects items marked invalid (replies + resolves threads) and fixes items marked valid (edits code, replies, resolves). Refuses to run without a prior triage in the current conversation. Does not commit or push.
---

# resolve-pr-comments

Execute the triage produced earlier in this conversation by `review-pr-comments`: reject the comments marked invalid, fix the comments marked valid, and leave the branch ready for the user to commit and push.

## Hard preconditions

- **A triage from `review-pr-comments` must already exist in this conversation.** If it does not, stop immediately and respond with:
> No triage found. Run `review-pr-comments` first, review the results, then invoke this skill.
Do not attempt to re-investigate the comments from scratch.
- **Never commit, never push, never force-push.** The user handles version control after reviewing the result.

## Procedure

Follow these steps in order on every invocation.

### Step 1 — Detect the project stack (runtime, every invocation)

Build a mental model of the repo. Do not carry assumptions from prior sessions. These findings live only in working context — never write them back into this skill file.

1. **Identify primary language(s)** from top-level manifest files. Check for (non-exhaustive): `package.json`, `Cargo.toml`, `pyproject.toml` / `setup.py` / `requirements.txt`, `go.mod`, `Package.swift` / `*.xcodeproj`, `Gemfile`, `pom.xml` / `build.gradle` / `build.gradle.kts`, `composer.json`, `mix.exs`, `pubspec.yaml`, `*.csproj` / `*.sln`. Treat any unfamiliar manifest you encounter as a signal too.
2. **Identify the test runner command.** Look in order at:
- Scripts in the primary manifest (e.g. `scripts.test`, `[tool.*]` sections, `[dev-dependencies]`)
- `Makefile`, `justfile`, `Taskfile`, `Rakefile`, or similar task definitions
- `README` / `CONTRIBUTING` for documented commands
- Language defaults (`npm test`, `pytest`, `go test ./...`, `cargo test`, `swift test`, `bundle exec rspec`, etc.) only if the above give nothing
3. **Identify the test framework and test-file naming convention** by opening one or two existing test files. Note path patterns (e.g. sibling to source vs. a dedicated tree) and file-name patterns.
4. **Skim 3–6 source files** from the main source tree to absorb conventions: naming, error handling, import/module style, logging, typing, formatting.
5. **Note existing utility/helper modules.** Record the names and purposes of shared helpers (validation, HTTP, logging, date handling, etc.) so fixes can reuse them rather than reinventing.

Keep all of the above in working memory for this run only.

### Step 2 — Apply user overrides to the triage

Read everything the user said in this conversation after the triage was presented.

- If they reclassified any numbered item (e.g. "actually #3 is valid", "mark #5 invalid"), update the verdict for that item.
- If they said to skip or defer any item, flag it as **skip**.
- If they gave rationale for an override, keep that rationale — you will use it when writing the reply on the thread.
- **User overrides always win** over the triage's original verdicts.

If the user's instructions are ambiguous (e.g. "actually the last one"), stop and ask them to clarify before taking any action.

### Step 3 — Re-fetch review threads and match to triage items

The triage numbering may no longer correspond to anything on GitHub if threads shifted. Re-establish IDs.

1. Get the PR number for the current branch: `gh pr view --json number,url,headRefName`.
2. Fetch all review threads via GraphQL, including thread IDs and comment IDs:

```bash
gh api graphql -f query='
query($owner:String!, $repo:String!, $num:Int!) {
repository(owner:$owner, name:$repo) {
pullRequest(number:$num) {
reviewThreads(first:100) {
nodes {
id
isResolved
path
line
originalLine
comments(first:50) {
nodes { id author { login } body }
}
}
}
}
}
}' -F owner= -F repo= -F num=
```

3. Page through results if needed. Do not silently truncate.
4. For each triage item, match to a thread by **file path + line** (falling back to `originalLine`) and by the first reviewer comment body. If a triage item cannot be matched to any unresolved thread, stop and ask the user how to proceed.
5. Ignore threads already resolved.

### Step 4 — Phase 1: reject invalid items

For each item currently classified **invalid** (after user overrides), do two things — and run them in parallel where the tool layer allows, since they are independent across threads:

1. **Post a reply** on the thread explaining the reason. Keep it brief and cite the evidence from the triage. Example shape:
> . See `:` — .
Use `gh api` to POST a review comment reply:

```bash
gh api -X POST \
repos///pulls//comments//replies \
-f body=''
```

(Reply to the **first comment id** on the thread.)
2. **Resolve the thread** via the `resolveReviewThread` GraphQL mutation:

```bash
gh api graphql -f query='
mutation($id:ID!) {
resolveReviewThread(input:{threadId:$id}) { thread { id isResolved } }
}' -f id=
```

Do **not** touch code in this phase.

### Step 5 — Phase 2: fix valid items, one at a time

Handle each **valid** item sequentially. Do not batch. For each item, complete every sub-step before moving to the next item.

1. **Re-read the file and surrounding code.** Do not rely only on the diff hunk or the triage evidence. Understand the patterns around the change site.
2. **Make the fix**, matching the conventions you detected in Step 1:
- Same naming style, import style, error handling idioms, and module layout
- Reuse helpers identified in Step 1 rather than duplicating logic
- Don't introduce a new dependency unless the comment specifically asks for one
3. **Verify the fix locally before moving on**:
- Re-read the changed region and its callers. Confirm the fix addresses the exact claim from the original comment, not an adjacent one.
- If tests exist that exercise the changed code, run them with the test runner from Step 1. If a test fails, investigate and adjust — don't move on with a red test.
4. **Reply and resolve the thread**:
- Post `Fixed.` as a reply to the thread (same reply endpoint as Step 4).
- Resolve the thread via `resolveReviewThread`.
5. **Tell the user** what changed — one line naming the file(s) touched and a short description.

If a valid item turns out to be too complex, ambiguous, or to require context you don't have, **do not guess**. Treat it as a skip (see Step 7).

### Step 6 — Final review pass

After every planned fix is done, do a holistic review before reporting.

1. Re-read every file you changed, together. Confirm:
- Each fix follows the codebase's existing patterns from Step 1.
- Each fix uses existing helpers instead of duplicating logic.
- Fixes don't conflict with, shadow, or duplicate each other.
- No unrelated behavior was changed, no new TODOs left behind, no debug output left in.
2. **Run the full test suite once**, using the runner from Step 1. Per-fix runs from Step 5 don't reflect interactions between fixes.
3. If anything fails or looks wrong, fix it before reporting. If a failure is outside the scope of this PR's review comments, stop and tell the user rather than silently papering over it.

### Step 7 — Handle skipped items

For any item the user asked to skip, or that you determined you can't responsibly handle:

1. Reply on the thread: `Acknowledged — not addressing in this PR.`
2. Resolve the thread.
3. Include it in the skipped count in the final report.

### Step 8 — Report

Output exactly one line:

```plaintext
Done — N invalid rejected, M valid fixed and resolved.
```

If any items were skipped, append:

```plaintext
K skipped and acknowledged.
```

Do not commit. Do not push. Do not open a new PR. The user takes it from here.

## Hard constraints

- **Precondition-gated.** No triage in the conversation ⇒ refuse and direct the user to `review-pr-comments`.
- **User overrides always win** over the original triage verdicts.
- **Never commit, stage, push, force-push, or amend.**
- **Never write stack-detection findings back into this skill file.** They are per-run context only.
- **Sequential fixes only.** Valid items are handled one at a time, with local verification between them. Only the reject operations in Phase 1 may run in parallel.
- **No silent truncation** when paging through threads or comments.
- **No guessing.** If a fix is unclear or an override is ambiguous, ask or skip — don't fabricate.

That’s it. You’re set up.

Step 4: Run the loop

Once you’ve got a notification saying Macroscope’s review is complete:

Run /review-pr-comments.
Push back if you disagree.
Run /resolve-pr-comments.
Commit, push, repeat.

That’s the whole workflow. Write code. Push. Triage. Respond. Repeat until clean. It's a fast and low-friction defensive layer to catch bugs before they make it into production.

Enjoy!

References

Rahman, M. & Shihab, E. (2026). Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source. EASE 2026. https://arxiv.org/abs/2601.16809

Watanabe, K., Shirai, T., Kashiwa, Y., & Iida, H. (2026). What to Cut? Predicting Unnecessary Methods in Agentic Code Generation. MSR 2026. https://arxiv.org/abs/2602.17091

Financial Times. (2026). Amazon holds ‘deep dive’ into impact of AI coding tools after outages. https://www.ft.com/content/7cab4ec7-4712-4137-b602-119a44f771de

The author has a professional affiliation with Macroscope. AI tooling moves fast and some of the implementation details above may differ since publication.

➡️ Get Macroscope with $100 of free credit.

Best AI Code Review Tools | April 2026 Edition

Lewis — Wed, 22 Apr 2026 21:53:17 +0000

The world's best engineers have stopped writing code.

Coding agents now handle the bulk of implementation, and they do it at a pace no human can match.

The trouble is that more bugs are shipping with it.

Earlier this year, Concordia researchers tracked 200,000 code units across 201 projects and found AI-written code gets bug-fixed at a higher rate than human-written code.

Amazon recently pulled engineers into a post-mortem on a run of AWS outages, citing "novel GenAI usage for which best practices and safeguards are not yet fully established."

Two weeks ago, Axios, Mercor, Railway, and the Argentinian government all disclosed security breaches. This week it was Vercel's turn. How many of them started with AI code that nobody properly reviewed?

Coding agents have shifted the bottleneck from writing code to reviewing it. The volume is crushing teams, leading many to drop the ball during code review.

A new NAIST study of 1,664 agentic pull requests found that 75% of them merged with zero revisions requested. Three out of four AI-generated PRs shipped without a single human change, despite evidence that they contain more bugs than human-written code.

Manual review has hit its breaking point.

If you ship with a coding agent, you need a code review agent.

The good news is the best AI code reviewers now catch bugs at a higher rate than the first-generation tools everyone turned off, and with a fraction of the noise. Many have surpassed human reviewers on precision.

You just need to pick the right one.

This is my April 2026 roundup of the 8 most commonly asked about AI code review tools, complete with the latest features, strengths, weaknesses, pricing tiers, benchmark performances, and guidance on which one to choose for your stack.

What to look for in an AI code review tool

Before we look at the tools, here are the features that separate ones developers actually keep enabled from the ones they turn off after a week.

Bug detection accuracy

The single most important requirement. Detection rates vary enormously across tools—even on vendor-friendly benchmarks, the gap between the best and worst is consistently over 2x.

Signal-to-noise ratio

Detection means nothing if every real finding is buried under twenty comments about variable naming. Noise is the #1 reason developers turn these tools off. The best tools are moving toward fewer, higher-confidence comments.

Codebase context

A tool that only sees the diff misses the most dangerous bugs: the ones that emerge from how a change interacts with the rest of the codebase. The best tools build a representation of your entire repo and use it when reviewing each PR.

Auto-fix capability

Most tools stop at flagging. A few go further by opening a branch, committing a fix, running CI, and self-healing if CI fails. Closing the feedback loop between "bug flagged" and "bug fixed" turns hours of back-and-forth into minutes.

Independence

Institutions are not supposed to audit themselves. Your agents are no different. If your code generation tool (Cursor, Copilot, Codex) is also reviewing the code it wrote, you get confirmation bias at scale. Dedicated reviewers come at the code from the outside.

Language coverage

A Go reviewer that misses goroutine patterns isn't providing real value. If your stack is polyglot, check the tool's been tested on every language you ship.

Platform support

GitHub users have plenty of options but GitLab and Bitbucket teams have few. Beyond your code host, integrations with Slack, Jira, and Linear determine how quickly your team actually adopts the tool.

The 8 best AI code review tools in 2026

Here are the 8 tools this article covers.

There are 5 dedicated AI code review tools and 3 widely-used tools where code review is one feature among many.

Dedicated AI code review tools:

1/ Macroscope
2/ CodeRabbit
3/ Cursor Bugbot
4/ Greptile
5/ Graphite Diamond

Broader tools with code review features:

6/ GitHub Copilot
7/ Qodo
8/ Claude Code Review

1. Macroscope

Pricing: Usage-based, ~$0.95/review, median $0.50 | Free for open source | $100 free credit to start

Platforms: GitHub

Macroscope was founded by Kayvon Beykpour (co-founder of Periscope, former Head of Consumer Product at Twitter), Joe Bernstein (co-founder of Periscope), and Rob Bishop (co-founder of Magic Pony). Both companies were acquired by Twitter, where the trio led product and engineering across 3,000+ engineers.

Macroscope consistently ships category-first features and was one of the earliest to pioneer AST walkers for review.

More recent innovations:

Usage-based pricing. ~$0.95 per review on average, rather than $30-40 per seat per month whether or not that seat is shipping code. Saves enterprises real money on dormant contractors and inactive accounts.
Auto-tune. A novel prompting technique that tests thousands of model, prompt, and parameter combinations per language to find the highest-performing configuration. It's how Macroscope shipped v3 of their code review engine, which reports 98% precision and 22% lower comment volume than its predecessor (nitpicks down 64% in Python and 80% in TypeScript).
Autonomy. "Fix It For Me" auto-creates a branch, commits the fix, opens a PR, runs CI, and self-heals if CI fails. "Approvability" goes further by auto-approving low-risk PRs (docs, unit tests, simple bug fixes) without a human in the loop. Approvability is the only autonomous approval feature on this list.

Macroscope has also proven popular beyond engineering, thanks to reporting features aimed at leaders and non-technical team members. Its "Status" functionality classifies every commit into "Areas" (product teams or business units), summarizes them in plain language, and sends weekly digests by email—giving execs, PMs, and operations teams visibility into what's shipping.

As Tim Watson, CTO and co-founder at Intro, puts it: "I've been using a few different code review bots for a while now and Macroscope is easily the best one. Still catches things no matter how thorough I am before I push."

Other features worth noting:

Agent. Invokable from Slack, GitHub, or via API to answer questions about your codebase or run tasks across your stack. Example queries:

"How does our auth flow work?"

"Which feature flags are live in production, and how many signups did we get last week?"

"Sentry shows a spike in this error; track down the cause, open a PR to fix it, and file a Jira ticket for Eliza to QA."

Agent treats your codebase, git history, and connected tools (Jira, Sentry, BigQuery, PostHog, LaunchDarkly) as one queryable stack. 1,000 free credits/month, then ~$0.07 per quick question or ~$4.70 per deeper research task.

Integrations. One of the widest integration surfaces in the category, plus any MCP-compatible server, so teams can extend it to Datadog, PagerDuty, or internal tools themselves.

Considerations

GitHub only. No GitLab or Bitbucket support.
Smaller public footprint and shorter track record than more established competitors.

Try Macroscope with $100 free credits.

2. CodeRabbit

Pricing: Free (PR summaries + IDE reviews only) | Pro $24/dev/mo annual ($30 monthly) | Pro Plus $48/$60 for custom rules and higher limits

Platforms: GitHub, GitLab, Bitbucket, Azure DevOps

CodeRabbit is the most widely adopted AI code review tool on the market, with over 2 million repos connected and customers including Brex and PostHog.

It's the most battle-tested option on this list as the longest-running dedicated AI reviewer, with enterprise self-hosting available and the broadest production track record.

It's also the only tool on this list that supports all four major code hosts, so if your team isn't on GitHub, CodeRabbit is likely your strongest option.

The main drawback seems to be noise. Martian's independent benchmark has CodeRabbit scoring at the bottom of the pack on precision for offline PRs, and a handful of grumpy Redditors echo the same complaint.

That said, CodeRabbit catches a high number of real bugs, and the noise can be managed with rule configuration. They're also shipping improvements fast, with Multi-Repo Analysis in March 2026, and Autofix in April.

Key features

Broadest code-host platform support. The only tool on this list covering GitHub, GitLab, Bitbucket, and Azure DevOps.
Autofix (April 2026, early access). Click a checkbox on a review comment and a coding agent spawns to write the fix, commit to your branch, and run build verification. Pro plan, GitHub only, won't auto-merge.
Multi-Repo Analysis (March 2026). When a PR changes a shared API, type, or schema, CodeRabbit checks linked repos for downstream breakage. Useful for microservices teams. Pro plan includes 1 linked repo; Pro Plus raises it to 10.
PR summaries + diagrams. Auto-generated summaries with architectural diagrams. Positive online sentiment around this feature.
Customizable review guidelines. YAML-based config for your team's coding standards, plus natural-language pre-merge rules like "block PRs with hardcoded credentials." This is also how teams manage the noise.
Integrations. Native Jira, Linear, CircleCI. Broader integrations (Slack, Confluence, Notion, Datadog, Sentry) via MCP—5 connections on Pro, 15 on Pro Plus.

Considerations

Noise requires upfront configuration to manage. Without rule tuning, teams often report the signal-to-noise ratio becomes a real cost over time.

Try CodeRabbit with a 14-day free trial.

3. Cursor Bugbot

Pricing: Pro $40/user/mo (200 PRs/mo cap, individual) | Teams $40/user/mo (unlimited PRs, analytics) | Enterprise custom | 14-day free trial | Cursor IDE sold separately

Platforms: GitHub, GitLab

Bugbot is Cursor's AI code review agent. A Cursor IDE subscription isn't required to use it, but if you do use Cursor, the integration is tighter.

Community sentiment on Bugbot's review quality is mostly positive. Users describe reviews as "clean and focused", and it tends to score well on precision across third-party benchmarks—skipping formatting and style nitpicks in favor of real bugs.

Cursor recently shipped Bugbot Autofix, which spawns cloud agents to fix issues it finds, and reports its resolution rate has climbed from 52% to 76%.

Two things to consider. At $40/user/month (on top of any Cursor subscription), Bugbot is among the most expensive options on this list, and the per-seat model means cost scales with headcount whether or not everyone is shipping. Second, independence. If your team already uses Cursor for code generation, Bugbot means the same ecosystem is writing and reviewing—a trade-off worth weighing.

Key features

Bugbot Autofix. Launched February 2026. Spawns cloud agents that work in their own VMs to fix issues Bugbot finds. April updates added a "Fix All" action for resolving multiple fixes at once, and tightened Autofix to only run on substantial findings.
Learned Rules (April 2026). Bugbot learns from developer reactions—downvotes, replies, human reviewer comments on the same PR—and turns those signals into rules that shape future reviews. Candidates become active rules once they accumulate signal, and get retired if they start generating negative feedback.
GitHub + GitLab. Works with both PRs and merge requests, can be enabled as a mandatory pre-merge check.

Considerations

Per-seat pricing adds up fast, separate from any Cursor licenses.
Same-vendor review. Bugbot is made by Cursor, so teams using Cursor to generate code end up with the same ecosystem writing and reviewing it.

Start a 14-day Bugbot free trial.

4. Greptile

Pricing: $30/dev/month (includes 50 reviews, $1/review after) | 14-day free trial

Platforms: GitHub, GitLab

Greptile indexes your entire repository and builds a code graph, then uses multi-hop investigation to trace dependencies, check git history, and follow leads across files.

Its v3 release (late 2025) uses the Anthropic Claude Agent SDK for autonomous investigation, and v4 shipped in March 2026 with further quality improvements.

One of its most distinctive features is the confidence score: each review gets a rating out of 5, used to triage which PRs need immediate human review. Plenty of customers have taken to social media to share their 5/5 scores!

Greptile also has broad coverage across languages and integrations—30+ languages with 12 fully supported, plus connections to Jira, Notion, and Google Drive, and a dedicated Claude Code plugin that brings review commentary directly into terminal-based workflows.

While Greptile's thought leadership is popular on Hacker News, some commenters there have noted that false positives caused them to abandon the tool after a short trial.

Pricing is another sticking point. The hybrid model—$30/dev/month for up to 50 reviews, then $1 per review after—effectively combines the worst of both worlds: you pay for every seat (including dormant ones) and pay per review for your most active developers. For larger teams this stacks quickly, making Greptile one of the more expensive options on this list.

Key features

Confidence scoring. Each review gets a score out of 5 that teams use to prioritize which PRs need human attention.
PR summaries with Mermaid diagrams. Auto-generated summaries include visual diagrams and file-by-file breakdowns.
Learning from your team. Greptile infers coding standards by reading engineer comments and tracking reactions, adapting reviews over time.
30+ languages. Broad language support with 12 languages fully supported.
Integrations. Jira, Notion, Google Drive, and a Claude Code plugin for terminal-based review workflows.

Considerations

Community sentiment on precision has been mixed. v4 (March 2026) is aimed at improving this.
Per-seat + per-review pricing can make Greptile one of the more expensive tools on this list for large or mixed teams.

Try Greptile with a 14-day free trial.

5. Graphite

Pricing: Free (limited AI reviews) | $40/user/month unlimited (annual) or $50 monthly

Platforms: GitHub

Graphite is a different kind of entry on this list. It's built around stacked PRs (the workflow for breaking large changes into small, sequential PRs and merging them in order) and its AI reviewer (Diamond) is one feature within that platform.

If your team wants to adopt stacked workflows, Graphite is the tool on this list. The question is whether the bundled AI reviewer earns its place alongside the dedicated alternatives.

The unfortunate reality is that by most independent measures, it doesn't yet. Graphite Diamond ranked last for bug detection on Martian's independent benchmark, on both offline and online PRs. Negative community feedback tracks with the data.

The reviewer is quiet and low-noise, but that comes at the cost of missing critical bugs. If you're evaluating Graphite primarily for AI code review, the dedicated tools higher on this list are stronger choices.

The picture may change. In December 2025, Graphite was acquired by Cursor, and the team has said they plan to "combine the best of Diamond and Cursor's Bugbot into the most powerful AI reviewer on the market." For now, Graphite operates independently, but the standalone Diamond product's future is tied to that merger.

Key features

Stacked PRs. Graphite's core differentiator. Break large changes into small, dependent PRs and keep shipping while earlier ones are under review. Graphite handles the rebasing automatically—the part that makes stacking painful in native Git.
Merge queue. Stack-aware merging that keeps your main branch green. Pairs naturally with stacked PRs to prevent the merge conflicts that plague teams doing this workflow manually.
Graphite Agent. Fix CI failures and get instant context on code changes directly from the PR page. Requires the Team plan.
Integrations. Slack notifications, CLI, and VS Code extension for managing stacks.

Considerations

Ranked last for bug detection on Martian's independent benchmark, with community sentiment reflecting the same—reviews are quiet but miss critical bugs.
GitHub only.
Acquired by Cursor in December 2025. Plans to merge Diamond with Bugbot mean the standalone AI reviewer may look very different in six months.

Try Graphite free for 30 days.

6. GitHub Copilot

Pricing: Pro $10/mo (300 requests) | Pro+ $39/mo (1,500) | Business $19/user/mo (300) | Enterprise $39/user/mo (1,000). No unlimited plan; shares a monthly request pool with other Copilot features.

Platforms: GitHub

GitHub Copilot is a code completion and AI assistant that includes a review feature. You request a review from Copilot in the GitHub UI the same way you'd request one from a teammate, and it leaves inline comments with suggested fixes.

If your team already pays for Copilot, code review is bundled at no extra cost.

But there are two structural limitations worth weighing.

First, every review consumes a "premium request" from a shared monthly pool that also covers chat, agent mode, and the coding agent—heavy use of other features leaves fewer reviews available.

Second, GitHub's own documentation advises using Copilot review to "supplement human reviews, not to replace them." The dedicated code reviewers are moving in a more ambitious direction.

In March 2026, GitHub rebuilt Copilot code review on an agentic architecture that now explores the repo for broader context—whether this closes the depth gap with dedicated reviewers is too early to tell.

Key features

Zero setup. Lives natively in GitHub. No new tool to install, no vendor to onboard. Request a review from the Reviewers menu and get comments in under 30 seconds.
Suggested changes. One-click apply for code suggestions. Can also invoke Copilot's coding agent to implement fixes as a new PR against your branch (public preview).
CLI access (March 2026). Request a review from the terminal with gh pr edit --add-reviewer @copilot.
Custom instructions. Define review standards in a .github/copilot-instructions.md file.

Considerations

Not a dedicated reviewer. GitHub's own documentation recommends it supplement human review rather than replace it.
Code review shares a capped pool of premium requests with all other Copilot features. Heavy chat or agent use leaves fewer reviews available.

Learn how to get started with GitHub Copilot Code Review.

7. Qodo (formerly CodiumAI)

Pricing: Free (30 PRs/month) | $30/user/mo annual ($38 monthly) | Enterprise custom

Platforms: GitHub, GitLab, Bitbucket, Azure DevOps

Qodo is a broader quality platform where PR review sits alongside IDE-level review, test generation, and compliance reporting. PR review style leans toward structured summaries over line-by-line comments.

It has the category's second-widest platform coverage after CodeRabbit, lets you choose your LLM, and offers on-prem, air-gapped, and single-tenant VPC deployment for Enterprise customers.

The trade-off is that if deep line-by-line PR review is your main need, the dedicated tools earlier in this list have more depth. Qodo's sweet spot is teams who want the broader quality workflow in one tool, or who need deployment flexibility the dedicated tools don't offer.

Key features

Test generation (Qodo Cover). Point it at a function and it produces edge-case unit tests. Unique on this list.
Compliance checks. Validates PRs against security policies, ticket traceability, and org-specific rules. Posts a structured report rather than line comments.
Rules System (February 2026). Qodo reads your codebase and past feedback to auto-generate rules, then enforces them on every PR.
IDE review. Catches issues in VS Code and JetBrains before you open a PR, with one-click AI fixes.
CLI agent framework. Build custom review agents for your CI/CD pipelines. Supports MCP server mode.
Model flexibility. Choose your LLM: Claude, OpenAI, Gemini, DeepSeek, Meta, or Qodo's own.
Integrations. Jira, Monday.com, Linear for ticket context.

Considerations

Not a dedicated PR reviewer by design. Broader product scope means more to learn before getting full value.
Advanced deployment options (on-prem, air-gapped, VPC) require the Enterprise plan at custom pricing.

Learn how to get started with Qodo.

8. Claude Code Review

Pricing: Token-based, averaging $15–25 per review | Teams and Enterprise plans only (not Pro/Max/ZDR) | Billed as extra usage

Platforms: GitHub (managed); GitHub + GitLab via self-hosted CI/CD

Claude Code is Anthropic's AI coding agent, widely considered one of the best at code generation. In March 2026, Anthropic launched Claude Code Review—a multi-agent PR reviewer built on top of it. Specialized agents analyze the diff in parallel, a verification step filters false positives, and surviving findings post as severity-ranked inline comments.

Claude Code is my own agent of choice for writing code, but my last pick for reviewing it. The model that wrote the code introduced the bugs, making it less equipped to find them than an independent reviewer. Anthropic's multi-agent architecture is their deliberate answer, but Claude Code Review hasn't yet placed in Martian's top ten—suggesting the same-model blind spot isn't fully solved.

Cost and access make the picture harder. At $15–25 per review, Claude Code Review is among the most expensive options on this list. For comparison, Macroscope, the only other pure usage-based tool here, averages ~$0.95. Runtimes also tend to be slower at ~20 minutes per PR, and it's restricted to Teams and Enterprise plans, with no Zero Data Retention support.

Key features

Multi-agent PR review. Specialized agents analyze the diff in parallel for different classes of issue (logic, security, regressions). A verification step filters false positives before posting.
Severity ranking. Findings are tagged 🔴 Important (blocker), 🟡 Nit (minor), or 🟣 Pre-existing. Also surfaced as a CI check run for custom gating.
Custom rules. A REVIEW.md file gives review-only instructions (severity tuning, nit caps, skip rules). CLAUDE.md handles project-wide architecture.
CLI plugin. Run /code-review directly from the terminal to get feedback on local diffs before pushing.
Self-hosted CI/CD. GitHub Actions and GitLab CI/CD integrations let you run Claude Code Review in your own pipelines—the only path for GitLab teams.

Considerations

Self-review: the same model that wrote your code is reviewing it. Anthropic's multi-agent architecture hasn't yet closed the gap with dedicated tools on Martian's benchmark.
Teams/Enterprise only. No Zero Data Retention support, which rules it out for regulated industries.
Token-based pricing runs high (~$15-25/review), and reviews take ~20 minutes per PR.

Learn how to get started with Claude Code Review.

How the tools compare

Tool	Pricing	Platforms
Macroscope	~$0.95/review (usage-based)	GitHub
CodeRabbit	$24–48/dev/mo (per-seat)	GitHub, GitLab, Bitbucket, Azure
Cursor Bugbot	$40/user/mo (per-seat)	GitHub, GitLab
Greptile	$30/dev/mo + $1/review over 50	GitHub, GitLab
Graphite	$40–50/user/mo (per-seat)	GitHub
GitHub Copilot	Bundled (capped request pool)	GitHub
Qodo	$30–38/user/mo; Enterprise custom	GitHub, GitLab, Bitbucket, Azure
Claude Code Review	$15–25/review (token-based)	GitHub (+ GitLab via self-hosted CI)

Free for open-source: Macroscope, CodeRabbit, Greptile

How to choose

You're not on GitHub. If your team uses GitLab, Bitbucket, or Azure DevOps, most of this list is off the table. CodeRabbit and Qodo support all four major code hosts. Bugbot and Greptile cover GitHub and GitLab. Claude Code Review can run on GitLab via self-hosted CI/CD.

You're cost-sensitive. Qodo's free tier covers 30 PRs/month—no credit card needed for small teams. CodeRabbit is the cheapest flat per-seat option at $24/dev/mo. For larger enterprises, Macroscope's usage-based pricing (~$0.95/review) scales with actual activity rather than headcount, avoiding dormant-seat costs.

You want bugs fixed, not just flagged. Macroscope goes furthest on autonomy—"Fix It For Me" commits and runs CI, "Approvability" auto-approves low-risk PRs without human review. Bugbot's "Autofix" (with April's "Fix All") and CodeRabbit's "Autofix" (early access) spawn agents to write fixes, though neither auto-merges. Copilot can also hand suggestions to a cloud agent.

You need more than just code review. Graphite is built around stacked PRs and merge queues. Qodo is a broader quality platform—test generation, compliance, IDE review. Macroscope's Status feature (commit classification by product area, executive summaries, weekly digests) gives leadership visibility into what's shipping.

You prefer to stay within one ecosystem. Some teams may value tight integration over independence. If you already use Claude Code to generate code, Claude Code Review lives in the same workflow. If your team pays for Cursor IDE, Bugbot is natively integrated. If GitHub Copilot is already in your subscription, code review is bundled at no extra cost. And if you want to deeply adopt stacked PRs on GitHub, Graphite ties the whole workflow together. Each of these comes with the independence trade-off flagged earlier in this article.

Where this is going

The best AI code review agents now rival human reviewers on precision. That doesn't mean they catch everything, and it certainly doesn't mean you can take humans out of the loop.

But AI code review is the only way to keep up with the firehose of output from AI coding agents.

Today, these tools have started to take on a degree of autonomy — fixing the bugs they detect, sometimes even merging low-stakes code without a human in the loop. The trend will continue. As confidence grows, review agents will take over more and more of the work that today still falls to engineers.

Whichever tool you choose, the decision to adopt one is already made for you. The only question left is which.

References

Every effort has been made to ensure the information in this article is accurate as of its time of writing. The AI tooling space moves fast—visiting each vendor's site for the latest details is advised. The author has a professional affiliation with Macroscope, referenced in this article.

Rahman, M. & Shihab, E. (2026). Will It Survive? Deciphering the Fate of AI-Generated Code in Open Source. EASE 2026. https://arxiv.org/abs/2601.16809

Watanabe, K., Shirai, T., Kashiwa, Y., & Iida, H. (2026). What to Cut? Predicting Unnecessary Methods in Agentic Code Generation. MSR 2026. https://arxiv.org/abs/2602.17091

Financial Times. (2026). Amazon holds 'deep dive' into impact of AI coding tools after outages. https://www.ft.com/content/7cab4ec7-4712-4137-b602-119a44f771de

AI code lasts longer than human code, new study finds (but not for the reason you’d hope).

Lewis — Tue, 17 Mar 2026 17:09:50 +0000

AI-generated code survives 16% longer in production than human code before anyone touches it.

That's the headline finding from a new study out of Concordia University's DAS Lab, led by Emad Shihab and accepted at EASE 2026.

The researchers tracked over 200,000 individual code units across 201 open-source projects using survival analysis, a method borrowed from medical research, to answer a simple question: how long does AI-generated code last in production?

The answer is agent-authored code has a 15.4 percent lower modification rate than human code. At any given time, it faces 16% less risk of being changed.

Sounds like a win for AI coding agents, right?

Not necessarily.

The researchers wanted to understand why AI code was being left untouched for longer. Is it because it's higher quality, or something else?

The data suggests something else.

When they examined what happens when AI code is finally modified, they found that 26.3% of modifications to AI code are bug fixes, compared to 23% for human code. When someone finally touches AI-generated code, it's more likely to be because something was broken.

So AI code appears to contain more latent bugs, yet those bugs sit unaddressed for longer. Why?

The researchers point to a well-documented phenomenon in software engineering as one possible reason: the "Don't touch my code!" effect. Developers avoid modifying code they didn't write. AI-generated code has no human author so nobody feels responsible for maintaining it.

Nuance in the per-tool data.

The study tracked five AI coding tools, and the results between them varied.

Cursor, one of the most sophisticated tools tested, had the lowest corrective modification rate of any tool at just 13.8%. When someone touches Cursor-assisted code, it's rarely to fix a bug.

Yet Claude Code, also a powerful offering, had a corrective rate of 44.4%, nearly double the human baseline.

One possible explanation here is that Cursor tends to keep the code visible in the interface as you work, but Claude Code has an interface that abstracts the code further away from the developer's view.

The idea that the degree to which a developer sees, understands, and engages with the code during generation matters as much as the quality of the tool itself is a sensible theory.

But a stronger clue for why AI generated code survives longer comes from a separate study entirely.

The review burden

Amazon recently summoned a large group of engineers for a "deep dive" into a spate of outages, including incidents tied to AI coding tools. A briefing note cited "novel GenAI usage for which best practices and safeguards are not yet fully established" as a contributing factor.

Researchers at NAIST (Nara Institute of Science and Technology), in a paper accepted at MSR 2026, analyzed 1,664 merged agentic pull requests across 197 open-source projects. They found that 75% of agentic PRs pass through review with zero revisions. Three out of four AI-generated PRs sail through without a single change requested.

There is a growing chorus of developers complaining about the ballooning burden of code review. As AI coding agents improve, engineers who use them ship more code. As engineers ship more code, the volume of code that has to pass through review skyrockets.

The tidal wave of code review could be what's driving developers to rubber stamp bugs into production at AWS, in the above-referenced studies, and beyond.

Bugs that would have been caught in a more thorough review slip through. And if nobody engaged deeply with the code during review (nor at time of generation), nobody understands it well enough to feel equipped, or responsible, to maintain it later.

Amazon's response is to require junior and mid-level engineers to get senior sign-off on all AI-assisted changes. But adding more human sign-off to a process that's already struggling to keep up cannot fix the core tension.

So how do we prevent orphaned, buggy code filling up codebases, without drowning humans in an impossible mountain of manual review?

The end of human code review

The best articulation I've seen of what the answer should look like comes from Kayvon Beykpour, previously the CEO of Periscope / head of product at Twitter, and presently the cofounder of the AI code review tool Macroscope.

In a widely shared post, he predicted that "soon, human engineers will review close to zero pull requests," and that instead, "code review will become always-on and increasingly automatic as code is being written. A new orchestration layer will emerge where agents will decide when PRs are ready to merge and only (infrequently) escalate to humans."

Beykpour argues code review needs to be pulled closer to where code is being written, not delayed until a PR is opened. Specialized review agents should continuously analyze code as it's generated, verify correctness, and coordinate with coding agents to address issues in real time.

If the AI-generated code in the Concordia study had been continuously reviewed by a dedicated agent as it was written, the bugs would have already been caught, and it wouldn't have mattered that an engineer waved the code into production.

Then, at the PR stage, Beykpour says AI agents should orchestrate "merge readiness": assessing whether the code was sufficiently tested, evaluating blast radius, checking trust profiles, and deciding whether human escalation is actually required.

When low-risk PRs are taken off an engineer's plate, they have more time and bandwidth for the reviews that actually matter. And when those reviews reach them, they know they're important.

The first glimpse of this future is "Approvability," a feature rolled out by Beykpour's team last month that automatically evaluates every PR against two hurdles before deciding whether it can merge without a human reviewer.

Trusting an AI to decide whether code can merge without a human reviewer will seem reckless to some, but this is how every generation of programming has evolved.

When the compiler took over the task of writing machine code in the 1950s, programmers didn't trust it—so they inspected its binary output line-by-line, swapping writing for reviewing. Over time, a set of checks and balances were built around the compiler—listing prints, error diagnostics, optimization passes, etc—until manual verification became redundant.

The same pattern played out with operating systems, CI tools, and cloud platforms. Each initially added a burden of oversight, and each eventually earned enough trust that the oversight became unnecessary.

While the above discussed research can help us diagnose some of the problems with LLM-written code, that it contains more bugs, gets left untouched longer, and mostly sails through review unchallenged, the teams that solve these challenges won't be the ones who review harder, but rather the ones who review smarter.

From 75% to 98% Precision: The Research Paper That Changed How a Startup Prompts AI

Lewis — Wed, 11 Mar 2026 20:30:37 +0000

GPT 5.3 and Opus 4.6 dropped on the same day last month. The team at Every, the tech publication that’s become one of the go-to sources for hands-on AI coverage, ran a “vibe check”. They had their team compare the models head to head, and nobody could pick a clear winner.

Their CEO Dan Shipper uses them 50/50. Other team members each landed on a different mix. The consensus: each model has different strengths, and the best approach is to use both depending on the task.

But “depending on the task” is doing a lot of heavy lifting in that sentence. How do you figure out which model is best for which task? And once you’ve picked one, how do you write the prompt that unlocks its peak performance on that specific job? Throw in all of the other models that you may need to consider and this becomes a wicked problem.

Left unsolved, this wickedness means you’re leaving serious performance on the table, whether you’re building products on top of AI or trying to get the most out of these models in your work.

The default approach for most teams is to roll with the vibes. Pick a model. Write prompts by hand. A/B test against a benchmark. Wait for results. Tweak. Repeat. Try to build intuition about what each model is good at. Maybe add some few-shot examples. It’s better than nothing but you won’t achieve differentiated results.

The more sophisticated approach is automated prompt optimizers. These are systems where an LLM writes a prompt, scores it against a benchmark, reflects on the results, and tries to write a better one — looping until the score plateaus. The best use evolutionary approaches, maintaining multiple candidate prompts and breeding the best performers together. This is better than doing it by hand, but it hits its own ceiling — one that elite AI researchers recently pinned down to two constraints.

Brevity bias and context collapse

In their 2026 ICLR paper, the research team identified two core failure modes with prompt optimizers. The first is brevity bias: the optimizers tend to converge on short, generic prompts — because short prompts are safe. “Be careful with edge cases” never hurts on any particular test case, so it survives selection round after round. “When processing this specific type of input, check for Y” only helps 10% of the time, so it gets pruned. Over many rounds, the specific stuff dies and the generic stuff lives. You end up with prompts that are OK at everything and great at nothing.

The second failure mode is context collapse. When the optimizer asks an LLM to rewrite a large, detailed prompt, the LLM compresses it — sometimes catastrophically. The researchers showed an example where 18,000 tokens of accumulated knowledge collapsed to 122 tokens, and performance dropped below the baseline. The system literally forgot everything it had learned.

Until recently, that was the landscape. A/B testing was too slow. Automated optimization converged on mediocrity. Neither approach scaled with the speed of model progression.

I’ve been bumping up against this exact problem myself with an AI product I’m building. So when one of the startups I work with, Macroscope, published a detailed article outlining a new approach to prompt optimization they call “auto-tuning,” I leaned in extra hard.

The A.C.E. in the hole

Macroscope does AI code review. Their product needs to work across every programming language developers use, and each language has different idioms. A prompt that catches real bugs in Go flags noise in Python. Adding few-shot examples for one language broke another. New models shipped faster than they could build full intuition about the old ones.

Then at the tail-end of last year Macroscope found our aforementioned ICLR paper. In addition to diagnosing their exact problem, it proposed a solution. The researchers called it Agentic Context Engineering, or ACE.

Instead of trying to find one perfect prompt through iterative rewriting, ACE builds a playbook. Three LLM roles work together: a Generator that attempts the task, a Reflector that diagnoses what went wrong, and a Curator that adds a specific bullet point to a master playbook based on what it learned.

The key constraint is that the Curator can only add or update individual bullets. It never rewrites the whole prompt. This prevents the context from collapsing into a generic summary — the failure mode that plagues traditional prompt optimizers.

The result is a prompt that accumulates hundreds of specific, detailed entries over time. Not “be careful with date formatting,” but “when processing Venmo transactions, use datetime range comparisons, not string matching.” The model reads the full playbook at inference time and naturally pays attention to whichever entries are relevant for the current task.

ACE showed roughly 10% improvement on agent benchmarks, matching top-ranked production agents while using a smaller open-source model. 10% is cute, but Macroscope pushed ACE further.

Prompt x Model x Language

ACE optimizes the playbook for a fixed model — you pick one model and the system improves the prompt for that model. Macroscope asked a different question: what if we run this process across every model simultaneously? Same task, same benchmark, but now the system is building and testing playbooks for GPT, Gemini, Opus, and others in parallel — discovering not just the best prompt, but the best model-prompt combination.

It’s closer to having a dedicated prompt engineer iterate on prompts for every model at once, except auto-tune can test ideas in parallel and doesn’t get tired.

And when they did this, they discovered something unexpected.

Finding subtask-model fit

The system found that models have stable behavioral signatures — personality traits, essentially — that they can’t turn off. And it learned to exploit them.

GPT-5.2 hedges. When GPT is uncertain, it says things like “this could potentially cause an issue” instead of committing. The hedging leaks through even with explicit instructions to be decisive. But auto-tune discovered that this hedging correlates strongly with false positives. The model is expressing genuine uncertainty, and that uncertainty is a useful signal. Modal language like “could,” “potentially,” and “may” became a rejection filter.

Gemini 3 rambles. Gemini sometimes thinks out loud mid-response. “Wait, let me re-read that.” “However, on second thought…” When it does this, it’s usually about to get the answer wrong. The self-correction is also a tell. Auto-tune learned to catch it and those phrases became rejection signals. Not every model does this though. Opus, for example, doesn’t ramble, so it doesn’t need this filter.

Once you understand each model’s natural tendencies, you can start assigning them to the tasks they’re suited for — even pairing them in concert on the same task to achieve results neither could produce alone.

One of autotune’s most useful findings was to pair a permissive model for detection with a strict model for validation. Tell the detection model to flag everything, false positives acceptable. Then use a different model to ruthlessly filter out anything that involves hedging, speculation, or claims that can’t be proven from the code. One optimizes for recall, the other for precision.

The differences unearthed by auto-tune are not subtle. Given the same “flag everything” directive, Opus flags 199 potential issues. GPT flags 3,923. Same task, 20x different output.

The team said: “We probably wouldn’t have tried pairing different models for different subtasks without auto-tune — it seemed unnecessarily complex.”

Near perfect precision

Remember, ACE achieved roughly 10% improvement on agent benchmarks. Macroscope’s results were more dramatic.

Overall precision jumped from 75% to 98% — meaning nearly every comment the system leaves is now correct. It catches 3.5x more high-severity bugs while leaving 22% fewer comments overall. Nitpicks dropped 64% in Python and 80% in TypeScript.

Since launching v3, developer thumbs-up reactions increased 30%, comments per PR dropped 37%, and developers are resolving 10% more of the issues flagged.

To achieve these results, Macroscope also layered in a few additional engineering enhancements that they detail in their post, for instance:

Severity-weighted scoring — a critical bug scores 125x higher than a low-severity one.
Learning rate controls — at low rates, the system tweaks wording. At high rates, it rewrites entire sections.
Anti-overfitting guidance — the system is instructed to identify underlying patterns across a batch of results, not make changes to address a single specific failure.

Beyond code review

Here’s why I think this matters for more than just one code review startup.

Every AI product faces the same underlying problem. Too many models. They change too fast. Different prompts work differently across tasks. And the behavioral signatures auto-tune discovered — the hedging, the rambling, the calibration differences — aren’t specific to code review. They’re properties of how these models reason. A model that hedges when reviewing code hedges when analyzing a legal contract. A model that rambles before getting code wrong rambles before getting a medical assessment wrong.

Anywhere you have judgment calls — where models can disagree, and the pattern of their disagreement carries information — this approach applies. Legal review. Medical triage. Content moderation. Financial risk assessment. The principle is the same: the right architecture routes each subtask to the model whose natural calibration fits it best, and crafts the prompt that maximizes that fit.

The model-agnostic advantage

There’s one final structural angle here that I think is underappreciated.

The labs — OpenAI, Google, Anthropic — are locked into their own models. OpenAI is never going to tell you to use Gemini for detection and Opus for validation. They’re incentivized to make their suite of models work for everything. That’s a reasonable strategy for them, but it means they’ll never find the cross-model combinations that auto-tune surfaces.

Companies that aren’t locked into one family of models have an inherent advantage: they can actually search the full space. Every model, every prompt, every combination. Auto-tuning allows you to tap into that advantage — and every time a new model drops, the system can re-run and find new optimal combinations automatically.

Macroscope’s full technical deep dive covers a lot more than I could here — including the specific ML techniques they borrowed, their benchmarking methodology, and the limitations of the approach. If this topic interests you, I’d recommend reading it in full.

References

Zhang, Q., Hu, C., Upasani, S., Ma, B., Hong, F., Kamanuru, V., Rainton, J., Wu, C., Ji, M., Li, H., Thakker, U., Zou, J., & Olukotun, K. (2025). Agentic Context Engineering (ACE). ICLR 2026. https://arxiv.org/abs/2510.04618

Macroscope. (2026). We (Basically) Stopped Writing Prompts. https://macroscope.com/blog/we-stopped-writing-prompts

Every. (2026). GPT 5.3 Codex vs. Opus 4.6: The Great Convergence. https://every.to/vibe-check/codex-vs-opus