A 3-lane, 90-day playbook for teams that ship AI-assisted code.
I’m not anti-AI, but I am anti-surprises. I’ve been working in infrastructure long enough to know how things break, and it’s almost never dramatic. Nobody deploys a rootkit on purpose. Unless they’re the bad guy, and in that case: congrats on being proactive. What actually happens is someone pastes a “quick helper” into a repo on a Friday afternoon, the code compiles, tests are green, and everyone wants to go home. Two weeks later, you’re on a call at midnight because the helper logs the full request body including the authorization header “just for debugging.” Or it catches every exception and silently returns success, so the function never actually fails, until it fails in a way nobody notices for three days. Nobody did anything malicious. It just happened fast.
That’s the shift worth paying attention to. We can now generate a lot of code, quickly, and it often looks confident while being slightly wrong. Stack Overflow’s 2025 Developer Survey captures the tension across multiple AI questions: 84% of respondents say they are using or planning to use AI tools (Stack Overflow Developer Survey 2025 — AI). When asked about AI accuracy, more developers distrust it than trust it (46% vs 33%) (Stack Overflow Developer Survey 2025 — AI). And when asked about frustrations, “almost right, but not quite” is the most commonly cited issue (66%) (Stack Overflow Developer Survey 2025).
If you’ve ever debugged someone else’s code, you know that almost right is sometimes worse than completely wrong. At least completely wrong fails loudly.
The Veracode 2025 GenAI Code Security Report put numbers on what that looks like in practice: 45% of AI-generated code samples introduced OWASP Top 10 vulnerabilities across 100+ large language models, with Java hitting a 72% security failure rate and Python, C#, and JavaScript all falling in a similar range. (Veracode 2025 GenAI Code Security Report) And this isn’t just a lab finding anymore. Aikido Security’s 2026 State of AI report, surveying 450 developers and security leaders across Europe and the US, found that one in five organizations have already suffered a security incident caused by AI-generated code, and 69% have found vulnerabilities introduced by it in their own systems. (Aikido Security: State of AI in Security & Development, 2026)
So this isn’t the scare piece. This is the boring follow-up where we actually do something about it.
The Principle That Makes Everything Else Work
Don’t create an “AI lane.” Create risk lanes. There’s a temptation to treat AI-generated code as something that needs its own special review process, a separate track with a label and a checkbox on the PR template. The intent is usually good because we want visibility into what’s being generated versus what’s being written by hand.
But there’s a real risk this backfires. A KPMG and University of Melbourne study surveying over 48,000 workers across industries in 47 countries found that 57% of employees conceal how they use AI at work. (KPMG Trust in AI, 2025) And a study published in Harvard Business Review showed that when engineers evaluated identical Python code, they rated the author’s competence 9% lower if they believed AI was used, same code, lower score, just because of the label. (HBR: The Hidden Penalty of Using AI at Work, 2025)
These are studies about perception and behavior broadly, not about engineering teams specifically, but the pattern they describe is hard to ignore. If you build a process that singles out AI-generated code, you’re likely creating an incentive to hide it, and then you end up with the worst of both worlds: AI-generated code everywhere, zero visibility into where it is.
Here’s what I think works better. Route reviews by what the code touches, not by who or what wrote it. In practice, that means splitting your changes into something like three lanes:
Fast lane: Documentation, comments, test descriptions, CSS and styling, localization strings. These carry minimal blast radius. No CODEOWNERS requirement on these paths, lighter CI checks (skip SAST, skip IaC scanning), standard branch protection still applies. One reviewer, automated checks pass, merge and move on.
Standard lane: Application logic, API endpoints, frontend components, database queries that don’t touch schema. This is most of your PRs and your default review process. These can still introduce security issues, and that is what your SAST checks and review process are for. One or two reviewers, all status checks green, CODEOWNERS approval where relevant.
Critical lane: Anything that touches authentication or authorization logic, CI/CD workflows and pipeline definitions, infrastructure-as-code, secrets management, database schema migrations, CODEOWNERS itself, or network and firewall rules. Enforce this by adding CODEOWNERS entries for critical paths (like .github/workflows/** and your infra/ directory, for example) and requiring code owner approval in your branch protection rules. That turns the lane from a suggestion into a gate. The designated reviewer actually understands the blast radius of the file they're approving.
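Concretely, the critical lane might look like this in a CODEOWNERS file (paths and team handles are illustrative, not prescriptive):

```
# Last matching pattern wins, so keep critical paths near the bottom.
# Team handles below are placeholders; use your own org's teams.
/.github/workflows/   @example-org/platform-team
/infra/               @example-org/platform-team
/src/auth/            @example-org/security-reviewers

# CODEOWNERS itself is a critical path: changing it changes who gates what.
/CODEOWNERS           @example-org/platform-team
```

With "Require review from Code Owners" enabled in branch protection, these entries become hard gates rather than suggestions.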
The lanes aren’t about AI. A human writing a Terraform change that opens port 22 to the internet is just as dangerous as Copilot doing the same thing. The point is that your review effort goes where the damage potential actually lives, and code that can’t hurt you in production doesn’t sit in a queue waiting for the same level of scrutiny as code that can.
One thing to watch for: people will work around the lanes, and not always maliciously. Someone splits a PR so the auth logic change lands in one diff and the “harmless refactor” that makes it work lands in another. Or they rename a file to dodge a CODEOWNERS path. The mitigations: keep your CODEOWNERS paths broad enough to catch common renames (own the directory, not just the filename), add CI checks that scan for security-sensitive patterns like credential handling or permission changes regardless of which file they appear in, and be honest that if someone is actively working around your review process, you have a trust problem that no amount of tooling will fix on its own.
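The second mitigation can be sketched as a small CI step that scans added lines in a PR diff for sensitive content, whatever file they land in. This is a minimal illustration, not a SAST replacement; the patterns are examples to tune for your stack:

```python
import re

# Illustrative patterns only; extend for your stack. The point is that they
# fire on the *content* of added lines, regardless of file path or rename.
SENSITIVE_PATTERNS = {
    "credential-like assignment": re.compile(
        r"(api[_-]?key|secret|password|token)\s*[:=]\s*['\"][^'\"]{8,}", re.I
    ),
    "broad workflow permissions": re.compile(r"permissions:\s*write-all"),
    "world-open ingress": re.compile(r"0\.0\.0\.0/0"),
}

def scan_diff(diff_text: str) -> list[str]:
    """Flag added lines ('+' lines in a unified diff) matching a sensitive pattern."""
    findings = []
    for lineno, line in enumerate(diff_text.splitlines(), 1):
        if not line.startswith("+") or line.startswith("+++"):
            continue  # only inspect additions; skip the '+++ b/file' header
        for label, pattern in SENSITIVE_PATTERNS.items():
            if pattern.search(line):
                findings.append(f"line {lineno}: {label}")
    return findings

# In CI, fail the job when findings is non-empty, e.g. by running scan_diff()
# over the output of `git diff origin/main...HEAD`.
```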
The Volume Problem Nobody Planned For
Before we get into the playbook, it’s worth understanding why this is urgent rather than just important.
AI tools don’t just change what code looks like, they change how much of it shows up in your review queue. Faros AI published research in 2025 based on telemetry from over 10,000 developers across 1,255 teams. Teams with high AI adoption completed 21% more tasks and merged 98% more pull requests, which sounds great until you see the other side: PR review times increased by 91%, and the PRs were also larger. (Faros AI: The AI Productivity Paradox, 2025). The bottleneck moved. It used to be writing code. Now it’s reviewing it, and most teams haven’t adjusted.
I want to be honest here: the playbook below does not magically solve the volume problem. Nobody has a clean, proven answer to “how do you review two or three times as many PRs with the same number of senior engineers.” What the controls below are designed to do is make sure the increased volume goes through actual checks instead of getting rubber-stamped because the reviewer has 47 PRs in their queue and a sprint review in two hours.
The closest thing to a real strategy right now is layering. Automated checks catch the surface-level problems before a human ever opens the diff. Risk-based routing through CODEOWNERS means expensive human attention goes where it actually matters, which is why the lane system above exists: your senior engineers should never spend their limited review time on a docs change when there’s a workflow permission change three PRs down in the queue. Generation-time guardrails like AGENTS.md mean the PR that arrives in your queue is already cleaner because the agent ran linting and tests before opening it. And AI-assisted code review tools like GitHub Copilot code review or CodeRabbit are becoming a practical first-pass layer that catches obvious bugs and known vulnerability patterns before a human reviewer ever sees the diff. None of these layers are perfect on their own, and the AI-assisted review tools in particular are still early enough that your team will spend the first few weeks calibrating what to ignore versus what to act on.
But the net effect is that reviewers spend their time on logic, architecture, and security design instead of catching hardcoded secrets and missing null checks. That’s the difference between a review process that scales and one that quietly collapses under weight.
Day 1: Stop the Bleeding
The lanes decide who reviews what. The 90-day plan below decides what controls run before a reviewer ever sees the diff.
Day 1 is about what you can do this week, not in a perfect world, not after the next planning cycle.
Protect the branch, protect your future self. Start with the basics that everything else depends on: disable force pushes and branch deletion on protected branches. A bad PR that gets merged leaves a trail you can investigate, but a force push rewrites that trail entirely, and a deleted branch takes it with it. Once those are locked down, build on top of them by requiring pull requests for main, requiring at least one reviewer, and requiring status checks before merge with build and tests at minimum. This isn’t about distrust, it’s about stopping “oops” from becoming “incident.” If you don’t have this foundation already, everything else in this article is academic.
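If you prefer to manage this as configuration rather than clicking through settings, the same protections can be expressed as a GitHub repository ruleset. A sketch (field names follow GitHub's rulesets REST API; verify against the current docs before importing):

```json
{
  "name": "protect-main",
  "target": "branch",
  "enforcement": "active",
  "conditions": {
    "ref_name": { "include": ["refs/heads/main"], "exclude": [] }
  },
  "rules": [
    { "type": "deletion" },
    { "type": "non_fast_forward" },
    {
      "type": "pull_request",
      "parameters": {
        "required_approving_review_count": 1,
        "require_code_owner_review": true,
        "dismiss_stale_reviews_on_push": true,
        "require_last_push_approval": false,
        "required_review_thread_resolution": false
      }
    },
    {
      "type": "required_status_checks",
      "parameters": {
        "strict_required_status_checks_policy": false,
        "required_status_checks": [
          { "context": "build" },
          { "context": "test" }
        ]
      }
    }
  ]
}
```

The `deletion` and `non_fast_forward` rules are the "protect your future self" pieces: no deleted branches, no rewritten history.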
CODEOWNERS for the files that can actually hurt you. Not every file in your repo carries the same risk. A CSS change and a workflow permission change are not the same thing, and pretending every reviewer is equally qualified for both is how you end up with a junior approving a change to your CI/CD pipeline because the diff looked small. Add CODEOWNERS for .github/workflows/**, your infrastructure directory, wherever your authentication logic lives, and the CODEOWNERS file itself, because people are creative. If you enable "Require review from Code Owners," GitHub enforces that the right people approve the right files. This is where the critical lane becomes real: CODEOWNERS turns “this needs the right reviewer” from a suggestion into a gate.
Secret scanning with push protection. Push protection stops the commit before the secret reaches the remote, so turn it on even if you think your team would “never” commit secrets, because they will, just not on purpose. With AI tools generating config files and helper scripts at volume, the probability goes up, not down.
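Push protection runs server-side, but you can add an even earlier tripwire locally. A minimal pre-commit sketch; the patterns are a tiny illustrative subset of what GitHub's scanning covers, not a replacement for it:

```python
import re

# A few well-known token shapes. GitHub's secret scanning knows hundreds of
# provider patterns; this local check is just an extra, earlier tripwire.
TOKEN_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                    # GitHub PAT (classic)
    re.compile(r"AKIA[0-9A-Z]{16}"),                       # AWS access key ID
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"), # private key material
]

def find_tokens(text: str) -> list[str]:
    """Return the regexes that matched; an empty list means nothing suspicious."""
    return [p.pattern for p in TOKEN_PATTERNS if p.search(text)]

# Wire it into a pre-commit hook by running find_tokens() over the output of
# `git diff --cached` and refusing the commit on any hit.
```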
Dependency scanning and a basic SAST pass. You don’t need perfect tooling on Day 1, you need consistent signal. Turn on dependency alerts and run a basic SAST scan on PRs. It will be noisy, but you’re not trying to catch everything yet, just trying to stop shipping something obviously avoidable while you build the rest.
AGENTS.md if your team uses AI coding agents. This one surprised me. AGENTS.md is an open format that multiple AI coding agents now support, including Codex and Cursor among others (see agents.md for the current list). Think of it as a README but for agents: you put it in your repo root and it tells the agent things like “run linting before opening a PR,” “never modify workflow files without flagging for review,” and “do not commit credentials even in test files.”
It’s not enforcement, since the agent could still ignore it the same way a human could ignore your CONTRIBUTING.md. But it creates a shared expectation at the repo level, which means you’re not relying on every developer to individually configure their AI tool correctly. If you’re setting up repo templates for new projects, add an AGENTS.md alongside your CODEOWNERS file. It sets a baseline before the first AI-generated PR ever lands.
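A starting point might look like this (the rules are examples; write down whatever your team actually expects from a contributor, human or agent):

```markdown
# AGENTS.md

## Before opening a PR
- Run the linter and the full test suite; do not open a PR with failing checks.

## Hard rules
- Never modify files under .github/workflows/ without explicitly flagging
  the change for platform-team review in the PR description.
- Never commit credentials, tokens, or API keys, including in test fixtures.
- Prefer existing utility functions over adding new dependencies.
```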
Day 1 goal: reduce the probability of shipping something that would embarrass you in an incident review.
Day 30: Make Quality Repeatable
By Day 30 you’ve had the Day 1 controls running long enough to see what they catch and what they miss. You’ve probably had at least one PR where the status checks saved you from something, and at least one where they didn’t catch something they should have. That’s the signal you use to tighten things up.
Mandatory checks before merge. At this point, make these non-negotiable in your branch protection: build, tests, dependency scan, secret scan, and an IaC scan if infrastructure-as-code lives in the repo. If a PR can bypass any of these and still land in main, you have a policy, not a control. A policy says “we expect people to do this”; a control means the system won’t let you skip it. By Day 30, you should be running controls.
AI-assisted code review as a first pass. This is the most direct answer to the volume problem from earlier. Tools like GitHub Copilot code review or CodeRabbit can review every PR before a human touches it. The better ones combine static analysis with LLM reasoning, so they can tell whether that SQL query is actually dangerous in context rather than just flagging every string concatenation. Your human reviewers should be spending their time on whether the architecture makes sense and whether the security design holds up under edge cases, not on spotting a hardcoded API key in line 47. For context on cost, GitHub Copilot Business, which includes the code review feature, is $19/user/month. Third-party alternatives have their own pricing. The specific tool matters less than the principle: if your team generates more PRs than your reviewers can thoughtfully evaluate, you either add an automated first-pass layer or you accept that human review becomes performative.
SBOM as a practical upgrade. I didn’t think much about Software Bill of Materials until I thought through what happens without one. If a customer asks what libraries are inside your product, the answer is an actual list instead of silence and a follow-up email. If a dependency you didn’t even know was in the chain turns out to have a CVE, impact analysis becomes a query instead of detective work. If you want practical guidance on how SBOM fits into broader supply chain security, SLSA (Supply-chain Levels for Software Artifacts) is worth reading. It’s an OpenSSF framework that provides incremental levels for improving supply chain integrity from basic provenance tracking up through tamper-proof builds. (SLSA Framework)
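Here's what “impact analysis becomes a query” means in practice. A sketch against a trimmed CycloneDX-style SBOM (real ones come from tools like Syft via `syft . -o cyclonedx-json`; only the fields used here are shown):

```python
import json

# A trimmed CycloneDX-style SBOM; real documents carry many more fields.
SBOM = json.loads("""
{
  "bomFormat": "CycloneDX",
  "components": [
    {"name": "log4j-core", "version": "2.14.1",
     "purl": "pkg:maven/org.apache.logging.log4j/log4j-core@2.14.1"},
    {"name": "requests", "version": "2.31.0",
     "purl": "pkg:pypi/requests@2.31.0"}
  ]
}
""")

def affected(sbom: dict, package: str, bad_versions: set[str]) -> list[str]:
    """Return purls of components matching a vulnerable package/version set."""
    return [
        c["purl"]
        for c in sbom.get("components", [])
        if c["name"] == package and c["version"] in bad_versions
    ]

# "Are we exposed to these log4j versions?" becomes one call instead of
# three days of detective work across repos:
hits = affected(SBOM, "log4j-core", {"2.14.0", "2.14.1", "2.15.0"})
```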
Day 30 goal: the quality of what ships no longer depends on who happens to be reviewing that day.
The Line Between Suggesting and Executing
There’s a line that a lot of guardrail discussions skip over, and it’s worth drawing explicitly.
When AI is only generating text in your IDE, you’re dealing with a code quality problem where the developer sees the suggestion, maybe accepts it without reading carefully, and it goes through whatever review process exists. The human is still in the loop, even if they’re not paying enough attention.
When AI can execute actions, that is a different problem entirely. Tools like Codex, Cursor in agent mode, and Cline don’t just suggest code: they read your repository, run terminal commands, modify files across your codebase, and create pull requests autonomously.
Research on LLM agent security has been accelerating. A comprehensive survey published in late 2025 on attacks and defenses targeting LLM-based agents identifies how tool use and iterative execution expand the attack surface compared to single-turn text generation: an agent that can read files and execute commands is a fundamentally different risk surface than a chatbot that answers questions. The practical risks include prompt injection through repository content, exfiltration of code or secrets through tool calls, and manipulation of agent behavior through poisoned context.
Two disclosed CVEs show what this looks like when it reaches production. CVE-2025-53773 is a command injection vulnerability in GitHub Copilot and Visual Studio (CVSS 7.8, patched August 2025). CVE-2025-54135, nicknamed “CurXecute,” is a similar class of issue in the Cursor editor (CVSS 8.6 per vendor advisory, patched in Cursor 1.3.9, July 2025). In both cases, public write-ups describe how prompt-injection-style inputs in files an agent reads can translate into unintended command execution. The agent does exactly what it was told, just not by the developer. (Survey: Security of LLM-based agents, ScienceDirect 2025) (Embrace The Red: CVE-2025-53773) (Aim Security: CurXecute)
What this means for the playbook: execution rights require sandboxing, least privilege, and strong audit trails, and the controls in Day 90 are the minimum when you have tools that can run commands.
Day 90: Make It Survive the Worst Week
By Day 90 the basics are muscle memory. PRs get reviewed, checks run, secrets get caught. The question shifts from “are we doing something?” to “could we survive an audit, an incident, or a very pointed question from a customer?”
This phase is about one thing: being able to answer “what happened, who approved it, and why?” without scrambling through Slack threads at 2 AM.
Least privilege identities per environment. This sounds obvious until you audit what’s actually running in your pipelines. In practice, it means no long-lived production credentials stored as CI secrets. Replace them with short-lived identities: OIDC federation between your CI provider and your cloud, workload identity in Kubernetes, or federated tokens where those aren’t available. In GitHub Actions, that looks like configuring the id-token: write permission and using your cloud provider's OIDC login action instead of storing a long-lived secret. AWS, Azure, and GCP all support this pattern, and the setup is similar across all three: you create a trust relationship between your CI provider and your cloud identity system, scoped to a specific repo and branch, so the token only works for that pipeline running against that branch. Scope RBAC tightly and separate dev, staging, and production identities completely, because if a credential can reach production, it is production infrastructure whether you labeled it that way or not. Teams that do this audit almost always discover at least one identity with broader access than anyone intended.
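In GitHub Actions with AWS, the pattern looks roughly like this (role ARN, account ID, and region are placeholders; the IAM role must already trust GitHub's OIDC provider, scoped to this repo and branch):

```yaml
name: deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write   # allow the job to request an OIDC token
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # No stored secret: the role is assumed via a short-lived OIDC token.
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy-prod
          aws-region: eu-west-1
      - run: ./deploy.sh   # runs with the role's temporary credentials
```

Azure and GCP have equivalent login actions; the shape of the workflow is the same.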
Environment approvals and separation of duties. Production deployments should require an approval gate, and that approval should be auditable. In GitHub Actions, this means configuring environment protection rules on your production environment: add required reviewers (at least one person who is not the PR author), and enable "Prevent self-review" so the person who triggered the deployment cannot also approve it. The deployment logs then show exactly who approved, what commit SHA was deployed, and what time the approval was given. That paper trail is the difference between a 30-minute incident review and a three-day forensic investigation.
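On the workflow side, the gate is just the environment reference; the protection rules configured on that environment do the rest (names and URL illustrative):

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    # The job pauses here until a required reviewer (who is not the author,
    # with "Prevent self-review" enabled) approves the deployment.
    environment:
      name: production
      url: https://app.example.com
    steps:
      - run: ./deploy.sh
```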
Runner isolation for sensitive jobs. If a runner can deploy to production, treat it as production-adjacent infrastructure. For self-hosted runners, use dedicated runner groups for production deployments with labels like runs-on: production-deploy, restrict those runners so only specific workflows can use them, and put them in a separate network segment with access only to production endpoints and your artifact registry. Ephemeral runners that spin up clean for each job and get destroyed after are even better, because you eliminate the possibility of state leaking between workflow runs. If you're on GitHub-hosted runners, be aware that standard hosted runners share infrastructure and don't offer network-level isolation out of the box. Private networking options exist but they require enterprise-tier plans and additional cloud configuration, so evaluate whether your deployment security requirements justify self-hosted runners instead. This matters more now that AI tools generate deployment scripts and workflow files at volume, because the damage a bad workflow can do is not "this PR has a bug," it's "this workflow has permissions to push artifacts to production and nobody noticed the scope."
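Targeting the dedicated group is then a matter of labels (names illustrative; the real restriction lives in the runner group's settings, which control which repositories and workflows may use it):

```yaml
jobs:
  deploy-prod:
    # Only runners in the network-isolated production group carry this label,
    # so no other job can schedule onto them.
    runs-on: [self-hosted, production-deploy]
    steps:
      - run: ./release.sh
```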
Policy as code for the patterns that should never pass. By Day 90, certain things should be automatically blocked before they reach a reviewer. Use Open Policy Agent (OPA) with Rego policies, or if you’re in GitHub, branch rulesets combined with repository rules. Automate these patterns first: workflows requesting permissions: write-all or contents: write without explicit justification; pull requests that modify .github/workflows/** without approval from the platform team, enforced through CODEOWNERS plus a required status check that validates the approval; infrastructure-as-code changes that open inbound access from 0.0.0.0/0 on non-HTTP ports, which means “allow the entire internet to connect” and is one of the most common cloud misconfigurations, usually because someone wrote a security group rule for SSH or a database port and forgot to restrict the source IP range; and deployment artifacts that are unsigned when your policy requires signing. Enforce these through pre-merge checks that run OPA against the PR diff, or through CI steps that validate the final configuration state before deployment proceeds. Start the highest-confidence rules as blocking. For noisier ones, start as warnings but set an explicit deadline to tune or delete them, because advisory mode without a deadline is just a more expensive way to generate alerts nobody reads.
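As one concrete instance, the write-all check might look like this in Rego (using the post-1.0 `contains`/`if` syntax; `input` is assumed to be the parsed workflow YAML, fed in via conftest or `opa eval` after YAML-to-JSON conversion):

```rego
package ci.workflow_permissions

import rego.v1

# Block workflows that request blanket write access at the top level.
deny contains msg if {
	input.permissions == "write-all"
	msg := "workflow requests permissions: write-all; scope permissions per job"
}

# Also catch the same request at the job level.
deny contains msg if {
	some job_name, job in input.jobs
	job.permissions == "write-all"
	msg := sprintf("job %q requests write-all permissions", [job_name])
}
```

The same shape extends to the other patterns: one `deny` rule per thing that should never pass, evaluated as a required pre-merge check.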
Day 90 goal: your security posture survives the worst week of the year, not just the best.
Final Thought
If you’re already using AI tooling in your development workflow, you’re not early, you’re normal. The Stack Overflow numbers say 84% of developers are in the same position, and the real question is whether your guardrails grew alongside the tools or whether they’re still where they were two years ago when the main concern was someone copy-pasting from the wrong Stack Overflow answer.
Every section in this article is something you can start without a massive initiative. Day 1 is branch protection and CODEOWNERS. Day 30 is mandatory checks and automated review layers. Day 90 is least privilege and policy as code. None of it requires a new department or a six-month project, it just requires deciding that the speed of shipping doesn’t get to outrun the speed of knowing what you shipped.
The controls that protect you from a bad human commit are the same ones that protect you from a bad AI-generated commit. The only difference is volume, and now you know where to start.
Sources
Stack Overflow 2025 Developer Survey — 49,000+ respondents (May-June 2025). Stats cited are from separate questions in the AI section: 84% using or planning to use AI tools; 46% distrust vs 33% trust AI accuracy; 66% cite “almost right, but not quite” as the most common frustration.
Veracode 2025 GenAI Code Security Report — Tested 80 coding tasks across 100+ LLMs in Java, Python, C#, and JavaScript. Published July 2025. Stats cited: 45% introduced OWASP Top 10 vulnerabilities, Java at 72%.
Aikido Security: State of AI in Security & Development 2026 — Survey of 450 developers, AppSec engineers, and CISOs across Europe and the US. Found one in five organizations suffered a security incident from AI-generated code, 69% found vulnerabilities in AI code.
KPMG and University of Melbourne: Trust in AI 2025 — Global study of 48,000+ workers across industries in 47 countries. Found 57% of employees conceal AI usage at work.
Harvard Business Review: The Hidden Penalty of Using AI at Work — Controlled experiment with 1,026 engineers evaluating identical Python code. Found 9% competence penalty when reviewers believed AI was used. Published August 2025.
Faros AI: The AI Productivity Paradox 2025 — Telemetry from 10,000+ developers across 1,255 teams. Published July 2025. Found 98% more PRs merged, 91% longer review times.
SLSA Framework — Supply-chain Levels for Software Artifacts. OpenSSF project. Incremental levels for supply chain security.
Security of LLM-based Agents: A Comprehensive Survey — Academic survey published on ScienceDirect (2025) covering attack methods, defense mechanisms, and real-world vulnerabilities in LLM agent systems.
CVE-2025-53773: GitHub Copilot and Visual Studio — CVSS 7.8 (High). Command injection issue; patched August 2025. Additional analysis: Persistent Security.
CVE-2025-54135: Cursor RCE via MCP Prompt Injection (CurXecute) — CVSS 8.6 (High) per vendor advisory. Poisoned MCP server data could rewrite global MCP config and execute attacker-controlled commands. Patched in Cursor 1.3.9, July 2025. Detailed exploit chain documented by Aim Security.
AGENTS.md Open Format — Open format for project-level instructions to AI coding agents. Supported by multiple tools including Codex and Cursor. Stewarded by the Agentic AI Foundation under the Linux Foundation.
