Paulo Victor Leite Lima Gomes

Posted on Jun 27

AI code governance is the new code review bottleneck

#ai #governance #codereview #developertools

The most believable AI coding story right now is not that everyone is suddenly shipping ten times faster.

It is that many teams are producing more code than their review systems were designed to absorb.

GitLab released new AI accountability research this week with a very familiar shape. Adoption is high. Output is faster. Leaders see ROI. Developers are using multiple AI coding tools. Then the uncomfortable part arrives: 85% of respondents agree that AI has shifted the bottleneck from writing code to reviewing and validating it, and 84% agree that the biggest challenge is governing what happens to AI-generated code after it is created.

That feels right.

For years, the sales pitch was "AI will help you write code." Fine. It does. Sometimes well, sometimes badly, often usefully enough.

But writing code was never the whole job.

The job is deciding whether this code should exist, whether it belongs here, whether it behaves correctly under boring production conditions, whether anyone can maintain it later, and whether the team is willing to own the consequences after the model has moved on to the next task.

That is not a generation problem.

That is a governance problem.

generation is the easy demo

Code generation demos are satisfying because they have a clear before and after.

You describe something. The tool writes files. Tests may appear. A UI renders. A pull request opens. Everyone can point at the artifact and say the machine did work.

Governance is harder to demo because the output is mostly absence.

The unsafe dependency did not get merged. The generated migration did not break rollback. The agent did not invent a second billing path. The reviewer caught the place where the implementation matched the prompt but violated the system. The team could explain where a generated change came from during an incident. The audit trail existed when someone needed it.

That is less exciting than watching a tool build a feature from a paragraph.

It is also closer to the real work.

The GitLab numbers are interesting because they separate speed from control. According to the release, 78% of organizations say developers are writing and committing code faster since adopting AI tools. At the same time, 43% say they cannot reliably distinguish AI-generated code from human-written code in their own codebase, and 82% say AI-generated code risks creating a new kind of technical debt they are not prepared to manage.

That is the AI paradox in one sentence: the input got cheaper, but the output still has to live in a system.

review is now a production system

Code review used to be uncomfortable enough when humans wrote all the code.

Now add background agents, generated tests, speculative refactors, prototype code promoted too quickly, and pull requests written by people who may not fully understand every line the tool produced.

The review queue becomes the pressure valve.

If it works, AI coding can be leverage. If it fails, the organization just found a faster way to manufacture uncertainty.

This is why I think review needs to be treated less like a social ritual and more like a production system.

It has inputs, queues, service levels, failure modes, ownership, and saturation points. It can be overloaded. It can drop important work. It can create hidden toil. It can reward the wrong behavior if all the incentives point at "more code shipped" and none point at "less code rejected later."
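To make that framing a little more concrete, here is a minimal Python sketch of a review-queue report. Everything in it is my own illustration, not something from the GitLab research: the `ReviewItem` fields, the `queue_report` function, and the 24-hour service level are assumptions you would swap for your own data and targets.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ReviewItem:
    # Hypothetical shape of one pull request in the review queue.
    opened_at: datetime
    first_review_at: Optional[datetime]  # None means nobody has reviewed it yet


def queue_report(items: List[ReviewItem], now: datetime,
                 slo: timedelta = timedelta(hours=24)) -> dict:
    """Summarize backlog size, longest wait, and service-level breaches."""
    # For items still waiting, measure the wait up to `now`.
    waits = [(item.first_review_at or now) - item.opened_at for item in items]
    waiting = sum(1 for item in items if item.first_review_at is None)
    breaches = sum(1 for wait in waits if wait > slo)
    return {
        "total": len(items),
        "waiting": waiting,
        "longest_wait_hours": max(
            (wait.total_seconds() / 3600 for wait in waits), default=0.0
        ),
        "slo_breach_rate": breaches / len(items) if items else 0.0,
    }
```

The point is not these particular numbers. It is that a review queue can be measured the same way you would measure any other queue that is allowed to saturate.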

The old review process may not survive the new code volume.

That does not mean every company needs a giant governance platform tomorrow. It does mean teams should stop pretending that a human rubber stamp at the end of an AI-heavy workflow is enough.

Review has to move earlier.

Before the agent starts, the task should define boundaries: allowed files, forbidden dependencies, migration constraints, test expectations, security assumptions, and what evidence the agent must produce.
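A task boundary like that can be small enough to check mechanically. The sketch below is one possible shape in Python, under my own assumptions: `TaskBoundary`, `check_change`, the glob patterns, and the evidence labels are hypothetical names for illustration, not part of any real tool.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from typing import List, Set


@dataclass
class TaskBoundary:
    # Hypothetical task definition written down before the agent starts.
    allowed_paths: List[str]  # glob patterns; note "*" also matches "/"
    forbidden_dependencies: Set[str] = field(default_factory=set)
    required_evidence: Set[str] = field(default_factory=set)


def check_change(boundary: TaskBoundary, changed_files: List[str],
                 added_dependencies: List[str], evidence: List[str]) -> List[str]:
    """Return human-readable violations; an empty list means in bounds."""
    violations = []
    for path in changed_files:
        if not any(fnmatchcase(path, pattern) for pattern in boundary.allowed_paths):
            violations.append(f"file outside boundary: {path}")
    for dep in sorted(set(added_dependencies) & boundary.forbidden_dependencies):
        violations.append(f"forbidden dependency: {dep}")
    for item in sorted(boundary.required_evidence - set(evidence)):
        violations.append(f"missing evidence: {item}")
    return violations
```

A reviewer who sees an empty violation list still has to read the diff, but at least the first question, "did the agent stay inside the task?", has an answer before the review starts.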

During the work, the system should retain traces: prompt, model, tool calls, commands, files read, tests run, failures encountered, and human interventions.
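As a rough sketch of what retaining those traces could look like, here is a Python record that mirrors the list above and serializes to one JSON line per agent run. The `AgentTrace` class and its field names are my own assumptions; real agent tools will expose different data, and you would adapt the fields to whatever yours actually records.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AgentTrace:
    # Hypothetical record of one agent run; fields follow the list in the text.
    prompt: str
    model: str
    tool_calls: List[str] = field(default_factory=list)
    commands: List[str] = field(default_factory=list)
    files_read: List[str] = field(default_factory=list)
    tests_run: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)
    human_interventions: List[str] = field(default_factory=list)


def to_json_line(trace: AgentTrace) -> str:
    """Serialize a trace to a single line, suitable for an append-only log."""
    return json.dumps(asdict(trace), sort_keys=True)


def from_json_line(line: str) -> AgentTrace:
    """Rebuild a trace from a line previously written by to_json_line."""
    return AgentTrace(**json.loads(line))
```

An append-only file of these lines, stored next to the repository or attached to the pull request, is already more than most teams can produce during an incident.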

After the work, the pull request should show enough context for a reviewer to make a decision without becoming a forensic archaeologist.

That is governance.

Not a committee. Not a 40-page policy. The minimum operational machinery required to trust work you did not personally type.

provenance is not a luxury

One of the most uncomfortable GitLab findings is the traceability gap.

Organizations are confident they could determine whether AI-generated code contributed to a production incident, but many that had an incident in the past year could not actually make that determination.

That is exactly the kind of thing teams discover too late.

In calm planning meetings, everyone assumes the history will be reconstructable. The pull request is there. The commit is there. The issue is there. The agent transcript is probably somewhere. Someone can search Slack. Someone remembers which tool was used.

Then an incident happens, the customer is angry, the security team asks for an answer, and the evidence is spread across five products, two browser tabs, a local terminal history, and a summary written by a model that no longer has the same context.

Provenance matters because responsibility needs a trail.

Where did this change come from? Was it generated, edited, or hand-written? Which instructions guided it? Which tests passed? Which warnings were ignored? Who approved it? Was the model allowed to use external sources? Did it read the right internal docs? Did a human change the risky part later?

Without that trail, "AI-generated" becomes a vibe, not a useful engineering fact.

And vibes are terrible incident artifacts.

governance cannot mean blame

There is a bad version of AI code governance that I hope we avoid.

It turns into a surveillance layer for developers.

Every line is labeled. Every model interaction is scored. Every generated commit becomes a compliance object. Managers start asking why one person produced more AI code than another. Reviewers become risk clerks. Developers hide tool usage because the audit trail feels like a trap.

That would be a waste.

The goal should not be to shame people for using AI or to create a purity test between human and generated code. The codebase does not care whether a line began in a model, an autocomplete, a snippet, a Stack Overflow answer, or a tired engineer at 11 p.m.

The codebase cares whether the line is correct, maintainable, secure, observable, and owned.

Governance should make those questions easier to answer.

A useful system says: this change was agent-assisted, these were the instructions, these files were touched, these tests were run, these risks were declared, this human approved it, and this is the evidence trail if we need to inspect it later.
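One lightweight way to carry that statement with the change is the `Key: value` trailer block at the end of a Git commit message. The Python sketch below builds such a block. The trailer names (`Agent-Assisted`, `Agent-Instructions`, `Files-Touched`, `Tests-Run`, `Declared-Risk`, `Approved-By`) are names I made up for this example, not an established standard, and the paths and reviewer in the usage note are placeholders.

```python
from typing import List


def provenance_trailers(instructions_ref: str, files: List[str], tests: List[str],
                        risks: List[str], approver: str) -> str:
    """Build a trailer block describing an agent-assisted change.

    All trailer keys here are hypothetical conventions for illustration.
    """
    lines = ["Agent-Assisted: yes", f"Agent-Instructions: {instructions_ref}"]
    lines += [f"Files-Touched: {path}" for path in files]
    lines += [f"Tests-Run: {test}" for test in tests]
    lines += [f"Declared-Risk: {risk}" for risk in risks]
    lines.append(f"Approved-By: {approver}")
    return "\n".join(lines)


def append_trailers(message: str, trailers: str) -> str:
    """Attach the trailer block as the final paragraph of a commit message."""
    return message.rstrip("\n") + "\n\n" + trailers + "\n"
```

Calling `append_trailers("Fix retry backoff", provenance_trailers("docs/agent-tasks/billing-retry.md", ["src/billing/retry.py"], ["pytest tests/billing"], ["touches retry timing"], "Reviewer Name <reviewer@example.com>"))` yields a commit message whose last paragraph answers the questions above without adding a new tool.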

A bad system says: here is a leaderboard of who used the most AI.

One helps teams operate software.

The other recreates the lines-of-code metric with a shinier dashboard.

small teams need this too

It is tempting to treat AI governance as an enterprise problem.

Big companies have compliance departments, procurement processes, security review, audit requirements, and enough tool sprawl to make everything feel like a platform problem. Of course they need governance.

But small teams have the same underlying issue, just with fewer meetings.

A three-person startup can also merge generated code nobody understands. A solo maintainer can accept an agent-written dependency update that passes tests but changes a subtle behavior. A tiny product team can prototype with AI, ship the prototype, and then spend six months living inside decisions nobody remembers making.

The difference is that small teams usually cannot buy their way out with process.

They need lightweight habits.

Write down repository instructions. Keep generated pull requests smaller than the tool wants to make them. Require tests that prove behavior, not only implementation details. Ask the agent to explain tradeoffs and rejected approaches. Save transcripts for risky changes. Make reviewers check the task boundary, not just the diff. Delete generated code aggressively when nobody wants to own it.
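The "smaller pull requests" habit is the easiest one to automate. As a sketch, the Python below parses the output of `git diff --numstat` (tab-separated added lines, deleted lines, and path, with `-` for binary files) and flags changes that exceed a size budget. The `parse_numstat` and `over_budget` helpers and the default limits of 400 lines and 15 files are arbitrary assumptions, not recommendations from any study.

```python
from typing import List, Tuple


def parse_numstat(text: str) -> List[Tuple[int, int, str]]:
    """Parse `git diff --numstat` output into (added, deleted, path) rows."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        # Binary files report "-" for both counts; count them as zero lines.
        rows.append((
            0 if added == "-" else int(added),
            0 if deleted == "-" else int(deleted),
            path,
        ))
    return rows


def over_budget(rows: List[Tuple[int, int, str]],
                max_lines: int = 400, max_files: int = 15) -> List[str]:
    """Return reasons the change is too large; empty means within budget."""
    total = sum(added + deleted for added, deleted, _ in rows)
    reasons = []
    if total > max_lines:
        reasons.append(f"{total} changed lines exceeds budget of {max_lines}")
    if len(rows) > max_files:
        reasons.append(f"{len(rows)} changed files exceeds budget of {max_files}")
    return reasons
```

A check like this will not tell you whether the code is good. It only forces the conversation about splitting the change to happen before a reviewer is staring at it.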

Most of that is not expensive.

It is discipline.

the new bottleneck is judgment

The interesting career angle is that AI does not make engineering judgment less valuable.

It makes judgment the scarce part.

If writing a first draft of code gets cheaper, the valuable engineer is the one who can decide whether the draft is any good. That means reading code carefully. Understanding system boundaries. Knowing when a test is meaningful. Seeing when a change is too large. Naming the hidden dependency. Rejecting plausible nonsense. Explaining why "works" is not the same as "belongs."

This is uncomfortable because judgment is harder to teach than syntax.

It is also harder to measure. You can count generated lines. You can count merged pull requests. You can count review comments. It is much harder to count the bad migration that never happened because someone asked one annoying question at the right time.

But that is the work.

The teams that get better at AI-assisted development will not simply be the teams with the strongest models. They will be the teams that turn judgment into reusable systems: task templates, review checklists, repository instructions, traceable evidence, good tests, clear ownership, and a culture where rejecting generated work is normal.

That last part matters.

If the organization treats every AI pull request as productivity that must be preserved, reviewers will feel pressure to salvage bad work. Sometimes the correct review is "no." Sometimes the best outcome is deleting the branch. Sometimes the agent did exactly what it was asked and the request was wrong.

Governance has to make that acceptable.

the punchline

AI coding has made writing code cheaper, but it has not made software cheaper to own.

That is the part hidden inside the GitLab research. The bottleneck is moving from generation to review, validation, provenance, and accountability. The hard question is no longer "can the tool produce code?"

Of course it can.

The hard question is "can the team control what happens after the code exists?"

That means knowing where code came from, what it was supposed to do, how it was checked, who approved it, and who owns it in production.

If teams build that machinery, AI coding can become real leverage.

If they do not, they will generate code faster than they can understand it, merge it faster than they can govern it, and call the cleanup "technical debt" later.

Which is technically correct.

But also a very expensive way to learn that code review was the product all along.


Top comments (2)

Alex Shev • Jun 27

AI code governance is becoming the review bottleneck because the unit of risk changed. Reviewers are no longer checking only a diff; they are checking whether the prompt, context, generated code, tests, and assumptions line up. That needs better metadata around the change, not just more reviewer patience.

alberfranquesa • Jul 7

The point about governance meaning "minimum operational machinery to trust work you didn't type" is exactly right. One practical implication: review needs a spec, not just a diff. When a PR arrives from an agent, the reviewer's job is to check whether the agent stayed within declared boundaries such as allowed files, forbidden patterns, and test coverage expectations. That's auditable. "Does this look right to me?" at merge time isn't.