Sho Naka

Posted on Jul 1 • Edited on Jul 14

Why do reviews get stuck even though AI made development faster?

#ai

This article is based on the presentation video and slides by Mr. Satoshi Yokota of Class Method, released at the AI-Driven Development Conference 2025 Autumn. Thinking it through together with AI, I sorted out the friction points of AI adoption that come up in my work as an AI-adoption trainer and consultant. Quotations from the presentation are kept short; interpretation and final confirmation are the author's responsibility.

Teams that adopt AI get faster at first.
However, as speed increases, the room left for review and understanding can run out.

The story told in the Class Method presentation, about AI-driven development falling apart in about two weeks, was not a story about AI suddenly going berserk.
To me, it looked like the team's bottleneck had shifted from implementation to review once AI was introduced.

AI writes code.
But who can explain the diff?
Who can approve it?
Who will keep it in a state where it can be repaired later?

If you accept that "development has become faster" without looking at these questions, your team is not increasing its output; it is only increasing the deferred payment for understanding.
That bill comes due within a short time.
This is what struck me most when I read about the failures of AI-driven development.

The moment you get faster, you have more reading work to do

The initial feeling of satisfaction when introducing AI is quite strong.
A template that previously took half a day can be created in a few minutes.
When you are unsure how to write a test, the AI lines up candidates.
For a small fix, you get a diff just by giving an instruction and waiting a moment.

This is really useful.
I also use it every day.
It's a common pattern once AI enters a workflow, in training settings and in project work alike: the hands move faster.

However, the speed at which code appears is not the same as the speed at which code is understood.
Even if AI can produce 10 diffs, a human cannot read 10 diffs any faster.
If anything, code produced by AI has to be reverse-engineered to find out why it was written this way.

When a human writes the code, the author has an intent.
Review examines the gap between that intent and the implementation.
With an AI diff, on the other hand, the intent may only come afterward.
The reviewer digs up the underlying assumptions while reading code that already runs.

This work is heavier than you might imagine.
When writing gets faster, the reading work piles up in one place.

If all you see is "AI has made development faster," you will misjudge the situation.
More precisely, part of the implementation step got faster and the work moved to the review step.
To see the flow of the whole team, you have to look at the capacity of the place the work moved to.

Two weeks is not a magical deadline

During the presentation, there was a remark to the effect that "if you develop by feel alone, with no setup, it falls apart in about two weeks."

I think it's better to read this "two weeks" as the short distance a team can travel before understanding comes loose, rather than as an exact deadline on a calendar.

Code created by AI works at first.
Because it runs, you feel safe.
As you relax, the detailed checks thin out little by little.
The next diff is stacked on top of that thinner verification.

Then one day a bug appears.
You open the file.
You trace the function.
But you cannot tell why it has this shape.
Unused code is left behind.
Functions with the same responsibility sit side by side under slightly different names.
The tests pass, but you cannot tell what they guarantee.

At this point, the team no longer has “AI-written code.”
The team has code that they don't own.

Signs like these tend to show up right before a team loses its grip on a codebase.

It's working for now.
It was written by AI, so I can't follow the details.
I'll sort it out later.
Who is using this file?

Once these four remarks start showing up together, speed is no longer the issue.
The stock of understanding is running out.

What's scarier than the AI's groveling apology is that deletion boundaries were never set

The most memorable part of the presentation was the scene where the AI apologized after a file disappeared during a long session.
The story is told in a funny way, but I don't read it as a story about "AI being scary."

What's scary isn't that the AI deleted the files, but rather that the team didn't set the boundaries of what could and couldn't be deleted in advance.

Manage everything with Git.
Review before making destructive changes.
Check the status of untracked files before touching them.
Keep deliverables separate from the originals.

In human-to-human development, these rules can be filled in by the team's unspoken atmosphere.
New members hear them by word of mouth.
They come up in review.
If you are about to perform a dangerous operation, someone stops you.

But AI has no access to that atmosphere.
Any rule you don't pass in up front is treated as if it doesn't exist.
And a gap does not stay empty.
AI fills it with a plausible assumption and moves on.

So the cause of the accident is not the AI's personality, but the team's operational boundaries.
The moment you introduce AI, boundaries that had been left vague are turned into implementation on the spot.
If the boundaries are wrong, code will pile on top of false assumptions.
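
One of the boundaries listed above, checking untracked files before touching them, can be turned into an explicit check rather than an unspoken habit. The following is a minimal sketch, not something shown in the presentation: the function names and the "stop if anything is untracked" rule are assumptions made for illustration. It relies only on `git status --porcelain`, which prefixes untracked paths with `??`.

```python
import subprocess


def parse_untracked(porcelain_output: str) -> list[str]:
    """Return the paths that `git status --porcelain` marks as untracked ("??").

    Note: Git may quote paths that contain special characters; this sketch
    returns them exactly as Git printed them.
    """
    return [
        line[3:]
        for line in porcelain_output.splitlines()
        if line.startswith("?? ")
    ]


def untracked_files(repo_dir: str = ".") -> list[str]:
    """Ask Git which files in repo_dir are untracked (requires git on PATH)."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_untracked(result.stdout)


def may_run_destructive_step(untracked: list[str]) -> bool:
    """Hypothetical team rule: stop if anything exists that Git cannot restore."""
    return len(untracked) == 0


# Sample porcelain output, so the parsing can be tried without a repository.
sample = "?? notes/draft.md\n M src/app.py\n?? scratch.sql\n"
print(parse_untracked(sample))
print(may_run_destructive_step(parse_untracked(sample)))
```

In a real repository, `may_run_destructive_step(untracked_files())` would be called before any step that deletes or overwrites files. The point is less the code than the fact that the boundary is now written down where both humans and AI can see it.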

How many people can one tech lead cover?

Another important point is review capacity.
During the presentation, it was said that in AI-driven development the limit is about three subordinates per tech lead.

There is no need to use this number itself as a fixed value.
But the message is pretty clear.
Even if we introduce AI, the human reviewers will not disappear.

In fact, the more junior members can write with AI, the more work lands on the tech lead.
The number of diffs waiting for review grows.
The tech lead has to work out the intent behind each diff.
Find out why it works.
Push back unnatural abstractions and changes that reach too far.

AI has the potential to reduce human work.
However, it does not eliminate review responsibility.

What teams tend to do here is lighten the review.
"It runs, so it's fine."
"It was written by AI, so I'll look at the details later."
"I don't have time, so I'll let it through this time."

If this judgment call keeps repeating, review debt grows.
Review debt is the diffs you let through without reading.
At first it's small.
However, since the next diff is stacked on top of the unread one, the cost of reading it later increases.

The metric you should look at when introducing AI is not the number of lines generated.
It is who can explain the diff.

If the number of diffs someone can explain is growing, your team is getting faster.
If the number of unexplained diffs is growing, the team is only breaking down faster.

The trouble with review debt is that it's hard to quantify.
The number of pull requests will increase.
The number of commits will also increase.
It works in the demo.
So, from the outside, it looks like productivity has increased.

But on the inside, only the reviewer feels a strange heaviness.
Every time they open a diff, they cannot tell how far they need to read.
It was supposed to be a small change, yet peripheral files were touched as well.
The explanation written by the AI and the actual diff do not quite match.

If this discomfort is left unaddressed, teams retreat into shorter reviews.
A short review helps in the moment.
However, it simply hands the work to whoever has to read the code later.
If you want to see the improvement after introducing AI, you need more than just a speed graph.
You need to look at the diffs that stalled in review, the diffs that had to be re-explained, and the diffs that were sent back.
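
The three counts named above can be tracked with very little machinery. The following is a rough sketch under stated assumptions: the `ReviewRecord` type and its field names are hypothetical, and the labels are assumed to be recorded by hand by the reviewer rather than pulled from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class ReviewRecord:
    """One reviewed diff. The fields are hypothetical labels set by the reviewer."""

    stalled_in_review: bool  # the diff sat waiting because nobody could read it
    re_explained: bool       # the author had to explain the diff a second time
    sent_back: bool          # the diff was returned instead of being merged


def review_signals(records: list[ReviewRecord]) -> dict[str, int]:
    """Count the review-side signals that a speed graph does not show."""
    return {
        "total": len(records),
        "stalled_in_review": sum(r.stalled_in_review for r in records),
        "re_explained": sum(r.re_explained for r in records),
        "sent_back": sum(r.sent_back for r in records),
    }


# Made-up records for illustration only.
records = [
    ReviewRecord(stalled_in_review=False, re_explained=False, sent_back=False),
    ReviewRecord(stalled_in_review=True, re_explained=True, sent_back=False),
    ReviewRecord(stalled_in_review=True, re_explained=False, sent_back=True),
]
print(review_signals(records))
```

Placed next to the usual pull request and commit counts, these numbers give a view of whether reading is keeping up with writing.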

The 7 types of failures are not separate stories

In the Class Method presentation, 10 categories of success stories and 7 categories of failure stories were introduced.
Due to time constraints, not all of them were covered in depth, but the following were named: lack of understanding, following specifications, poor quality, lack of understanding of environmental constraints, unstable output quality, and accumulation of understanding debt.

The accumulation of understanding debt is also the very subject of this article.
AI lets a codebase grow larger and larger.
And it runs.
While it runs, humans can keep moving forward without following what is inside.
Then, after some time, the moment it stops working, no one can figure out why.
As the AI tries to get around the problem by trial and error, the codebase keeps growing and the amount of unused code increases.
On stage, this was introduced as the example that falls apart in about two weeks.

If you look at each one individually, they seem to be different problems.
However, seen from a practical perspective, they grow from nearly the same root.

Telling the AI "the same as the existing one" without knowing what you want built.
Not writing down which OS the code will run on, so the AI builds on different assumptions.
Not deciding how much to fix, so what was meant to be a one-line correction turns into a complete overhaul.
Not passing in Git rules or no-delete boundaries, and only saying "I wish you'd told me first" after something breaks.
Having no review standards, so working code stays not understood.
Moving ahead as soon as it runs, without making it a precondition that someone understands the contents.

In each case, AI is filling in assumptions that humans did not pass along.

AI won't say "I don't know, so I'll stop" every time.
In fact, it fills the gap quite naturally.
If the assumptions it filled in are right, things move quickly.
If they are wrong, the next implementation is piled on top of a wrong premise.

Another peculiarity of AI was pointed out during the presentation.
It may act as if it knows libraries and code that it does not actually know well.
AI is familiar with widely used libraries.
But of course it does not know obscure libraries, or ones used only inside a company.
When it doesn't know, it should just say so, but sometimes it makes something up and says, "I understand."

This is the same structure after all.
AI says "I don't understand" less often than humans expect.
For now, the role of detecting and stopping things that we don't understand remains with humans.

Therefore, reading the 7 types of failures as "quirks of AI tools" is a little shallow.
I read this as a record of undefined development rules exposed by AI.

Three signals to look for on the first day of adoption

So what should you look for on the first day of your AI adoption?
I've narrowed it down to three.

First, are the places that must not be touched marked as such?
In a codebase, there are files that would cause trouble if deleted, files that should not be touched because they are generated, and files that should only be updated manually.
What a human would pick up by looking around has to be given to the AI as text.

Second, are the assumptions behind the review written down?
Which changes can proceed automatically?
Which changes require human verification?
Which commands must be stopped before they run?
If this is ambiguous, the AI will touch things broadly, assuming it is helping.

Third, do you have a habit of explaining diffs?
Instead of passing along what the AI created as is, have the author write a short explanation of "why this diff is necessary."
The reviewer checks whether the diff matches the explanation.
This one step alone makes the review much easier to read.
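
The first two signals above amount to a small classification rule: some paths are off limits, some changes need a human, and the rest can proceed. The following is a minimal sketch under stated assumptions. The glob patterns and category names are hypothetical examples, not taken from the presentation, and each team would fill in its own.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; a real team would list its own files here.
DO_NOT_TOUCH = ["generated/*", "*.lock"]           # generated or tool-managed files
NEEDS_HUMAN_REVIEW = ["migrations/*", "infra/*"]   # changes a person must verify


def classify_change(path: str) -> str:
    """Return "blocked", "human_review", or "auto" for a changed file path."""
    if any(fnmatchcase(path, pattern) for pattern in DO_NOT_TOUCH):
        return "blocked"
    if any(fnmatchcase(path, pattern) for pattern in NEEDS_HUMAN_REVIEW):
        return "human_review"
    return "auto"


# Made-up paths for illustration only.
for changed in ["generated/api/client.py", "migrations/0001_init.sql", "src/app.py"]:
    print(changed, "->", classify_change(changed))
```

Note that in `fnmatchcase` the `*` wildcard also matches path separators, so `generated/*` covers nested directories. Whether the rule lives in code, in a review checklist, or in a context file matters less than having it written somewhere the AI can read.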

Let me draw one more auxiliary line here.
The more work you hand to the AI, the more important it is to state the completion conditions first.
If you leave things to AI without any completion conditions, AI will expand in the direction of "making things better."
Trying to make things better, it also fixes files that have nothing to do with the task.
It also introduces abstractions that aren't related.
The diff grows beyond the original request.

Completion conditions are not meant to constrain the AI's abilities.
They exist to keep the diff at a size that a human can review.
Once this idea is in place, AI implementation becomes much easier to handle.
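
A completion condition of this kind can be as simple as a list of files the task is allowed to touch plus a ceiling on diff size. The following is a rough sketch, assuming a hypothetical `CompletionCondition` type and an arbitrary line limit; neither comes from the presentation.

```python
from dataclasses import dataclass


@dataclass
class CompletionCondition:
    """Hypothetical scope agreed on before the task is handed to the AI."""

    allowed_paths: set[str]   # files the task is expected to change
    max_changed_lines: int    # largest diff the team is willing to review at once


def check_scope(condition: CompletionCondition, changed: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the diff stayed in scope.

    `changed` maps each changed file path to its number of changed lines.
    """
    problems: list[str] = []
    out_of_scope = sorted(p for p in changed if p not in condition.allowed_paths)
    if out_of_scope:
        problems.append("files outside the task: " + ", ".join(out_of_scope))
    total = sum(changed.values())
    if total > condition.max_changed_lines:
        problems.append(
            f"diff too large: {total} lines > {condition.max_changed_lines}"
        )
    return problems


# Made-up task and diff for illustration only.
condition = CompletionCondition(
    allowed_paths={"src/price.py", "tests/test_price.py"},
    max_changed_lines=80,
)
print(check_scope(condition, {"src/price.py": 60, "src/utils.py": 40}))
```

A non-empty result is a cue to split the work or send it back before review starts, not a verdict on the quality of the code.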

This is where a context definition file like CLAUDE.md comes into play.
However, it's not a magical configuration file.
It is the boundaries the team has decided on, put into a form the AI can read.

What I want to focus on here is the story that comes before the wording of the prompt.
The problem with AI adoption manifests itself in the design of reviews and operational boundaries, not the wording of prompts.
If you get this order wrong, your team will get stuck, no matter how carefully you write your prompts.

Instead of slowing down, build a vessel that can hold the speed

When reading about failures in AI-driven development, it is tempting to conclude that AI is dangerous after all.
However, if you do that, you will miss out on the lessons learned from the field.

The point is not to stop using AI because it is dangerous.
Because AI is fast, we need a vessel to catch it.

That vessel is the review system.
Operational boundaries.
Context definition.
The development habit of building small and checking small.

If you receive only speed without providing these things, your team will appear to be faster, but understanding will be left behind.
And code that outran understanding comes back, with a delay, to whoever has to read it.

The first thing you need to prepare when introducing AI is not a collection of prompts.
Who will take care of the diffs that are waiting for review?
Who writes the assumptions to be passed to AI?
Where should we set boundaries that should not be broken?

That's what you have to decide.

The difference between a team that gets stuck in two weeks and a team that grows into steady operation is not the intelligence of the AI, but how the humans catch its speed.