I stopped thinking of AI as a tool.
Copilot, Cursor, Claude Code — once I started treating these not as "convenient assistants" but as "talented teammates," the way I develop software changed entirely.
And from that perspective, a simple question emerges.
If an exceptional coder and an exceptional reviewer were already on your team, where should the team focus its resources?
Don't Treat AI as Special — Just as an Exceptionally Talented Human
First, an important premise.
AI is not special. Treat it the same as a highly skilled human engineer.
AI can already write code, open pull requests, and leave review comments. The range of what it can do is expanding rapidly, and the justification for treating it differently "because it's AI" is fading. Think of it simply as having an exceptionally talented human engineer join the team.
With that premise in place, the opening question takes on meaning.
The Answer Is: Focus on Review
The conclusion first.
The team should focus on the review side — the process of questioning answers.
The reason is simple.
Generation (opening PRs) — let AI go all out.
Writing code is the process of "producing an answer given a set of requirements." This is where AI excels, and there's little reason for humans to run alongside it.
Review (questioning answers) — requires a perspective outside the generation context.
This is the core insight. As long as you remain inside the same context in which the code was generated, you cannot question it from the outside. Conversely, even the same model can function as a reviewer if it operates in a fresh session, disconnected from the generation context. Being outside the context is precisely what makes it possible to question the answer. Anthropic's Code Review, discussed later, is built on this same principle. Whether AI or human — "being outside the generation context" is the condition for effective review.
"Does this code work?" is not the question. "Was this the right implementation?" "Is this tradeoff really justified?" — raising these questions requires a perspective that differs from the one that produced the code. Focusing resources on review is what makes these questions possible.
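The "outside the generation context" condition can be made concrete as a rule about what the reviewer session is allowed to see. A minimal sketch, assuming a generic chat-message shape (no specific vendor API):

```python
# Sketch: the reviewer runs in a fresh session that is deliberately
# cut off from the generation context. Message shapes are illustrative.

def build_generation_context(requirements: str) -> list[dict]:
    """The session that produces the code sees the full requirements."""
    return [
        {"role": "system", "content": "You are implementing a feature."},
        {"role": "user", "content": requirements},
    ]

def build_reviewer_context(diff: str) -> list[dict]:
    """A fresh session: the reviewer sees only the resulting diff,
    never the conversation that generated it."""
    return [
        {"role": "system", "content": "Question this change from the outside."},
        {"role": "user", "content": diff},
    ]

gen = build_generation_context("Add rate limiting to the API")
review = build_reviewer_context("diff --git a/api.py b/api.py ...")

# The reviewer context carries none of the generation history,
# so nothing from the original framing can anchor its judgment.
assert all("rate limiting" not in m["content"] for m in review)
```

The point of the structure is that even the same model, given only `review`, has to justify the diff on its own terms rather than inherit the assumptions baked into `gen`.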
Maybe Not a Coincidence — What Anthropic Is Doing
On March 9, 2026, Anthropic released Code Review for Claude Code.
The feature automatically dispatches multiple AI agents in parallel whenever a PR is opened, detects bugs, ranks them by severity, and feeds the results back into GitHub. Anthropic applies this system to nearly every PR internally, and the proportion of PRs receiving substantive review comments jumped from 16% to 54% as a result.
What's worth noting is the design philosophy. The tool reportedly focuses on logic errors, not style or naming conventions. Could this reflect a deliberate choice to leave style and naming to cheaper mechanisms, such as linters and formatters, so that AI review capacity goes only where it matters?
This might be an embodiment of the idea: "rather than loading the generation side with detailed instructions, tighten the exit gate (review)."
Overloading CLAUDE.md Backfires
Meanwhile, research is emerging that questions the approach of controlling generation through detailed instructions.
A paper published on arXiv in February 2026, "Evaluating AGENTS.md" (ETH Zurich et al.), found that loading context files like CLAUDE.md or AGENTS.md with detailed instructions reduces task success rates and increases inference costs by more than 20%.
AI tries to follow instructions, but in doing so generates unnecessary exploration and testing that gets in the way of the actual task. The paper's conclusion is simple: "Unnecessary requirements make tasks harder. Context files should contain only minimal requirements."
Keeping instructions minimal — this direction may be empirically supported as well.
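As a purely hypothetical illustration of what "minimal requirements" might look like in practice, a context file can be trimmed to only what the agent cannot discover on its own:

```
<!-- Hypothetical minimal CLAUDE.md: only non-discoverable requirements -->
Run `make test` before opening a PR.
Never commit directly to `main`.
```

Style rules, naming conventions, and formatting preferences are deliberately absent: those are exactly the instructions the paper suggests cost more than they return, and the next section argues they belong in automated tooling anyway.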
Skills Fall Into the Same Trap as CLAUDE.md
Agent Skills has been gaining attention recently. It's an open format for giving agents procedural knowledge via SKILL.md files, enabling capability expansion.
There are valid uses: adding capabilities that agents don't have out of the box, such as generating PPTX files or operating internal proprietary tools. In that sense, it differs from CLAUDE.md.
But misused, it falls into exactly the same trap.
Consider writing a rule like "always include the time, not just the date, in Rails migration filenames" as a Skill. This is just injecting a corrective instruction as a workaround for unstable output — structurally no different from writing it in CLAUDE.md.
A practical rule of thumb: if it can be replaced by a linter or automation tool, it shouldn't be a Skill.
| Skill type | Verdict |
|---|---|
| Cannot be replaced by linters etc. (capability expansion) | Valid |
| Can be replaced by linters etc. (rule correction) | Same trap as CLAUDE.md |
When you find yourself wanting to create a Skill to enforce a rule, first ask whether a linter or automation tool could handle it instead. If it can, it belongs at the exit gate — not injected into the generation side.
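The migration-filename rule above is exactly the kind of thing a small CI check can own instead of a Skill. A minimal sketch (the 14-digit `YYYYMMDDHHMMSS` prefix is the standard Rails migration convention; the check itself is illustrative):

```python
import re

# Rails migration files are conventionally prefixed with a 14-digit
# UTC timestamp (YYYYMMDDHHMMSS). An 8-digit prefix means someone
# used the date only, which is the mistake the rule targets.
FULL_TIMESTAMP = re.compile(r"^\d{14}_[a-z0-9_]+\.rb$")

def migration_filename_ok(filename: str) -> bool:
    """Return True if the migration filename carries a full timestamp."""
    return bool(FULL_TIMESTAMP.match(filename))

print(migration_filename_ok("20260214093045_create_users.rb"))  # True
print(migration_filename_ok("20260214_create_users.rb"))        # False
```

Wired into CI, this enforces the rule deterministically at the exit gate, with zero tokens spent and zero drift in how the rule is applied.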
Scale Benefits Belong on the Review Side
AI scales. That's true for generation too — but I believe scale truly pays off on the review side.
On the generation side, there's a "lottery approach": keep asking AI to regenerate until you get a good result. More attempts eventually yield better output.
That said, I personally accept about 90% of AI output as-is. I think it's because I use AI as a collaborative partner rather than giving it detailed instructions. If generation quality is already high enough, there's no need to rely on the lottery approach. If you're going to invest scale somewhere, review is the more rational choice.
What happens when you invest scale in review? Multiple AIs independently question the same PR. Each raises questions from different angles, catching what others miss. Anyone who has worked in code review knows the feeling of "I want as many eyes on this as possible" — review works the same way. The more reviewers, the lower the probability of something slipping through.
Anthropic itself has adopted a design for Code Review that runs multiple agents in parallel, then cross-checks to filter out false positives. Could this also be seen as an embodiment of the idea that scale belongs on the review side?
The lottery approach in generation just accumulates cost. Scale in review has intrinsic value: it reduces what gets missed.
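The cross-check idea can be sketched as a simple consensus filter: run several independent review passes and keep only the findings that more than one pass reports. This is a generic pattern, not Anthropic's actual implementation:

```python
from collections import Counter

def consensus_findings(review_passes: list[set[str]], min_votes: int = 2) -> set[str]:
    """Keep findings reported by at least `min_votes` independent passes,
    treating single-pass findings as likely false positives."""
    votes = Counter(f for findings in review_passes for f in findings)
    return {finding for finding, n in votes.items() if n >= min_votes}

passes = [
    {"off-by-one in pagination", "unchecked None"},
    {"off-by-one in pagination", "stylistic nit"},
    {"off-by-one in pagination", "unchecked None"},
]
print(consensus_findings(passes))
# The two findings with multiple votes survive; the single-vote nit is filtered out.
```

The tradeoff is explicit: raising `min_votes` cuts noise at the cost of occasionally dropping a real issue only one pass noticed, which is exactly the knob a review pipeline wants to expose.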
Human Eyes Still Have Unique Value
Does this mean humans are unnecessary if AI handles review? I don't think so.
Business judgment, implicit team context, "this is technically correct but is it acceptable for our organization?" — these are areas where humans currently have an edge over AI. That's why there's still good reason for humans to actively invest time in review.
Scale AI in review, and have humans actively invest in review too. Increasing both kinds of "eyes" raises the quality of the team's answers.
Summary
| Process | Who does it | What scale means |
|---|---|---|
| Generation (opening PRs) | Let AI go all out | More output, but no gain in quality |
| Review (questioning answers) | Invest both AI and humans | Quality improves |
When AI joins the team, the team's job is not to "beat AI at generation." It's to question the answers AI produces — together.
Closing
Behind Anthropic's release of Code Review lies a real problem: AI is generating so much code that review has become a bottleneck. Code output per engineer at Anthropic has grown 200% over the past year.
This trend won't stop. Generation will keep increasing, and the importance of review will only grow.
If you truly trust AI as a teammate, let it go all out producing answers. And as a team, concentrate your resources on questioning those answers.
Concretely: adopting tools like Code Review, enforcing style conventions with linters and automation, and making sure humans have time for review — these three are a good place to start.
That's my conclusion for now.