It has been almost a year and a half since our company started using AI intensively in our everyday work. We've tested various tools, including Windsurf, GitHub Copilot, Cursor, Claude, and LibreChat, and I can now comment with some confidence on how AI models affect software engineering as a discipline.
In this article, I'll try to present an objective view of using AI for code reviews.
First, let me clarify: I'm not an AI specialist. I don't have the deep technical knowledge to discuss models or their architecture, as so many people on the Internet tend to do these days. It feels like everyone thinks they're an expert on LLMs, which obviously isn't realistic.
Code Review: A Critical Step in Software Development
Code review is a key stage in every software project, and neglecting it often leads to unpredictable and critical problems later on. Since the very beginning of the software era, reviewing written code, regardless of programming language, has been seen as a required step before the code is used in production.
Let's look at an example from history: the Apollo spacecraft software.
Hundreds of engineers worked on it during those years, creating the software without which a safe lunar landing and return would have been impossible. The Apollo guidance software, whose lunar module program was called Luminary, contained over 145,000 lines of code. Every single file included comments (see on GitHub), and the entire codebase went through multiple review and approval iterations.
Today, we have automated tools, compilers, and sophisticated software practices. But back then, every line was written, reviewed, and optimized by hand. Even in the early 1960s, processes like defining requirements, design, coding, testing, and maintenance were strictly followed.
Margaret Hamilton, who led the lab developing the Apollo flight software, once said:
"What became apparent with Apollo - though it is not how it worked - is that it is better to define your system up front to minimize errors, rather than producing a bunch of code that then has to be corrected with patches on patches. It's a message that seems to have gone unheeded - in this respect, software today is still built the way it was 50 years ago."
Because errors were found and fixed early, the system was stable enough to handle unexpected CPU overloads just seconds before the lunar landing. No software errors were reported during any of the manned Apollo missions, a remarkable testament to human precision.
The Code Review Pyramid
A well-known concept in software engineering is the Code Review Pyramid, which illustrates the relative importance of different review aspects: Code Style, Documentation, Testing, Implementation, and Functionality & Design.
From my own experience, I've always applied this approach when reviewing others' code. But the first review always happens on my own code, and it's done by me.
Before you push your changes or ask an AI to check them, revisit the pyramid yourself.
Don't just check for syntax.
Imagine you are reading it for the first time.
What is confusing? What is implicit?
I find it helpful to take a break between writing and reviewing my code. Many developers submit changes at the end of the day, but that's when fatigue leads to overlooked errors. Waiting until morning and reviewing the changes with fresh eyes dramatically improves the result.
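One practical way to make that self-review lighter is to take the purely mechanical checks off your plate before the code ever leaves your machine. Below is a minimal sketch of a pre-push git hook, assuming a Python project; the tools named (ruff, pytest) are only examples, so substitute whatever your project actually uses.

```python
#!/usr/bin/env python3
# Save as .git/hooks/pre-push and make it executable.
# Runs the mechanical checks so your own review (and any later human or AI
# review) can focus on the higher levels of the pyramid.
import subprocess
import sys

# Example commands only; replace with your project's linter and test runner.
CHECKS = [
    ["ruff", "check", "."],  # code style / lint
    ["pytest", "-q"],        # test suite
]

def main() -> int:
    for cmd in CHECKS:
        print("running:", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            print("pre-push check failed:", " ".join(cmd))
            return 1  # a non-zero exit code makes git abort the push
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

With the syntax-level noise handled automatically, the self-review can concentrate on what is confusing or implicit.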
Where Does AI Excel?
Several parts of the review process can be automated: checking code style, catching syntax errors, measuring test coverage, suggesting optimizations, and detecting errors.
AI models are exceptionally good at finding mistakes and identifying poor coding practices, and they are consistent: they tend to flag the same kinds of problems every time. When it comes to adherence to conventions or catching minor flaws, AI is laser-focused.
Moreover, AI effortlessly scales: it can review dozens of pull requests per day without fatigue, something almost impossible for a human reviewer.
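To make this concrete, here is a minimal sketch of how such an automated first pass might be wired together. It is not how any particular product works internally: `ask_model` is a placeholder for whichever model API or CLI your team uses, and the prompt is only an illustration.

```python
import subprocess

# An illustrative prompt; real review prompts are usually tuned per team.
REVIEW_PROMPT = """You are a code reviewer. For the diff below, list:
- style and convention violations
- obvious bugs or error-prone constructs
- missing or weak tests
Give each finding with its file and line."""

def collect_diff(base_branch: str = "main") -> str:
    """Return the diff of the current branch against the base branch."""
    result = subprocess.run(
        ["git", "diff", f"{base_branch}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def ask_model(prompt: str) -> str:
    """Placeholder: call whichever LLM API or CLI your team has chosen."""
    raise NotImplementedError("wire this up to your model of choice")

def review_current_branch() -> str:
    diff = collect_diff()
    if not diff.strip():
        return "Nothing to review."
    return ask_model(f"{REVIEW_PROMPT}\n\n{diff}")

if __name__ == "__main__":
    print(review_current_branch())
```

Hooked into CI or a pre-merge job, a script like this gives every pull request an instant first pass, while humans keep the final word on design.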
Here are a few clear advantages of using AI in code review:
Immediate feedback – In traditional reviews, responses might take hours or even days, leaving developers waiting. With AI, the feedback comes instantly.
No cognitive fatigue – Reviewing hundreds of lines of someone else's code can be mentally draining. AI never tires.
Productivity scaling – In large teams, human reviewers spend significant time on reviews, impacting productivity. For AI, team size doesn't matter: it performs just as quickly and effectively regardless.
A Cisco study found that reviewing more than 400 lines of code at once reduces a reviewer's ability to find bugs, with most defects discovered in the first 200 lines. This insight shaped industry practice, but AI doesn't suffer from the same limitation: it can handle large reviews without performance degradation.
The Human Mind: The Unbeatable Factor
While AI performs flawlessly on the tasks mentioned above, and I rely on it heavily, it still lacks something fundamental: understanding.
As experts have pointed out, current AI systems are, in many ways, quite limited. We're fooled into thinking they're intelligent because they handle language so well. But they don't understand the physical world. They lack persistent memory, true reasoning, and long-term planning- all crucial aspects of genuine intelligence.
Machine learning generally operates under three paradigms:
- Supervised learning
This is the classical approach. The model is trained on a dataset of examples, each with an input and the correct output (a label). For instance, if you train a system to recognize objects, you might show it an image of a table and label it "table". The model then produces its own output and receives feedback: if it's wrong, its internal parameters are adjusted. Repeating this process millions of times helps the system form strong associations between inputs and correct predictions.
- Reinforcement learning
This technique more closely resembles certain aspects of human learning. Instead of being told the exact right answer, the AI acts, observes the consequences, and receives a reward or penalty. It adjusts its future actions to maximize the long-term reward. Think of how you learned to ride a bike: through trial, error, and correction. However, this paradigm has limitations. It's inefficient and effective only in clearly defined environments (like playing chess, Go, or poker) where success metrics are known and unambiguous. In complex, real-world settings without clear feedback, reinforcement learning becomes impractical.
- Self-supervised learning
This is the foundation of the most recent revolution in AI, including large language models such as ChatGPT. Here, the system learns from unlabeled data by creating its own predictive tasks, for example trying to predict missing words in a sentence. By training on vast quantities of text, the model builds internal representations of patterns and relationships between words and concepts. This approach gives AI an impressive ability to generate coherent and context-aware language, but it still does not give the model genuine understanding or reasoning ability: it is pattern recognition, not comprehension. A toy sketch contrasting the supervised and self-supervised setups follows this list.
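It is not real training code and the data is made up; the point is only where the training target comes from in each case.

```python
# Supervised learning: a human provides the correct label for each input.
labeled_images = [
    ("photo_001.jpg", "table"),
    ("photo_002.jpg", "chair"),
]

# Self-supervised learning: the target is derived from the data itself,
# for example predicting the next word of a sentence from the words before it.
sentence = "the astronaut reviewed every line of code".split()
training_pairs = [
    (sentence[:i], sentence[i])  # (context words, word to predict)
    for i in range(1, len(sentence))
]

for context, target in training_pairs[:3]:
    print(f"given {context} -> predict {target!r}")
```

In the supervised case a person supplies the label; in the self-supervised case the label is manufactured from the data itself, which is why the approach scales to the enormous text corpora behind today's LLMs.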
Even with all this sophistication, AI remains fundamentally limited by its training data and objectives. It does not understand things as humans do; it merely models statistical relationships.
Software, by definition, is a tool that must operate according to human needs and actions, sometimes even playing a role in life-critical systems. This gap is where human consciousness and intuition remain irreplaceable.
Final Thoughts
I'll end with a visual that perfectly captures how the development landscape has shifted in the past two years.
Writing code itself has never truly been the problem. The real challenge has always been delivering error-free and dependable software.
The lesson is clear: AI is a powerful ally—especially for automating repetitive parts of the review process—but the human element remains the final safeguard of wisdom, empathy, and understanding that no algorithm can yet fully emulate.



