tl;dr: Code review was never just about finding bugs. As AI accelerates how fast our codebases grow, we need better tools to make sure we don't lose our understanding of them.
Recently, I was really impressed by an article posted on X by Dominic Elm.
In particular, this quote absolutely drew me in:
Code review was never really about catching bugs. It was about building shared understanding, a team's collective mental model of what the system does, why it works that way, and how it can safely change
The article went on to detail the challenges created by AI making it so easy to generate code but not to review it. He had a few conclusions at the end, which I'll summarise here:
Make sure to apply AI code review right away, so humans can focus on the more important architectural issues.
Review AI-generated code with a different lens in mind than human-generated code.
Put more emphasis on the author proving they understand their own code.
I thought these were fantastic and just wanted to add a few more ideas to that based on my own experience.
Code Review as Team Sport
The quote about code review being for shared understanding that resonated with me so much reminded me of this incredibly hard-to-find blog post by Nicholas C. Zakas (thank you, Internet Archive). Nicholas is the creator of ESLint and a prolific JavaScript author. The article is called "Effective learning through code workshops".
What he describes in the beginning is the way code reviews were performed, at very serious companies, before GitHub and similar tools for async review became popular. Code reviews were literally meetings where the team would review code that had been printed out on pieces of paper. Everyone would be in there looking for bugs by hand, writing notes with pens and pencils, with the person who wrote the code often looking quite uncomfortable and nervous as they received all this negative feedback. That approach obviously couldn't scale and was eagerly replaced by the async code review model that was likely popularised by the proliferation of open source and GitHub.
Nicholas took that old-school approach and tried to modernise it for Box. He called them "code review workshops": each week, someone would be assigned to look at code they didn't write and explain it to the team. The whole team would walk through that code together, identify patterns and practices they liked and disliked, and make notes for future improvement. Confusing code is really easy to notice and dissect in this type of meeting. Context is shared with the team. You're not reviewing a single pull request with the whole group; it's usually a feature or file that multiple individuals might have contributed to.
When I was at PayPal, I had a fantastic engineering manager named Rose Elliot, who used this technique as a way to strengthen and uplevel our team.
It was a great way for people to learn about parts of the codebase that weren't theirs. And it promoted a culture where we were continually improving and documenting our code and best practices. During each session, we would make notes about what we wanted to change, and create issues in our task management system to follow up on those changes. We still leaned into asynchronous code reviews. We also spent a lot of time on linting and automating the identification of common mistakes. But taking the time to review code together regularly helped us build a shared understanding of both how things worked today and what we wanted our app to look like in the future.
One thing I like about this approach is that it doesn't slow down progress: you don't have to wait for an in-person code review meeting to get the code merged or even shipped to production. But it provides a feedback mechanism for the whole system, making sure you're going in the right direction and keeping everyone more informed.
I don't know if this is the solution for dealing with the massive amount of code that we're getting in this AI/agentic-engineering era, but it feels like it could help.
Reducing Cognitive Load
Reviewing giant pull requests is never easy. And the code review article from Dominic goes into various stats on how the longer the pull request, the more likely a reviewer is to make mistakes or just gloss over it without giving any real feedback. This should be pretty relatable for most devs. So how can we reduce the cognitive load for reviewers?
Smaller Pull Requests
In the article, it was described how smaller pull requests help reduce cognitive load. It also mentioned a code review tool called Graphite, which uses an approach called stacked diffs to make it easier to look at the difference between individual commits and review them separately. It breaks down a complex pull request into smaller pieces that can be reviewed individually. It's a slightly different workflow than traditional GitHub pull requests, but it's backward compatible and looks promising.
Outside of using a separate tool, I found recently when using Claude Code that instead of executing one large plan, it can help to break your work down into smaller plans. We talked through the work that needed to be done and created a large plan. I set that aside and then had Claude execute smaller individual plans, committing the results each time, or in some cases opening a separate pull request for each smaller implementation. Just because Claude can do a whole bunch of stuff at once doesn't mean you shouldn't break it down into easily understandable chunks.
Diagrams and Documentation
One of the big suggestions in Dominic's article was asking the author to include diagrams in their pull requests so that they could sort of prove that they understood the system impact of their code changes. I've been thinking a lot about how diagrams and visual representations of code can help us more quickly grasp what's going on. Line-by-line diffs are really hard to reason about, and it's clear that more advanced semantic diffing and visual diffing can help us understand these things more quickly. But one thing I found interesting was that he said at his work, people just used AI to generate the diagrams, and it didn't really help very much. Developers weren't actually internalising what was happening.
Similarly, Mikey See from Convex tried out nine different code review tools and made a video with his results.
One of the tools, Sorcery AI, automatically includes diagrams with every pull request.
I've been fairly convinced that this kind of thing will really help engineers better understand what's happening in code changes. But in his review, he said it was just noise and was completely ignored, every single time.
Is there room for this kind of automatic diagram generation? I think so. I think we just haven't found the right place for it yet. One suggestion might be to automatically include documentation with code. Again, with the emphasis that even though humans maybe didn't write the code, hopefully there will be something like a README file with each component or each feature folder that has some really good details about how it works and what's going on, so humans can get up to speed very quickly.
With React components, it would be super easy for AI to always create Storybook examples for every single component you build, and then maybe we could even render those stories directly in the code review. Right? What if you could actually see the component right there alongside the code?
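As a rough sketch of what that could look like: a CSF-style story file that an AI agent might generate alongside a component. The `Button` component, its props, and the story names here are all made up for illustration; in a real story file the `meta` and story objects would be `export`ed so Storybook could pick them up.

```javascript
// Hypothetical Button.stories.js -- a CSF3-style story file an agent could
// generate for every component. Button is a stand-in (it renders to a plain
// string here so the sketch is self-contained; a real one would be a React
// component).
const Button = ({ label, variant }) => `<button class="${variant}">${label}</button>`;

// In a real story file these would be `export default meta;` and
// `export const Primary = ...` so Storybook can discover them.
const meta = { title: 'Components/Button', component: Button };

const Primary = { args: { label: 'Save', variant: 'primary' } };
const Destructive = { args: { label: 'Delete', variant: 'danger' } };

// A review tool could render each story to show a live preview next to the diff:
const preview = meta.component(Primary.args);
console.log(preview); // "<button class=\"primary\">Save</button>"
```

The point isn't the stories themselves; it's that a review tool could render every story in the diff, so the reviewer sees the component, not just the code.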
Semantic Diffs & Code Graphs
I can still think of a lot of use cases for visuals and graphs that could help, even if the current implementations aren't that useful. Just imagine a simpler way of looking at a diff between, say, two React components. The things that matter the most are prop changes, state changes, and hook changes. Clearly line-by-line diffs aren't that useful here. Are we changing the types of the props? Are we changing how we're managing state? Is there a better way we could visualise that? Outside of React, think about, say, a function: the inputs and outputs are the things that matter most, right? Especially if the function is well tested. Can we surface that more clearly?
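A toy sketch of the idea: given two plain objects describing a component's prop "shape" (name mapped to type), report what was added, removed, or retyped. A real tool would derive these shapes from TypeScript types or an AST; the `UserCard` props below are hand-written just to show the kind of output we'd want from a semantic diff.

```javascript
// Toy semantic diff over a component's prop shape (name -> type).
// A real tool would extract these shapes from the code itself.
function diffProps(before, after) {
  const changes = [];
  for (const name of Object.keys(before)) {
    if (!(name in after)) changes.push(`removed prop "${name}"`);
    else if (before[name] !== after[name])
      changes.push(`prop "${name}": ${before[name]} -> ${after[name]}`);
  }
  for (const name of Object.keys(after)) {
    if (!(name in before)) changes.push(`added prop "${name}"`);
  }
  return changes;
}

// Hypothetical before/after prop shapes for a <UserCard> component:
const beforeProps = { userId: 'string', showAvatar: 'boolean' };
const afterProps = { userId: 'number', showAvatar: 'boolean', onClick: 'function' };
console.log(diffProps(beforeProps, afterProps));
// -> [ 'prop "userId": string -> number', 'added prop "onClick"' ]
```

Two lines of output instead of a wall of red and green: that's the whole pitch.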
I gave a talk ten years ago at UtahJS where I spoke about using abstract syntax trees to analyse your code systematically. And one of the examples I gave is using ASTs to generate meaningful diffs.

There's a tool I found recently called Sem that applies some of these approaches.

Imagine something like this, but with arrows and colour coding and everything to make it really obvious what the type looked like before and what the type looked like after.
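In the same spirit, here's a toy sketch that diffs just the parameter list of two versions of a function and ignores the body entirely. The regex extraction is deliberately naive, and the `fetchUser` snippets are made up; a real tool would walk an actual AST, as in the talk.

```javascript
// Naive "signature diff": pull out the parameter list with a regex and
// compare it, ignoring everything else. A real tool would use a parser.
function params(src) {
  const match = src.match(/\(([^)]*)\)/);
  return match ? match[1].split(',').map((p) => p.trim()).filter(Boolean) : [];
}

function diffSignature(oldSrc, newSrc) {
  const oldParams = params(oldSrc);
  const newParams = params(newSrc);
  return {
    added: newParams.filter((p) => !oldParams.includes(p)),
    removed: oldParams.filter((p) => !newParams.includes(p)),
  };
}

// Hypothetical before/after versions of the same function:
const oldSrc = 'function fetchUser(id, options) { /* ... */ }';
const newSrc = 'function fetchUser(id, options, signal) { /* ... */ }';
console.log(diffSignature(oldSrc, newSrc));
// -> { added: [ 'signal' ], removed: [] }
```

If the function is well tested, "one parameter added, none removed" might be all a reviewer really needs to see.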
AI code review right now can help somewhat as fancier linting, and potentially with matching architectural patterns, but AST-based diffs and semantic diffing might help us highlight the things that matter most in diffs, so that humans can more quickly grok what's going on. This is absolutely going to be necessary in the future if we're going to stay ahead of this mountain of AI-generated code.
Code Maps & Blast Radius Detection
We're also seeing some movement around graphing code and understanding the blast radius of a change. I think this is also really interesting because, to some extent, with AI code review there might be parts of the code where we really don't care about the details. Imagine there's a small feature not on a critical path. It might be good to know that no matter what's in that code, it's not going to affect anything else. Tools like that might help us narrow the scope of what human code review needs to cover, or at least emphasise where we really should pay attention to a code change.
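As a sketch of what "blast radius" could mean concretely (not how any particular tool does it): given a map of which modules import which, walk the reverse edges to find everything transitively affected by a change. The module names are made up.

```javascript
// Toy blast-radius calculation: deps maps each module to the modules it
// imports; we invert the edges and walk them breadth-first to find every
// module that could be affected by a change.
function blastRadius(deps, changed) {
  // Invert the edges: who imports each module?
  const dependents = {};
  for (const [mod, imports] of Object.entries(deps)) {
    for (const imp of imports) (dependents[imp] ||= []).push(mod);
  }
  const affected = new Set();
  const queue = [changed];
  while (queue.length) {
    const mod = queue.shift();
    for (const dep of dependents[mod] || []) {
      if (!affected.has(dep)) {
        affected.add(dep);
        queue.push(dep);
      }
    }
  }
  return [...affected];
}

// Hypothetical module graph for a small app:
const deps = {
  'app.js': ['checkout.js', 'banner.js'],
  'checkout.js': ['cart.js'],
  'banner.js': [],
  'cart.js': [],
};
console.log(blastRadius(deps, 'banner.js')); // -> [ 'app.js' ]
console.log(blastRadius(deps, 'cart.js'));   // -> [ 'checkout.js', 'app.js' ]
```

A review tool with this information could tell you that a change to `banner.js` can't touch checkout, while a change to `cart.js` hits the critical path and deserves a careful look.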
There's a tool in development right now called GitNexus that attempts to do that, but I'm sure there are others as well.
GitKraken also offers "code maps" and other visualizations:

Conclusion
Code review kind of sucks right now. It's so easy to generate code that looks good. But it's hard to wrap our heads around the impact of those changes. We need to adopt better processes and tools to help our engineers quickly get up to speed in understanding the changes happening in our codebases.
Here are a few suggestions that build on Dominic's article that I think might help.
We should take another look at Team Code Review workshops. With AI doing most of the coding these days, who's going to get offended if you don't like their implementation? Let's do it. Modified for the hybrid and remote workplace, of course.
We need to keep down the size of our code reviews; stacked diffs might be a real solution there. Or even just make sure your AI agents follow some simple guidelines on how frequently they commit changes.
We need better tools
There are literally dozens of VC-funded AI code review tools duking it out in the marketplace. While I think they're becoming ever more essential at helping us find bugs in our apps, we need to remember that code review is also about "building shared understanding." These tools are not doing that, and they really aren't trying to. We need new code review tools that focus on the other side of code review: not just finding bugs, but helping the team understand what it is they're putting into their codebase, and guiding that massive flow of changes in the right direction.