Arthur

Originally published at pickles.news

We Stopped Reading Each Other's Code

Last month I approved an 847-line pull request after spending maybe twenty minutes on it. Tests were green. Linter was clean. Coverage sat at ninety-one percent. Two days later something went wrong with our payment webhook in production, and when I went back to read the code that had broken, I realized I could not explain why any of it was the way it was. The commit had my name on it because I had merged it. An AI assistant had written most of the lines. I had read it the way you read a contract before clicking "I agree."

This is the part of the AI-coding story that nobody wants to talk about, because the talk is mostly about velocity. Velocity is up. Anyone using a coding assistant for thirty seconds can feel that. Code arrives faster than it ever has. The bottleneck of "person who can type" has been removed, and the chair next to mine that used to hold a pair-programming partner is now occupied by something that never tires, never gets bored, and never asks me to slow down.

What we forget, because we forget it every twenty years or so, is that code review was never really about catching bugs. Catching bugs was the side effect. The actual job of the review was to spread understanding through a team. To force at least two people to know why a piece of software did what it did. To turn one engineer's private knowledge into shared institutional memory. The bottleneck was the point.

A debt with no statement

Addy Osmani, Google's Head of Chrome Developer Experience, coined a phrase for what comes next in a March essay that has been quietly making the rounds. He calls it comprehension debt: the gap between how much code lives in your system and how much of that code anyone on your team can actually explain.

Comprehension debt is not the same as technical debt, and the difference matters. Technical debt complains. It announces itself with slow builds and tangled imports and the dread you feel when a ticket lands on the one module everyone has been avoiding for a year. You know it is there. You filed a Jira about it. You promised yourself you would refactor it next quarter, which is what we always say.

Comprehension debt is silent. The code is clean. The tests pass. The dashboards are green. Velocity is, if anything, up. The only place the debt shows up is at three in the morning, when production is down and nobody on the on-call rotation can articulate why a particular endpoint exists, which is when the debt comes due, all at once, with interest.

What the data is starting to say

It is tempting to wave this off as nostalgia from someone who liked code review better when it was slower. The data will not let us.

GitClear analyzed 211 million lines of changed code from January 2020 through December 2024. Two of their numbers are worth memorizing. The share of changes classified as refactoring fell from 25% in 2021 to under 10% by 2024. The share of code that got rewritten within two weeks of being committed nearly doubled, from 3.1% to 5.7%. We are writing more code, throwing more of it away soon after, and consolidating less of what stays. The shape of a codebase is changing under our feet.

In July 2025, METR ran a randomized controlled trial on sixteen experienced open-source developers — average five years of context on the repos they were working in — and gave them 246 real issues, half assigned to "AI tools allowed," half to "no AI tools." The developers thought the AI sped them up by twenty percent. Measured against the clock, AI tools made them nineteen percent slower. The thirty-nine-point gap between perception and reality is, for my money, the single most important number in this conversation. It tells you that the people best positioned to notice the cost cannot feel it.

And in a study Anthropic itself published, fifty-two engineers learned a new library; the AI-assisted half finished as fast as the control group but scored seventeen percent lower on a comprehension quiz the next day. Same time, less understood. The more interesting part of that study is the breakdown by how people used the tool. The engineers who treated the AI as a delegate ("just make this work") scored dramatically worse on the follow-up. The engineers who treated it as a tutor ("explain why this is a cancel scope and not a timeout") scored meaningfully better. Same model. Same prompts available. Different outcomes, depending entirely on whether you used the tool to bypass thinking or to think harder.

The two teams in the same room

If you go to r/ExperiencedDevs and read the threads about AI adoption, you find two kinds of teams. They are not divided by tools — they all have the same tools — and they are not divided by intent. They are divided by what they decided about their review standards.

One kind of team, often under pressure to ship, told reviewers to stop being so picky. AI-generated PRs went through with cursory glances. The merge button got hot. Outages got longer because nobody on call could reconstruct what the code was doing fast enough to fix it. Teams described real dollars lost to incidents they couldn't explain under pressure. The other kind of team kept their standard exactly where it was. If the PR was too big, they split it. If it was unclear, they sent it back. They merged less and more slowly, and they slept better.

Same industry, same tools, same year. The difference was a decision about whether shipped lines or shipped understanding was the unit of work.

The test trap

The most common defense — and the one I made to myself before that webhook incident — is "tests will catch it." This is a comforting story right up until you notice that the AI is now writing the tests, too.

A coverage report cannot tell you that the payment processor occasionally sends duplicate webhooks. A green CI run cannot tell you that the third-party API renamed paid to payment_confirmed three weeks ago and your handler is now silently returning 200 to events it does not recognize. You cannot write a test for behavior you haven't thought to specify. AI-written tests are very good at confirming that the AI's implementation matches the AI's understanding. Whether either of those matches reality is a separate question, and it is the only question that matters in production.
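To make that concrete, here is a minimal sketch of a handler with both hazards in view. None of this is the incident code: Flask, the route, the field names, and the in-memory dedupe set are all invented for illustration; only the rename and the duplicate deliveries come from the scenario above.

```python
# A sketch of both hazards, not the incident code. Flask, the route,
# the field names, and the in-memory dedupe set are all invented here.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Stand-in for a durable idempotency store.
SEEN_EVENT_IDS: set[str] = set()

def handle_payment_confirmed(event: dict) -> None:
    print("marking order paid:", event.get("order_id"))  # elided

HANDLERS = {
    # The provider renamed "paid" to "payment_confirmed". Until this
    # table was updated, incoming events matched nothing below.
    "payment_confirmed": handle_payment_confirmed,
}

@app.post("/webhooks/payments")
def payment_webhook():
    event = request.get_json(silent=True) or {}

    # The processor occasionally delivers the same webhook twice, so
    # handling has to be idempotent.
    event_id = event.get("id")
    if event_id in SEEN_EVENT_IDS:
        return jsonify(status="duplicate"), 200
    if event_id is not None:
        SEEN_EVENT_IDS.add(event_id)

    handler = HANDLERS.get(event.get("type"))
    if handler is None:
        # The trap: a blanket 200 tells the provider "handled", so a
        # renamed event type disappears without a trace. Log loudly,
        # or reject and alert, so the gap is visible.
        app.logger.error("unrecognized webhook type: %r", event.get("type"))
        return jsonify(status="ignored"), 200

    handler(event)
    return jsonify(status="ok"), 200
```

An AI-written test suite for this version would happily pass, asserting that unknown events return 200, because that is what the implementation does.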

What we owe each other

The honest version of all this is that comprehension debt is not new. I went back to a discount-calculation module I had written by hand two years ago, before any of this was an option, and asked an assistant to walk me through it. It found a bug — a precedence issue that had been there since the first commit, two years in production, never triggered. I had forgotten how my own code worked. The author was me. The author had also left.
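The module itself is not worth reprinting, but the class of bug is easy to sketch. Here is a hypothetical reconstruction, with every name invented: in Python, "and" binds tighter than "or", so one missing pair of parentheses quietly widens a condition.

```python
# Hypothetical reconstruction; the real module isn't shown in this post.
def discount_rate(is_member: bool, has_coupon: bool, flash_sale: bool) -> float:
    # Intended: members get 20% off when they have a coupon or when a
    # flash sale is running, i.e. is_member and (has_coupon or flash_sale).
    # Written: "and" binds tighter than "or", so this parses as
    # (is_member and has_coupon) or flash_sale, and during a flash sale
    # every customer gets the member rate. If flash sales never run,
    # the bug never triggers, and it can sit in production for years.
    if is_member and has_coupon or flash_sale:
        return 0.20
    return 0.0
```

No test failed in two years because no test specified the rare case, which is the same trap the previous section described.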

Code without an author is, on reflection, what most code becomes. The AI didn't invent that condition; it accelerated it until we couldn't ignore it anymore. Which means the response is also not new. We already know how this works. Code review is not optional. Pull requests should be small. The reviewer's question is not "did this pass?" but "could I, the reviewer, fix this at three in the morning if I had to?" If the honest answer is no, the right move is to send it back, no matter how green the CI is, no matter who or what wrote it.

The hardest part of any engineering practice has always been holding the line on the boring parts when the boring parts are the ones being optimized away. AI coding tools are not going anywhere. Neither, hopefully, is the obligation we have to each other to ship software we can still explain.
