Part 1 of 5 in The New Engineering Contract — what it means to lead engineers when AI is doing more of the coding.
SWE-CI tested 18 AI models on codebases evolving across 71 consecutive commits. Most broke something on commit 47 that they'd already broken on commit 1. That's not an intelligence problem. That's a learning system that isn't learning.
A paper made me uncomfortable this month.
Not because of what it found about AI. Because of what it revealed about how I think about my own work.
The paper is SWE-CI, published March 4, 2026 by researchers at Sun Yat-sen University and Alibaba Group. It tested 18 AI models across 100 real codebases — not single bug fixes, but 71 consecutive commits of genuine evolution. The core finding: most state-of-the-art models have a zero-regression rate below 0.25. Three out of four times, the agent fixed something and silently broke something else downstream.
I read that and thought: that's a learning problem, not a coding problem.
What the paper actually tests
Most benchmarks ask: can an AI fix this bug? SWE-CI asks a harder question.
"SWE-CI moves beyond fixing individual bugs and instead focuses on the evolutionary trajectory between two commit versions."
— SWE-CI paper, Chen et al., 2026
The benchmark covers 100 tasks, each spanning an average of 233 days and 71 consecutive real commits. Agents must navigate a full CI loop — generating requirements, modifying source code, running tests — iteratively, not in a single shot. That's the difference between a sprint task and a six-month project. The paper is evaluating the second thing.

Figure 1 from the paper: SWE-CI's Architect–Programmer dual-agent evaluation protocol. The agent must execute a CI-loop across 71 consecutive commits — not patch a single bug in isolation.
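To make the shape of that loop concrete, here is a minimal sketch. The names and types are invented; this isn't the paper's harness (which is Python) and it doesn't compute the paper's metrics. The only point it makes is that evaluation happens per commit, with tests after every step.

```typescript
// Schematic only: the loop shape the paper describes, with invented names.
// Not the benchmark's actual harness or its scoring.

interface Repo {
  applyPatch(patch: string): void;
  runTests(): { failures: string[] };
}

// Stand-ins for the Architect and Programmer agents: one turns the next real
// commit into a requirement, the other turns the requirement into a patch.
type Architect = (targetCommit: string) => string;
type Programmer = (requirement: string) => string;

// Evaluation is a loop over consecutive commits, with tests run at every step.
function runCiTrajectory(
  commits: string[], // ~71 consecutive real commits per task, on average
  repo: Repo,
  architect: Architect,
  programmer: Programmer,
): string[][] {
  const failuresPerCommit: string[][] = [];
  for (const commit of commits) {
    const requirement = architect(commit);   // "what should change and why"
    const patch = programmer(requirement);   // "here is the change"
    repo.applyPatch(patch);
    failuresPerCommit.push(repo.runTests().failures);
  }
  return failuresPerCommit; // what broke, commit by commit
}
```

Single-shot benchmarks collapse that loop to one iteration. SWE-CI keeps all of it.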
I have one signal I've used for years to tell whether someone on my team is actually growing: are they making different mistakes?
Make the same mistake twice and I'm concerned. Three times and I have a conversation — not a performance conversation, a diagnostic one. I want to understand the mechanism. Did the signal not reach them? Did they receive it and not act on it? Did they act on it and still land in the same place?
The answer changes everything I do next. A signal that didn't reach someone is an infrastructure problem — maybe they were left out of the right post-mortems, or the runbook is wrong. A signal received but not acted on is a motivation or attention problem. A signal acted on but still producing the same failure is a mental model problem — they changed the surface behaviour without touching the root cause.
Make the same mistake ten times and none of those explanations hold. That's carelessness or disengagement, and I treat it differently.
The same mistake twice is entropy. A new mistake is evidence of a mind moving forward.
I didn't always run this diagnostic. At Medibuddy, we had a recurring 401 issue — users being logged out mid-flow in the webview even when they were still logged into the native app. The code review instruction was explicit: handle 401 universally, refresh the token, add exponential backoff, apply it regardless of whether the user came from Android, iOS, or web. One engineer fixed it in the obvious flow. I reviewed the PR, it looked right, and I moved on. Three weeks later, the same incomplete pattern surfaced in a different flow. Same 401. Different screen.
I had reviewed the output, not diagnosed the understanding. They'd absorbed the instruction for one case. The mental model hadn't transferred. That's not a skill failure. That's a learning failure. It has a specific shape.
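For readers who want the concrete version of "handle 401 universally": a minimal sketch of that fix living in one shared wrapper rather than one flow. The endpoint, retry count, and backoff values are placeholders, not the Medibuddy implementation.

```typescript
// Minimal sketch of a "handle 401 universally" wrapper. Placeholder names and
// values, not production code. Every request goes through this one function,
// so the refresh-and-retry behaviour lives in exactly one place.

const MAX_ATTEMPTS = 3;

// Placeholder for however the app actually refreshes its session.
async function refreshToken(): Promise<void> {
  await fetch("/auth/refresh", { method: "POST", credentials: "include" });
}

const delay = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function apiFetch(
  input: RequestInfo,
  init?: RequestInit,
): Promise<Response> {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const response = await fetch(input, init);

    // Anything other than a 401 goes straight back to the caller.
    if (response.status !== 401) return response;

    // On 401: refresh the token, back off exponentially (500ms, 1s, 2s), retry.
    await refreshToken();
    await delay(500 * 2 ** attempt);
  }
  // Still unauthorised after the retries: let the caller see the failure.
  return fetch(input, init);
}
```

The point isn't the retry math. It's that the behaviour exists in exactly one place, so it can't be fixed in the obvious flow and quietly missed everywhere else.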
AI agents have the same shape
Now look at most AI agents. They fail the same way on commit 47 that they did on commit 1. There's no diagnostic conversation. No signal-to-action loop. No mechanism to distinguish "I didn't receive the signal" from "I received it and didn't know what to do with it." The agent just proceeds. Same failure pattern, new commit.
The paper formalises this with EvoScore:
"Good maintenance not only ensures functional correctness of current code, but minimizes difficulty of keeping code correct."
— SWE-CI paper, Chen et al., 2026
EvoScore doesn't ask whether an agent passes tests. It asks whether passing today's tests makes tomorrow's tests easier or harder. An agent that hardcodes an assumption — true right now — passes commit 1 and silently poisons commit 12. An agent that fixes the underlying abstraction makes the next three commits cleaner.
That's the same thing I'm measuring when I track whether an engineer makes different mistakes — are their decisions compounding toward something, or just recurring?
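A toy sketch of what that hardcoded assumption looks like in code. The currency scenario and the function names are mine, not the paper's.

```typescript
// Hypothetical illustration, not taken from the benchmark.
// Commit 1 needs prices displayed. This passes commit 1's tests and quietly
// hardcodes the assumption that every price is in INR.
export function formatPrice(amount: number): string {
  return `₹${amount.toFixed(2)}`;
}

// The assumption stays invisible until, say, a later commit adds USD billing.
// This version keeps the next commits cleaner: the assumption is explicit and
// confined to one default parameter instead of baked into a string.
export function formatPriceInCurrency(
  amount: number,
  currency: string = "INR",
): string {
  return new Intl.NumberFormat("en-IN", { style: "currency", currency }).format(amount);
}
```

Both versions pass commit 1. Only one of them leaves commit 12 easier.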

Figure 2 from the paper: Model leaderboard measured by Average Normalized Change (ANC). Only the Claude Opus series exceeds a 50% zero-regression rate. Every other model falls below 25%.
I've been building against this failure mode for years. At Medibuddy, we made a deliberate platform shift: migrate from AngularJS to React, move away from native apps toward a unified web layer — an NX monorepo with shared libraries owning the hard parts. Authentication flows. The native-web bridge. Event contracts. The component layer. Every product team built on those blocks rather than rebuilding them. The kind of investment big tech formalises as internal developer platforms or design systems. We called it Web LEGO.
The design principle wasn't elegance. It was familiarity. If something breaks, it breaks the same way for everyone. Familiar failures get diagnosed faster. Familiar failures get fixed faster. The platform aged well not because it was clever, but because it stopped surprising us.
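As a hypothetical illustration of one of those blocks (the paths, event names, and dispatch mechanism are invented, not the actual Medibuddy libraries):

```typescript
// Hypothetical sketch of one "LEGO block": a shared event contract that every
// product team imports instead of redefining.
// libs/shared/bridge/src/events.ts
export type BridgeEvent =
  | { type: "AUTH_TOKEN_EXPIRED"; source: "android" | "ios" | "web" }
  | { type: "PAYMENT_COMPLETED"; orderId: string }
  | { type: "NAVIGATION_REQUESTED"; route: string };

// One dispatcher shared by every flow. If the bridge breaks, it breaks here,
// identically, for every team. That is what makes the failure familiar.
export function sendToNative(event: BridgeEvent): void {
  window.postMessage(JSON.stringify(event), "*");
}
```

The value of a block like that isn't cleverness. It's that there is exactly one of it, so when it fails, every team is debugging the same failure.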
But I couldn't tell you that as a number. I could feel it — maintenance windows stopped appearing in my calendar, teams stopped fearing release Fridays — but I had no score. No rate. No proof.
The clearest signal came from outside engineering entirely. After one performance optimisation, our CMO passed feedback to our CTO, who passed it to me: "The Android app feels faster." My dashboard showed nothing. API response times flat. Error rate flat. Crash rate flat. But a user felt something, and that feeling travelled through the C-suite before it reached the people who built it.
That is the measurement gap. The best systems earn trust so thoroughly they bypass your instruments entirely.
The question SWE-CI is asking
The paper has limits. 100 repositories, Python only, no human baseline. Lehman's Laws — which it cites as foundational — were social observations from IBM's OS/360 system in 1980, and Lehman himself later clarified they should be read as social-science laws, not physical constants. EvoScore will be gamed — or transcended. As agentic coding shifts from single-shot generation to continuous autonomous loops across commit timelines, the next wave of models will be optimised for exactly this trajectory. The benchmark becomes the floor, not the ceiling. The same pattern played out with SWE-bench, compromised within 18 months of release. That evolution won't dissolve the learning problem. It will make it harder to see.
But the question it's asking is the right one.
Michael Truell, CEO of Cursor, posted this in January 2026:
"We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week. It's 3M+ lines of code across thousands of files... It kind of works!"
The Register called it shoddy code at scale. Both descriptions are accurate. It passed commit 1. Nobody knows what it looks like at commit 47 — because it was never built to reach commit 47. That's not a failure of AI capability. That's a failure of what we decided to measure.
You can't fix what you aren't tracking. I learned that watching engineers make the same mistake twice.
The uncomfortable truth
Here's the uncomfortable truth I can't argue my way around.
Most SOTA models have a zero-regression rate below 0.25. That number hasn't moved significantly across the frontier models I use today. Which means if I take my hands off the wheel — merge code I haven't read, deploy features I haven't traced the assumptions on — I'm accepting a 75% chance of silent breakage downstream.
That's not a reason to stop using AI. It's a reason to stay in the loop.
I use it to draft TRDs — not to write them, but to surface the assumptions I'd have held silently. I use it as a sounding board before committing to a direction. I use it to prototype fast, then review every prototype for what it assumes before it goes near production. Fast code carries fast assumptions. Speed and carelessness travel together.
The loop isn't friction. It's the only thing converting AI's output speed into engineering quality.
Blind AI coding isn't a productivity strategy. It's entropy at machine speed.
One changed question
After reading this paper, I changed one question in how I review AI-generated code.
Before: does this pass the tests?
After: what does this fix assume — and will that assumption still hold after the next three features?
That question isn't in most PR templates. It should be — for AI-generated code. And honestly, for human-written code too.
The difference is that a person, over time, can internalise that question and start asking it themselves. The learning compounds. AI agents right now don't have that mechanism. Every commit is day one. The agent that fixed the 401 in flow A has no memory of flow B. No diagnostic loop. No compounding.
That's what SWE-CI is measuring. Not whether AI can write code. Whether it can write code that compounds.
I've been trying to build that — in systems, in teams, in how I develop engineers — for years. The unit of measurement changes. The failure mode doesn't.
I still can't measure it precisely.
But when it's working, a user updates their app and feels something they can't name. That feeling travels up through your CMO. It reaches you.
That's the score that matters. And it's the one most AI governance conversations aren't yet designed to reach.
In Part 2: what happens when two engineering organisations face this at scale — and respond differently. Amazon instrumented AI across millions of orders. Stripe built 6 billion test runs a day. Same tools. What each organisation chose to trust, and how much, is the whole story.
Further reading
- SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration — Chen, Xu, Wei, Chen & Zhao (Sun Yat-sen University / Alibaba Group, March 4, 2026). The paper this post responds to.
- SWE-CI GitHub repository — Open benchmark code and dataset.
- 75% of AI Coding Agents Break Working Code Over Time — Coverage of the SWE-CI findings.
- SWE-CI Exposes What AI Coding Agents Still Can't Do — Analysis of the benchmark's implications.
- Cursor shows AI agents capable of shoddy code at scale — The Register's reporting on FastRender.
- Cursor's AI Revolution: Building a Browser from Scratch — Fortune's take on the multi-agent architecture.
- Michael Truell's original X announcement — January 14, 2026.
- Scaling long-running autonomous coding — Simon Willison's analysis.
- Introducing SWE-bench Verified — OpenAI's response to SWE-bench reliability problems.
- SWE-bench Leaderboard — The benchmark SWE-CI builds upon.
- Lehman, M.M. (1980). "Programs, Life Cycles, and Laws of Software Evolution." Proceedings of the IEEE, 68, 1060–1076. — The original paper behind Lehman's Laws, based on IBM's OS/360.
- Lehman's Laws of Software Evolution — Wikipedia overview and critique.