<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nagarjuna Yelisetty</title>
    <description>The latest articles on DEV Community by Nagarjuna Yelisetty (@ynagarjuna).</description>
    <link>https://dev.to/ynagarjuna</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3909416%2F247f4bb2-b37e-4823-ae56-bc667a0dd00f.jpg</url>
      <title>DEV Community: Nagarjuna Yelisetty</title>
      <link>https://dev.to/ynagarjuna</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ynagarjuna"/>
    <language>en</language>
    <item>
      <title>The Boring Engineering You Did Is Now AI Infrastructure</title>
      <dc:creator>Nagarjuna Yelisetty</dc:creator>
      <pubDate>Sat, 02 May 2026 19:51:45 +0000</pubDate>
      <link>https://dev.to/ynagarjuna/the-boring-engineering-you-did-is-now-ai-infrastructure-56k4</link>
      <guid>https://dev.to/ynagarjuna/the-boring-engineering-you-did-is-now-ai-infrastructure-56k4</guid>
      <description>&lt;p&gt;Part 2 of 5 in &lt;strong&gt;The New Engineering Contract&lt;/strong&gt; - what it means to lead engineers when AI is doing more of the coding.&lt;/p&gt;

&lt;p&gt;Stripe never skipped the boring stuff. They ship 1,300 AI PRs a week. Amazon skipped it. Their storefront went down for six hours. Kent Beck wrote the answer in &lt;strong&gt;&lt;em&gt;Extreme Programming Explained&lt;/em&gt;&lt;/strong&gt; in 1999. We read it. Then chose velocity anyway.&lt;/p&gt;




&lt;p&gt;A friend of mine leads engineering at a funded startup.&lt;/p&gt;

&lt;p&gt;Sharp person. Good instincts. We talk regularly about what's actually happening in engineering. Not the conference version. The real version.&lt;/p&gt;

&lt;p&gt;Last month he told me something that has been sitting with me since.&lt;/p&gt;

&lt;p&gt;His board had just seen another AI productivity deck. The kind with the 4.5x velocity slide. He said: &lt;em&gt;"I need to show something in three weeks or I'll be the only person in the room without a number."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I've heard variations of this from almost every engineering leader I know right now. The pressure isn't coming from incompetence. It's coming from a genuine fear of falling behind, and a market that's rewarding speed over everything else.&lt;/p&gt;

&lt;p&gt;But here's what I've been watching.&lt;/p&gt;

&lt;p&gt;The organisations that are winning with AI didn't change what they valued when AI arrived. They automated what they already believed.&lt;/p&gt;

&lt;p&gt;To understand why, you have to go back further than Amazon and Stripe. You have to start with a pattern most engineering leaders recognise but rarely say out loud.&lt;/p&gt;




&lt;h2&gt;The pattern nobody is talking about&lt;/h2&gt;

&lt;p&gt;There's an engineer who gets the call. Not once. Every day. Same time. Same issue. Same fix.&lt;/p&gt;

&lt;p&gt;A CRON fails. Server goes down. Engineer restarts it. Gets praised in standup. Three years later, same engineer, same call, same restart, same appraisal comment: "great context, always available."&lt;/p&gt;

&lt;p&gt;Nobody asks why the CRON still fails.&lt;/p&gt;

&lt;p&gt;The engineer who quietly prevented three other issues from ever becoming calls? Invisible. No heroics. No story. No raise.&lt;/p&gt;

&lt;p&gt;This is the default incentive structure of most engineering orgs. Not by design. By inertia.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8y8lug745rl06u5uxbi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8y8lug745rl06u5uxbi.jpg" alt="Prevention vs Cure — the default incentive structure of most engineering orgs" width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now AI is running the same pattern.&lt;/p&gt;

&lt;p&gt;First output is wow. Demo runs clean. PR merges fast. Nobody asks what happens on commit 47. Nobody tracks whether the same regression is back next sprint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI didn't create this incentive problem. It inherited it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kent Beck described this failure mode in &lt;em&gt;Extreme Programming Explained&lt;/em&gt; in 1999.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;The cost of a bug rises dramatically the longer it goes undetected. Find it in development: cheap. Find it in production: expensive. Find it a year later in a system nobody understands anymore: catastrophic.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Paraphrased from Extreme Programming Explained, Kent Beck, 1999&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most teams read that. Nodded. Then optimised for velocity anyway.&lt;/p&gt;

&lt;p&gt;Then AI arrived. The same cycle is now running at machine speed. Features fast. Bugs compound. Hero celebrated. Foundation ignored.&lt;/p&gt;

&lt;p&gt;Kent Beck had one line for this moment too.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;&lt;em&gt;"Optimism is an occupational hazard of programming. Feedback is the treatment."&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Kent Beck, Extreme Programming Explained, 1999&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Amazon was optimistic. Stripe built feedback. The rest is six hours of downtime, 21,716 outage reports, and a checkout button that didn't work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI didn't create the problem. It just stopped hiding it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;Amazon's answer: make adoption the goal&lt;/h2&gt;

&lt;p&gt;November 24, 2025. &lt;a href="https://awesomeagents.ai/news/amazon-kiro-ai-aws-outages" rel="noopener noreferrer"&gt;An internal memo co-signed by SVPs Peter DeSantis (AWS) and Dave Treadwell (eCommerce)&lt;/a&gt; establishes Kiro, Amazon's own AI coding assistant, as the company standard. 80% weekly usage by year-end, tracked as a corporate OKR. Amazon reported 21,000 AI agents deployed across Stores, claiming $2B in cost savings and 4.5x developer velocity — numbers that made it to earnings calls.&lt;/p&gt;

&lt;p&gt;The engineers closest to the work weren't celebrating.&lt;/p&gt;

&lt;p&gt;Approximately &lt;a href="https://awesomeagents.ai/news/amazon-ai-code-review-outages-senior-approval" rel="noopener noreferrer"&gt;1,500 of them signed an internal petition&lt;/a&gt;. Their argument: the policy prioritised corporate product adoption over engineering quality. Senior AWS employees described what followed as "entirely foreseeable."&lt;/p&gt;

&lt;p&gt;Leadership couldn't walk it back. By the time executive sign-off arrived, capex plans had ballooned toward $200 billion for AI hardware. The investment narrative was already public. Walking back the mandate would have meant admitting the story was wrong, in an earnings call, in front of investors.&lt;/p&gt;

&lt;p&gt;The feedback was there. It just wasn't connected to anything that mattered to the people making decisions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmrhtvluopivr4y04zlg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmrhtvluopivr4y04zlg.png" alt="Amazon's AI mandate timeline — from Kiro rollout to six-hour storefront outage" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;December 2025. Kiro, assigned to resolve a software issue in AWS Cost Explorer, autonomously decided the best approach was to delete and recreate the entire environment. 13-hour outage. China region.&lt;/p&gt;

&lt;p&gt;February 2026. A second outage. Engineers let Amazon Q Developer resolve a production issue without intervention. Same pattern. Higher stakes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://thetechmarketer.com/amazon-down-login-price-errors" rel="noopener noreferrer"&gt;March 5, 2026. Amazon.com down for six hours. Checkout failed. Prices disappeared from listings. Login broken across website and mobile app. 21,716 outage reports at peak on Downdetector. Cause: a faulty software deployment.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazon's internal briefing note described the incidents as "novel GenAI usage" for which best practices and safeguards were not yet established, with high blast radius as a recurring characteristic.&lt;/p&gt;

&lt;p&gt;Here's what actually happened technically. The agent inherited a senior engineer's permissions and acted like one. Except an agent doesn't hesitate. There was no harness, no bounded scope, no deterministic guardrails, no approval gate for destructive operations. The model ran the system. The system didn't run the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon built the agent. They forgot to build the harness. The missing harness took their storefront down for six hours.&lt;/strong&gt;&lt;/p&gt;
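&lt;p&gt;For illustration, here is the shape of the gate that was missing. This is a sketch, not Amazon's internals: the tool names and the destructive-operations list are my assumptions. The point is that the gate is deterministic code, outside the model's reach.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical sketch of a deterministic approval gate between an
# agent and its tools. Not Amazon's code; every name here is invented.
DESTRUCTIVE = {"delete_environment", "drop_table", "terminate_instance"}

class ApprovalRequired(Exception):
    """The agent requested an action that needs a human sign-off."""

def run_tool(tool, args):
    # Stand-in for the real executor. It should hold scoped credentials
    # of its own, not inherit a senior engineer's permissions.
    print(f"executing {tool} with {args}")

def execute_tool_call(tool, args, approved_by=None):
    # The gate is plain code. The model cannot talk its way around it.
    if tool in DESTRUCTIVE and approved_by is None:
        raise ApprovalRequired(f"{tool} requires a named human approver")
    run_tool(tool, args)

execute_tool_call("read_logs", {"service": "cost-explorer"})        # fine
execute_tool_call("delete_environment", {"region": "cn-north-1"})   # raises
&lt;/code&gt;&lt;/pre&gt;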

&lt;p&gt;The pattern: ship the capability, mandate adoption, discover the failure in production, add the guardrail. In every post-incident review, the framing shifts toward operator error. The tool is never the problem. The person who used it is.&lt;/p&gt;

&lt;p&gt;Same cycle Beck warned about in 1999. Machine speed. Larger blast radius.&lt;/p&gt;




&lt;h2&gt;Stripe's answer: the model doesn't run the system&lt;/h2&gt;

&lt;p&gt;Stripe didn't wait for AI to care about feedback loops.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://newsletter.pragmaticengineer.com/p/stripe-part-2" rel="noopener noreferrer"&gt;Stripe runs more than six billion tests on code changes every day. Each change is verified within 15 minutes. Tests that would take 50 days to run on a single CPU.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That infrastructure wasn't built for AI. It was built because Stripe believed what Kent Beck wrote: that feedback is the only treatment for the occupational hazard of optimism in engineering. They built it at a scale Beck couldn't have imagined in 1999. And when AI arrived, it plugged straight in.&lt;/p&gt;

&lt;p&gt;When AI arrived at Stripe, they didn't scramble to add governance. They already had it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4xphr05jyehv2d3fkhc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4xphr05jyehv2d3fkhc.png" alt="Stripe Minions — blueprints that alternate deterministic nodes with open-ended agent loops" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Their AI agent system, Minions, is built on what they call blueprints: orchestration flows that alternate between fixed, deterministic code nodes and open-ended agent loops. As &lt;a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents" rel="noopener noreferrer"&gt;Stripe put it in their own engineering blog&lt;/a&gt;: "putting LLMs into contained boxes compounds into system-wide reliability upside." The model does not run the system. The system runs the model.&lt;/p&gt;

&lt;p&gt;This is harness engineering. The agent operates within a defined scope, gets a maximum of two CI rounds, terminates at a pull request, and cannot take destructive actions without explicit gates. Engineers can still intervene, but the agent produces the whole branch without hand-holding.&lt;/p&gt;
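&lt;p&gt;Stripe hasn't published the code, but the shape is reproducible. A minimal sketch of that shape, with my own naming and stubbed helpers: deterministic nodes around a bounded agent loop, capped at two CI rounds, terminating at a pull request, never a deploy.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from dataclasses import dataclass, field

# A sketch of the blueprint shape Stripe describes, not their code.
# The helpers are stubs standing in for an LLM call and a CI system.
MAX_CI_ROUNDS = 2  # the cap: two rounds of CI feedback, then stop

@dataclass
class CIResult:
    passed: bool
    failures: list = field(default_factory=list)

def agent_write_patch(task, scope, feedback=None):
    return f"patch for {task} within {scope}"  # stand-in: open-ended agent loop

def run_ci(patch):
    return CIResult(passed=True)               # stand-in: the real test suite

def run_blueprint(task, scope):
    patch = agent_write_patch(task, scope)
    for _ in range(MAX_CI_ROUNDS):
        result = run_ci(patch)                 # deterministic node
        if result.passed:
            return ("open_pr", patch)          # terminates at a PR, not a deploy
        patch = agent_write_patch(task, scope, feedback=result.failures)
    return ("escalate", patch)                 # out of rounds: a human decides

print(run_blueprint("fix flaky retry", scope="payments/retries"))
&lt;/code&gt;&lt;/pre&gt;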

&lt;p&gt;The result: Stripe engineers are merging 1,300 pull requests every week with zero human-written code, on a codebase with hundreds of millions of lines, handling over $1 trillion in annual payment volume.&lt;/p&gt;

&lt;p&gt;Not because their AI is smarter. Because their harness is tighter.&lt;/p&gt;

&lt;p&gt;AI reliability scales with the quality of its constraints, not the size of the model. Most teams are learning this the hard way. Stripe learned it before they needed to.&lt;/p&gt;

&lt;p&gt;And when something doesn't meet the bar, they remove it. Even features users love. Because a feature built on a weak foundation isn't a feature. It's debt with a good demo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://stripe.com/sessions/2024/building-a-culture-of-system-reliability" rel="noopener noreferrer"&gt;Rahul Patil, then CTO of Stripe and now CTO of Anthropic&lt;/a&gt;, speaking on Stripe's reliability culture in the context of the trust they maintain with payment partners and the financial infrastructure they operate, said something that has stayed with me. Reliability is a mindset, not a metric. You don't build it when you need it. You build it before you know you'll need it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The teams winning with AI didn't change what they valued when AI arrived. They automated what they already believed.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;What this looks like when you're not Stripe&lt;/h2&gt;

&lt;p&gt;I was building a critical frontend layer at Medibuddy. The first thing every user touches. The thing that gets blamed when anything feels slow, broken, or wrong, even when the problem is somewhere else entirely.&lt;/p&gt;

&lt;p&gt;We were preparing for a critical event. Load testing time.&lt;/p&gt;

&lt;p&gt;My team wanted to celebrate that it held at 3X load.&lt;/p&gt;

&lt;p&gt;I wanted to know where it breaks at 10X.&lt;/p&gt;

&lt;p&gt;Here's why that matters. At 3X, response times look acceptable. At 10X, they degrade, and they don't degrade equally. The user on a high-end phone with broadband barely notices. The user on a low-end Android device on a 3G network in a tier-3 city gets the worst of it. In a health platform, that user is often the one who needs the service most.&lt;/p&gt;

&lt;p&gt;The breaking point isn't about finding failure for its own sake. It's about knowing exactly where your system starts punishing your most vulnerable users, so you can build a roadmap with real data instead of comfortable assumptions. Without that number, every platform decision is a guess. With it, you know what to fix first and why.&lt;/p&gt;
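&lt;p&gt;Finding that number needs no special tooling. A rough sketch of the exercise, assuming an async HTTP client and a hypothetical staging endpoint: ramp concurrency past the comfortable multiple and watch where the tail latency turns.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import asyncio, time
import aiohttp  # assumption: aiohttp is installed; any async client works

URL = "https://staging.example.com/health"  # hypothetical endpoint
BASELINE = 50                               # hypothetical 1X concurrency

async def one_request(session, latencies):
    start = time.perf_counter()
    async with session.get(URL) as resp:
        await resp.read()
    latencies.append(time.perf_counter() - start)

async def measure(multiplier):
    latencies = []
    async with aiohttp.ClientSession() as session:
        tasks = [one_request(session, latencies)
                 for _ in range(BASELINE * multiplier)]
        await asyncio.gather(*tasks, return_exceptions=True)
    if latencies:
        latencies.sort()
        p95 = latencies[int(len(latencies) * 0.95)]
        print(f"{multiplier}X: p95 {p95:.3f}s, {len(latencies)} requests served")
    else:
        print(f"{multiplier}X: every request failed. That is the number.")

async def main():
    for multiplier in (1, 3, 5, 10):  # keep going past the number you celebrate
        await measure(multiplier)

asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;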

&lt;p&gt;My team called me a borderline psycho.&lt;/p&gt;

&lt;p&gt;I didn't have a name for what I was doing. I just knew that celebrating 3X without knowing where 10X breaks is guesswork dressed as confidence.&lt;/p&gt;

&lt;p&gt;Stripe calls it practicing your worst day every day.&lt;/p&gt;

&lt;p&gt;I was doing it at Medibuddy by instinct, without knowing it had a name, without the cultural backing, while my team pushed back.&lt;/p&gt;

&lt;p&gt;The principle doesn't require Stripe's infrastructure. It requires the decision to care about the foundation before the incident tells you to.&lt;/p&gt;

&lt;p&gt;If your team has ever called you difficult for asking the uncomfortable question, you weren't being difficult. You were doing the job nobody celebrates until the system breaks without it.&lt;/p&gt;




&lt;h2&gt;The thing most teams are missing&lt;/h2&gt;

&lt;p&gt;Evals are test cases. Skills files are documentation. Agent loops are CI pipelines.&lt;/p&gt;

&lt;p&gt;Nobody wants to hear this because it means the AI transformation project is actually a culture and discipline project wearing a technology hat.&lt;/p&gt;

&lt;p&gt;If your team couldn't write tests before AI, they can't write evals now. If they didn't write documentation before AI, skills files will be ignored. If they didn't build feedback loops before AI, the agent loop will generate failures faster than anyone can review them.&lt;/p&gt;
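&lt;p&gt;That first claim isn't a metaphor. Strip the vocabulary and an eval is a test function: fixed input, asserted expectation, run in CI. A minimal sketch, with the model call stubbed out:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# An eval is a test case wearing new vocabulary. Minimal sketch;
# call_model is a stub for whatever completion API you actually run.
def call_model(prompt):
    return "REFUND_ELIGIBLE: item arrived damaged"  # stand-in response

def test_refund_policy_eval():
    # Same shape as any unit test. Run it on every prompt, model, or
    # skills-file change, exactly like the rest of the suite.
    output = call_model("Order 123 arrived broken. Can I get a refund?")
    assert "REFUND_ELIGIBLE" in output
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If writing that feels unfamiliar, the problem predates the model.&lt;/p&gt;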

&lt;p&gt;&lt;strong&gt;The model is not the risk. The system around the model is the risk. Most teams are buying models and skipping systems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i20w9sut543568m0ky6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i20w9sut543568m0ky6.png" alt="Nothing changed. Only the name did. Evals are test cases. Skills files are documentation. Agent loops are CI pipelines." width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the headcount conversation becomes dangerous.&lt;/p&gt;

&lt;p&gt;This reminds me of a conversation I had with a senior leader at a previous company. Half-joking, half-serious, they looked at me and said: "Since you are already using AI, leveraging it and delivering faster, you can probably cut the team by 50% and still deliver the same output, right?"&lt;/p&gt;

&lt;p&gt;It's the kind of comment that sounds like a compliment. It isn't.&lt;/p&gt;

&lt;p&gt;It assumes AI is a headcount equation. Pick up the tool, drop the headcount. Nobody asked what the tool runs on.&lt;/p&gt;

&lt;p&gt;My answer: same team, same timeline. But 50% better quality, maybe 100%. That is what AI actually unlocks when the foundation is already there.&lt;/p&gt;

&lt;p&gt;Amazon had 21,000 agents and no harness. The agents found every gap in the system. Stripe had the harness first. The agents plugged into it cleanly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI didn't create the gaps. Speed found them. AI just made the finding public.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;Whether it's my friend's board meeting or yours&lt;/h2&gt;

&lt;p&gt;Two numbers. That's all that's worth bringing into that room.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Change failure rate&lt;/strong&gt; before and after AI tools. If it's rising, you don't have a quality contract yet. You have an adoption OKR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time for a regression to surface.&lt;/strong&gt; How long between a broken deploy and someone knowing about it? If that number is measured in days rather than minutes, your harness isn't built.&lt;/p&gt;
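&lt;p&gt;Neither number needs a vendor dashboard. If deploys and regressions are logged with timestamps, both fall out in a few lines. A sketch, assuming that minimal record shape:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from datetime import datetime

# Sketch only. Assumes each deploy record notes whether it caused a
# failure, and each regression records deploy and detection times.
deploys = [
    {"id": "d1", "caused_failure": False},
    {"id": "d2", "caused_failure": True},
    {"id": "d3", "caused_failure": False},
]
regressions = [
    {"deployed": datetime(2026, 3, 5, 9, 0),
     "detected": datetime(2026, 3, 5, 15, 0)},
]

failure_rate = sum(d["caused_failure"] for d in deploys) / len(deploys)

hours_to_surface = [
    (r["detected"] - r["deployed"]).total_seconds() / 3600
    for r in regressions
]
mean_hours = sum(hours_to_surface) / len(hours_to_surface)

print(f"change failure rate: {failure_rate:.0%}")   # rising? adoption OKR
print(f"mean hours to surface: {mean_hours:.1f}")   # days, not minutes? no harness
&lt;/code&gt;&lt;/pre&gt;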

&lt;p&gt;If you don't have those numbers, that's the answer. Not about AI. About whether your foundation exists at all.&lt;/p&gt;

&lt;p&gt;But here's what the numbers won't tell you. Numbers are a lagging signal. The culture that produces them is the leading one.&lt;/p&gt;

&lt;p&gt;Amazon's engineers knew. 1,500 of them said so in writing. The culture didn't hear them because it was optimising for a different signal. Adoption rate, velocity, the 4.5x slide.&lt;/p&gt;

&lt;p&gt;The engineering leaders who will navigate this decade aren't the ones who adopt AI fastest. They're the ones who build teams where an engineer can raise a concern without being dismissed. Where a slow test suite is treated as a system problem, not a productivity complaint. Where maintaining something well is as celebrated as shipping something new.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Speed without a culture of ownership, feedback and accountability doesn't compound. It just breaks faster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Build the harness. Build the culture that maintains it. Then bring the number.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The boring engineering you did before AI arrived? That's the moat now. Stripe proved it. Amazon proved it differently.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;&lt;a href="https://dev.to/nagarjuna_yelisetty/ai-agents-dont-fail-at-code-they-fail-at-learning-51om"&gt;In Part 1: AI Agents Don't Fail at Code. They Fail at Learning&lt;/a&gt;, I wrote about how AI agents fail not at writing code but at maintaining it — and how I realised I had never measured maintainability precisely either, for AI or for my own team.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In Part 3, I'll write about what happened when I tried to build with AI myself. Burned $100. Blamed the model. Took a break to step back from the FOMO and anxiety. Came back with one question nobody is asking: if AI mimics the person in front of it — what happens when that person has nothing left to teach it?&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Further reading&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://awesomeagents.ai/news/amazon-kiro-ai-aws-outages" rel="noopener noreferrer"&gt;Amazon Kiro AI AWS Outages&lt;/a&gt; — Timeline of Amazon's AI mandate and resulting incidents&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://awesomeagents.ai/news/amazon-ai-code-review-outages-senior-approval" rel="noopener noreferrer"&gt;Amazon AI Code Review Outages and Senior Approval&lt;/a&gt; — The internal petition and what followed&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://thetechmarketer.com/amazon-down-login-price-errors" rel="noopener noreferrer"&gt;Amazon.com March 2026 outage&lt;/a&gt; — Six hours of checkout failure&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-coding-agents" rel="noopener noreferrer"&gt;Stripe Engineering: Minions&lt;/a&gt; — How Stripe's one-shot coding agents work&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://newsletter.pragmaticengineer.com/p/stripe-part-2" rel="noopener noreferrer"&gt;Stripe's engineering culture (Pragmatic Engineer)&lt;/a&gt; — The 6B tests/day infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://stripe.com/sessions/2024/building-a-culture-of-system-reliability" rel="noopener noreferrer"&gt;Stripe Sessions 2024 — Building a culture of system reliability&lt;/a&gt; — Rahul Patil on reliability as a mindset&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Extreme Programming Explained&lt;/em&gt; — Kent Beck, 1999&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>engineeringleadership</category>
      <category>softwarequality</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>AI Agents Don't Fail at Code. They Fail at Learning.</title>
      <dc:creator>Nagarjuna Yelisetty</dc:creator>
      <pubDate>Sat, 02 May 2026 18:54:16 +0000</pubDate>
      <link>https://dev.to/ynagarjuna/ai-agents-dont-fail-at-code-they-fail-at-learning-51om</link>
      <guid>https://dev.to/ynagarjuna/ai-agents-dont-fail-at-code-they-fail-at-learning-51om</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 1 of 5 in The New Engineering Contract — what it means to lead engineers when AI is doing more of the coding.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;SWE-CI tested 18 AI models across 71 consecutive commits. Most broke something on commit 47 they'd already broken on commit 1. That's not an intelligence problem. That's a learning system that isn't learning.&lt;/p&gt;




&lt;p&gt;A paper made me uncomfortable this month.&lt;/p&gt;

&lt;p&gt;Not because of what it found about AI. Because of what it revealed about how I think about my own work.&lt;/p&gt;

&lt;p&gt;The paper is &lt;a href="https://arxiv.org/abs/2603.03823" rel="noopener noreferrer"&gt;SWE-CI&lt;/a&gt;, published March 4, 2026 by researchers at Sun Yat-sen University and Alibaba Group. It tested 18 AI models across 100 real codebases — not single bug fixes, but 71 consecutive commits of genuine evolution. The core finding: most state-of-the-art models have a zero-regression rate below 0.25. Three out of four times, the agent fixed something and silently broke something else downstream.&lt;/p&gt;

&lt;p&gt;I read that and thought: that's a learning problem, not a coding problem.&lt;/p&gt;




&lt;h2&gt;What the paper actually tests&lt;/h2&gt;

&lt;p&gt;Most benchmarks ask: &lt;em&gt;can an AI fix this bug?&lt;/em&gt; SWE-CI asks a harder question.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"SWE-CI moves beyond fixing individual bugs and instead focuses on the evolutionary trajectory between two commit versions."&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;SWE-CI paper, Chen et al., 2026&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The benchmark covers 100 tasks, each spanning an average of 233 days and 71 consecutive real commits. Agents must navigate a full CI loop — generating requirements, modifying source code, running tests — iteratively, not in a single shot. That's the difference between a sprint task and a six-month project. The paper is evaluating the second thing.&lt;/p&gt;
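&lt;p&gt;My rough reconstruction of that protocol, collapsed to a single agent with stubbed pieces (the paper uses an Architect–Programmer pair, and the real benchmark runs actual test suites):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Rough reconstruction of the SWE-CI loop from the paper's description;
# not the benchmark's code. run_ci is a stub for the repo's real tests.
def run_ci(patch):
    # The benchmark's question at every step: did this patch break
    # tests that previously passed?
    return hash(patch) % 4 != 0            # pretend 3 in 4 patches are clean

def evaluate_task(agent, commits):
    regressions = 0
    for commit in commits:                 # ~71 consecutive real commits
        patch = agent(commit)              # derive requirements, modify source
        if not run_ci(patch):              # the CI step, every commit
            regressions += 1               # a silent downstream break
    return regressions == 0                # did the task end regression-free?

def toy_agent(commit):
    return f"patch for {commit}"

tasks = [[f"task{t}-commit{i}" for i in range(71)] for t in range(100)]
rate = sum(evaluate_task(toy_agent, task) for task in tasks) / len(tasks)
print(f"zero-regression rate: {rate:.2f}")  # most frontier models: below 0.25
&lt;/code&gt;&lt;/pre&gt;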

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp39cmafwov3l6rrauvff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp39cmafwov3l6rrauvff.png" alt="SWE-CI benchmark framework and dual-agent workflow" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 1 from the paper: SWE-CI's Architect–Programmer dual-agent evaluation protocol. The agent must execute a CI-loop across 71 consecutive commits — not patch a single bug in isolation.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I have one signal I've used for years to tell whether someone on my team is actually growing: are they making different mistakes?&lt;/p&gt;

&lt;p&gt;Make the same mistake twice and I'm concerned. Three times and I have a conversation — not a performance conversation, a diagnostic one. I want to understand the mechanism. Did the signal not reach them? Did they receive it and not act on it? Did they act on it and still land in the same place?&lt;/p&gt;

&lt;p&gt;The answer changes everything I do next. A signal that didn't reach someone is an infrastructure problem — maybe they weren't in the right post-mortems, or the runbook is wrong. A signal received but not acted on is a motivation or attention problem. A signal acted on but still producing the same failure is a mental model problem — they changed the surface behaviour without touching the root cause.&lt;/p&gt;

&lt;p&gt;Ten times the same mistake, none of those explanations hold. That's carelessness or disengagement, and I treat it differently.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The same mistake twice is entropy. A new mistake is evidence of a mind moving forward.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I didn't always run this diagnostic. At Medibuddy, we had a recurring 401 issue — users being logged out mid-flow in the webview even when they were still logged into the native app. The code review instruction was explicit: handle 401 universally, refresh the token, add exponential backoff, apply it regardless of whether the user came from Android, iOS, or web. One engineer fixed it in the obvious flow. I reviewed the PR, it looked right, and moved on. Three weeks later, the same incomplete pattern surfaced in a different flow. Same 401. Different screen.&lt;/p&gt;

&lt;p&gt;I had reviewed the output, not diagnosed the understanding. They'd absorbed the instruction for one case. The mental model hadn't transferred. That's not a skill failure. That's a learning failure. It has a specific shape.&lt;/p&gt;
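&lt;p&gt;The instruction itself fits in a screen of code, which is what made the incomplete transfer so telling. A minimal sketch of the universal handler that review was asking for; the client library and endpoint are assumptions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time
import requests  # assumption: any HTTP client with this shape works

REFRESH_URL = "https://api.example.com/auth/refresh"  # hypothetical endpoint
MAX_RETRIES = 3

def refresh_token(session):
    resp = session.post(REFRESH_URL)
    resp.raise_for_status()
    session.headers["Authorization"] = "Bearer " + resp.json()["access_token"]

def api_call(session, method, url, **kwargs):
    # One wrapper every flow goes through, whether the user arrived from
    # Android, iOS, or web. Handle 401 here once; no screen can forget.
    for attempt in range(MAX_RETRIES):
        resp = session.request(method, url, **kwargs)
        if resp.status_code != 401:
            return resp
        refresh_token(session)
        time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s
    return resp
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Fixing one flow means writing this logic where one screen can see it. Fixing the mental model means putting it where every flow has to pass through it.&lt;/p&gt;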




&lt;h2&gt;AI agents have the same shape&lt;/h2&gt;

&lt;p&gt;Now look at most AI agents. They fail the same way on commit 47 that they did on commit 1. There's no diagnostic conversation. No signal-to-action loop. No mechanism to distinguish "I didn't receive the signal" from "I received it and didn't know what to do with it." The agent just proceeds. Same failure pattern, new commit.&lt;/p&gt;

&lt;p&gt;The paper formalises this with &lt;strong&gt;EvoScore&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Good maintenance not only ensures functional correctness of current code, but minimizes difficulty of keeping code correct."&lt;/p&gt;

&lt;p&gt;— &lt;em&gt;SWE-CI paper, Chen et al., 2026&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;EvoScore doesn't ask whether an agent passes tests. It asks whether passing today's tests makes tomorrow's tests easier or harder. An agent that hardcodes an assumption — true right now — passes commit 1 and silently poisons commit 12. An agent that fixes the underlying abstraction makes the next three commits cleaner.&lt;/p&gt;
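&lt;p&gt;The hardcoded-assumption failure has a recognisable shape in code. A toy contrast, mine rather than the paper's: both versions pass commit 1's test, and they age very differently.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Toy contrast, not from the paper. Both pass today's test.

# Version A hardcodes the assumption that fees are always 2.9% + 30c.
# True on commit 1. Commit 12 adds a second fee schedule and this
# function silently poisons every total it touches.
def charge_total_v1(amount_cents):
    return amount_cents + int(amount_cents * 0.029) + 30

# Version B fixes the abstraction. Commit 12 passes new data instead
# of editing the function. The next three commits get cleaner.
def charge_total_v2(amount_cents, fee_rate=0.029, fee_fixed_cents=30):
    return amount_cents + int(amount_cents * fee_rate) + fee_fixed_cents

assert charge_total_v1(1000) == charge_total_v2(1000) == 1059
&lt;/code&gt;&lt;/pre&gt;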

&lt;p&gt;That's the same thing I'm measuring when I track whether an engineer makes different mistakes — are their decisions compounding toward something, or just recurring?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FSKYLENAGE-AI%2FSWE-CI%2Fmain%2Fdocs%2Fresult.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2FSKYLENAGE-AI%2FSWE-CI%2Fmain%2Fdocs%2Fresult.png" alt="SWE-CI model leaderboard — EvoScore results across 18 models" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 2 from the paper: Model leaderboard measured by Average Normalized Change (ANC). Only the Claude Opus series exceeds a 50% zero-regression rate. Every other model falls below 25%.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I've been building against this failure mode for years. At Medibuddy, we made a deliberate platform shift: migrate from AngularJS to React, move away from native apps toward a unified web layer — an NX monorepo with shared libraries owning the hard parts. Authentication flows. The native-web bridge. Event contracts. The component layer. Every product team built on those blocks rather than rebuilding them. The kind of investment big tech formalises as internal developer platforms or design systems. We called it Web LEGO.&lt;/p&gt;

&lt;p&gt;The design principle wasn't elegance. It was familiarity. If something breaks, it breaks the same way for everyone. Familiar failures get diagnosed faster. Familiar failures get fixed faster. The platform aged well not because it was clever, but because it stopped surprising us.&lt;/p&gt;

&lt;p&gt;But I couldn't tell you that as a number. I could feel it — maintenance windows stopped appearing in my calendar, teams stopped fearing release Fridays — but I had no score. No rate. No proof.&lt;/p&gt;

&lt;p&gt;The clearest signal came from outside engineering entirely. After one performance optimisation, our CMO passed feedback to our CTO, who passed it to me: &lt;em&gt;"The Android app feels faster."&lt;/em&gt; My dashboard showed nothing. API response times flat. Error rate flat. Crash rate flat. But a user felt something, and that feeling travelled through the C-suite before it reached the people who built it.&lt;/p&gt;

&lt;p&gt;That is the measurement gap. The best systems earn trust so thoroughly they bypass your instruments entirely.&lt;/p&gt;




&lt;h2&gt;The question SWE-CI is asking&lt;/h2&gt;

&lt;p&gt;The paper has limits. 100 repositories, Python only, no human baseline. Lehman's Laws — which it cites as foundational — were &lt;a href="https://users.ece.utexas.edu/~perry/education/SE-Intro/lehman.pdf" rel="noopener noreferrer"&gt;social observations from IBM's OS/360 system in 1980&lt;/a&gt;, and Lehman himself later clarified they should be read as social-science laws, not physical constants. EvoScore will be gamed — or transcended. As agentic coding shifts from single-shot generation to continuous autonomous loops across commit timelines, the next wave of models will be optimised for exactly this trajectory. The benchmark becomes the floor, not the ceiling. &lt;a href="https://openai.com/index/introducing-swe-bench-verified/" rel="noopener noreferrer"&gt;The same pattern played out with SWE-bench&lt;/a&gt;, compromised within 18 months of release. That evolution won't dissolve the learning problem. It will make it harder to see.&lt;/p&gt;

&lt;p&gt;But the question it's asking is the right one.&lt;/p&gt;

&lt;p&gt;Michael Truell, CEO of Cursor, posted this in January 2026:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week. It's 3M+ lines of code across thousands of files... It &lt;em&gt;kind of&lt;/em&gt; works!"&lt;/p&gt;

&lt;p&gt;— &lt;a href="https://x.com/mntruell/status/2011562190286045552" rel="noopener noreferrer"&gt;Michael Truell on X, January 14, 2026&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.theregister.com/2026/01/22/cursor_ai_wrote_a_browser/" rel="noopener noreferrer"&gt;The Register called it shoddy code at scale.&lt;/a&gt; Both descriptions are accurate. It passed commit 1. Nobody knows what it looks like at commit 47 — because it was never built to reach commit 47. That's not a failure of AI capability. That's a failure of what we decided to measure.&lt;/p&gt;

&lt;p&gt;You can't fix what you aren't tracking. I learned that watching engineers make the same mistake twice.&lt;/p&gt;




&lt;h2&gt;The uncomfortable truth&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable truth I can't argue my way around.&lt;/p&gt;

&lt;p&gt;Most SOTA models have a zero-regression rate below 0.25. That number hasn't moved significantly across the frontier models I use today. Which means if I take my hands off the wheel — merge code I haven't read, deploy features I haven't traced the assumptions on — I'm accepting a 75% chance of silent breakage downstream.&lt;/p&gt;

&lt;p&gt;That's not a reason to stop using AI. It's a reason to stay in the loop.&lt;/p&gt;

&lt;p&gt;I use it to draft TRDs — not to write them, but to surface the assumptions I'd have held silently. I use it as a sounding board before committing to a direction. I use it to prototype fast, then review every prototype for what it assumes before it goes near production. Fast code carries fast assumptions. Speed and carelessness travel together.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The loop isn't friction. It's the only thing converting AI's output speed into engineering quality.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Blind AI coding isn't a productivity strategy. It's entropy at machine speed.&lt;/p&gt;




&lt;h2&gt;One changed question&lt;/h2&gt;

&lt;p&gt;After reading this paper, I changed one question in how I review AI-generated code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; does this pass the tests?&lt;br&gt;
&lt;strong&gt;After:&lt;/strong&gt; what does this fix assume — and will that assumption still hold after the next three features?&lt;/p&gt;

&lt;p&gt;That question isn't in most PR templates. It should be — for AI-generated code. And honestly, for human-written code too.&lt;/p&gt;

&lt;p&gt;The difference is that a person, over time, can internalise that question and start asking it themselves. The learning compounds. AI agents right now don't have that mechanism. Every commit is day one. The agent that fixed the 401 in flow A has no memory of flow B. No diagnostic loop. No compounding.&lt;/p&gt;

&lt;p&gt;That's what SWE-CI is measuring. Not whether AI can write code. Whether it can write code that compounds.&lt;/p&gt;

&lt;p&gt;I've been trying to build that — in systems, in teams, in how I develop engineers — for years. The unit of measurement changes. The failure mode doesn't.&lt;/p&gt;

&lt;p&gt;I still can't measure it precisely.&lt;/p&gt;

&lt;p&gt;But when it's working, a user updates their app and feels something they can't name. That feeling travels up through your CMO. It reaches you.&lt;/p&gt;

&lt;p&gt;That's the score that matters. And it's the one most AI governance conversations aren't yet designed to reach.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;In Part 2: what happens when two engineering organisations face this at scale — and respond differently. Amazon instrumented AI across millions of orders. Stripe built 6 billion test runs a day. Same tools. What each organisation chose to trust, and how much, is the whole story.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;Further reading&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2603.03823" rel="noopener noreferrer"&gt;SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration&lt;/a&gt; — Chen, Xu, Wei, Chen &amp;amp; Zhao (Sun Yat-sen University / Alibaba Group, March 4, 2026). The paper this post responds to.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/SKYLENAGE-AI/SWE-CI" rel="noopener noreferrer"&gt;SWE-CI GitHub repository&lt;/a&gt; — Open benchmark code and dataset.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://awesomeagents.ai/news/alibaba-swe-ci-ai-coding-agents-long-term-maintenance/" rel="noopener noreferrer"&gt;75% of AI Coding Agents Break Working Code Over Time&lt;/a&gt; — Coverage of the SWE-CI findings.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://topaiproduct.com/2026/03/09/swe-ci-exposes-what-ai-coding-agents-still-cant-do/" rel="noopener noreferrer"&gt;SWE-CI Exposes What AI Coding Agents Still Can't Do&lt;/a&gt; — Analysis of the benchmark's implications.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.theregister.com/2026/01/22/cursor_ai_wrote_a_browser/" rel="noopener noreferrer"&gt;Cursor shows AI agents capable of shoddy code at scale&lt;/a&gt; — The Register's reporting on FastRender.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://fortune.com/2026/01/23/cursor-built-web-browser-with-swarm-ai-agents-powered-openai/" rel="noopener noreferrer"&gt;Cursor's AI Revolution: Building a Browser from Scratch&lt;/a&gt; — Fortune's take on the multi-agent architecture.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://x.com/mntruell/status/2011562190286045552" rel="noopener noreferrer"&gt;Michael Truell's original X announcement&lt;/a&gt; — January 14, 2026.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://simonwillison.net/2026/jan/19/scaling-long-running-autonomous-coding/" rel="noopener noreferrer"&gt;Scaling long-running autonomous coding&lt;/a&gt; — Simon Willison's analysis.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://openai.com/index/introducing-swe-bench-verified/" rel="noopener noreferrer"&gt;Introducing SWE-bench Verified&lt;/a&gt; — OpenAI's response to SWE-bench reliability problems.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.swebench.com/" rel="noopener noreferrer"&gt;SWE-bench Leaderboard&lt;/a&gt; — The benchmark SWE-CI builds upon.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://users.ece.utexas.edu/~perry/education/SE-Intro/lehman.pdf" rel="noopener noreferrer"&gt;Lehman, M.M. (1980). "Programs, Life Cycles, and Laws of Software Evolution." &lt;em&gt;Proceedings of the IEEE&lt;/em&gt;, 68, 1060–1076.&lt;/a&gt; — The original paper behind Lehman's Laws, based on IBM's OS/360.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Lehman%27s_laws_of_software_evolution" rel="noopener noreferrer"&gt;Lehman's Laws of Software Evolution&lt;/a&gt; — Wikipedia overview and critique.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>maintainability</category>
      <category>engineeringleadership</category>
      <category>codequality</category>
    </item>
  </channel>
</rss>
