Part 2 of 5 in The New Engineering Contract - what it means to lead engineers when AI is doing more of the coding.
Stripe never skipped the boring stuff. They ship 1,300 AI PRs a week. Amazon skipped it. Their storefront went down for six hours. Kent Beck wrote the answer in Extreme Programming Explained in 1999. We read it. Then chose velocity anyway.
A friend of mine leads engineering at a funded startup.
Sharp person. Good instincts. We talk regularly about what's actually happening in engineering. Not the conference version. The real version.
Last month he told me something that has been sitting with me since.
His board had just seen another AI productivity deck. The kind with the 4.5x velocity slide. He said: "I need to show something in three weeks or I'll be the only person in the room without a number."
I've heard variations of this from almost every engineering leader I know right now. The pressure isn't coming from incompetence. It's coming from a genuine fear of falling behind, and a market that's rewarding speed over everything else.
But here's what I've been watching.
The organisations that are winning with AI didn't change what they valued when AI arrived. They automated what they already believed.
To understand why, you have to go back further than Amazon and Stripe. You have to start with a pattern most engineering leaders recognise but rarely say out loud.
The pattern nobody is talking about
There's an engineer who gets the call. Not once. Every day. Same time. Same issue. Same fix.
A cron job fails. A server goes down. The engineer restarts it. Gets praised in standup. Three years later, same engineer, same call, same restart, same appraisal comment: "great context, always available."
Nobody asks why the cron job still fails.
The engineer who quietly prevented three other issues from ever becoming calls? Invisible. No heroics. No story. No raise.
This is the default incentive structure of most engineering orgs. Not by design. By inertia.
Now AI is running the same pattern.
The first output wows. The demo runs clean. The PR merges fast. Nobody asks what happens on commit 47. Nobody tracks whether the same regression is back next sprint.
AI didn't create this incentive problem. It inherited it.
Kent Beck described this failure mode in Extreme Programming Explained in 1999.
The cost of a bug rises dramatically the longer it goes undetected. Find it in development: cheap. Find it in production: expensive. Find it a year later in a system nobody understands anymore: catastrophic.
Paraphrased from Extreme Programming Explained, Kent Beck, 1999
Most teams read that. Nodded. Then optimised for velocity anyway.
Then AI arrived. The same cycle is now running at machine speed. Features fast. Bugs compound. Hero celebrated. Foundation ignored.
Kent Beck had one line for this moment too.
"Optimism is an occupational hazard of programming. Feedback is the treatment."
Kent Beck, Extreme Programming Explained, 1999
Amazon was optimistic. Stripe built feedback. The rest is six hours of downtime, 21,716 outage reports, and a checkout button that didn't work.
AI didn't create the problem. It just stopped hiding it.
Amazon's answer: make adoption the goal
November 24, 2025. An internal memo co-signed by SVPs Peter DeSantis (AWS) and Dave Treadwell (eCommerce) establishes Kiro, Amazon's own AI coding assistant, as the company standard. 80% weekly usage by year-end, tracked as a corporate OKR. Amazon reported 21,000 AI agents deployed across Stores, claiming $2B in cost savings and 4.5x developer velocity — numbers that made it to earnings calls.
The engineers closest to the work weren't celebrating.
Approximately 1,500 of them signed an internal petition. Their argument: the policy prioritised corporate product adoption over engineering quality. Senior AWS employees described what followed as "entirely foreseeable."
Leadership couldn't walk it back. By the time executive sign-off arrived, capex plans had ballooned toward $200 billion for AI hardware. The investment narrative was already public. Walking back the mandate would have meant admitting the story was wrong, in an earnings call, in front of investors.
The feedback was there. It just wasn't connected to anything that mattered to the people making decisions.
December 2025. Kiro, assigned to resolve a software issue in AWS Cost Explorer, autonomously decided the best approach was to delete and recreate the entire environment. 13-hour outage. China region.
February 2026. A second outage. Engineers let Amazon Q Developer resolve a production issue without intervention. Same pattern. Higher stakes.
Amazon's internal briefing note described "novel GenAI usage" with best practices and safeguards not yet established, and high blast radius as a recurring characteristic.
Here's what actually happened technically. The agent inherited a senior engineer's permissions and acted like one. Except it didn't hesitate. There was no harness, no bounded scope, no deterministic guardrails, no approval gate for destructive operations. The model ran the system. The system didn't run the model.
Amazon built the agent. They forgot to build the harness. The missing harness took their storefront down for six hours.
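To make the missing piece concrete: here's a minimal sketch of the kind of gate that wasn't there. The action names and the approval channel are mine, not Amazon's; the point is simply that a destructive operation never runs on the model's say-so alone.

```python
# Minimal sketch of an approval gate for agent-proposed operations.
# Action names and the approval mechanism are illustrative placeholders,
# not anyone's real API.

DESTRUCTIVE = {"delete_environment", "drop_table", "terminate_instances"}

def requires_human_approval(action: str) -> bool:
    """Deterministic rule: the model never decides this for itself."""
    return action in DESTRUCTIVE

def execute(action: str, params: dict, approved_by: str | None = None):
    if requires_human_approval(action) and approved_by is None:
        raise PermissionError(
            f"{action} is destructive: blocked until a named human approves it"
        )
    print(f"running {action} with {params} (approved_by={approved_by})")

# The agent can propose anything; the harness decides what actually runs.
proposal = {"action": "delete_environment", "params": {"region": "cn-north-1"}}

try:
    execute(proposal["action"], proposal["params"])   # blocked
except PermissionError as e:
    print(e)

execute(proposal["action"], proposal["params"], approved_by="on-call SRE")  # allowed
```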
The pattern: ship the capability, mandate adoption, discover the failure in production, add the guardrail. In every post-incident review, the framing shifts toward operator error. The tool is never the problem. The person who used it is.
Same cycle Beck warned about in 1999. Machine speed. Larger blast radius.
Stripe's answer: the model doesn't run the system
Stripe didn't wait for AI to care about feedback loops.
That infrastructure wasn't built for AI. It was built because Stripe believed what Kent Beck wrote: that feedback is the only treatment for the occupational hazard of optimism in engineering. They built it at a scale Beck couldn't have imagined in 1999. And when AI arrived, it plugged straight in.
When AI arrived at Stripe, they didn't scramble to add governance. They already had it.
Their AI agent system, Minions, is built on what they call blueprints: orchestration flows that alternate between fixed, deterministic code nodes and open-ended agent loops. As Stripe put it on their own engineering blog: "putting LLMs into contained boxes compounds into system-wide reliability upside." The model does not run the system. The system runs the model.
This is harness engineering. The agent operates within a defined scope, gets a maximum of two CI rounds, terminates at a pull request, and cannot take destructive actions without explicit gates. Engineers can still intervene, but the agent produces the whole branch without hand-holding.
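Here's a rough sketch of what "the system runs the model" can look like in code, under my own assumptions rather than Stripe's actual blueprint API. The stubs (run_agent, run_ci, open_pull_request) stand in for whatever model call, CI trigger and VCS integration you use; the control flow is the part that matters.

```python
# Sketch of a blueprint-style harness: deterministic nodes wrapped around a
# bounded agent loop. The stubs below are hypothetical; none of this is
# Stripe's API.
from dataclasses import dataclass, field

MAX_CI_ROUNDS = 2   # the agent gets two attempts to go green, then the run stops

@dataclass
class CIResult:
    passed: bool
    failures: list[str] = field(default_factory=list)

def run_agent(task, branch, feedback):        # hypothetical: ask the model for a patch
    print(f"agent working on '{task}' in {branch}, feedback={feedback}")

def run_ci(branch) -> CIResult:               # hypothetical: the tests decide, not the model
    return CIResult(passed=True)

def open_pull_request(branch) -> str:         # hypothetical: the run ends at a PR, never a deploy
    return f"PR for {branch}"

def run_task(task: str) -> str | None:
    branch = f"agent/{task.replace(' ', '-')}" # deterministic node: fixed scope for the agent
    feedback = None
    for _ in range(MAX_CI_ROUNDS):
        run_agent(task, branch, feedback)      # open-ended agent loop, confined to the branch
        result = run_ci(branch)                # deterministic node
        if result.passed:
            return open_pull_request(branch)
        feedback = result.failures             # bounded retry: failures go back to the agent
    return None                                 # out of attempts: a human takes over

print(run_task("fix flaky payout test"))
```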
The result: Stripe engineers are merging 1,300 pull requests every week with zero human-written code, on a codebase with hundreds of millions of lines, handling over $1 trillion in annual payment volume.
Not because their AI is smarter. Because their harness is tighter.
AI reliability scales with the quality of its constraints, not the size of the model. Most teams are learning this the hard way. Stripe learned it before they needed to.
And when something doesn't meet the bar, they remove it. Even features users love. Because a feature built on a weak foundation isn't a feature. It's debt with a good demo.
Rahul Patil, then CTO of Stripe and now CTO of Anthropic, speaking about Stripe's reliability culture and the trust it has to maintain with payment partners and the financial infrastructure it runs, said something that has stayed with me. Reliability is a mindset, not a metric. You don't build it when you need it. You build it before you know you'll need it.
The teams winning with AI didn't change what they valued when AI arrived. They automated what they already believed.
What this looks like when you're not Stripe
I was building a critical frontend layer at Medibuddy. The first thing every user touches. The thing that gets blamed when anything feels slow, broken, or wrong, even when the problem is somewhere else entirely.
We were preparing for a critical event. Load testing time.
My team wanted to celebrate that it held at 3X load.
I wanted to know where it breaks at 10X.
Here's why that matters. At 3X, response times look acceptable. At 10X, they degrade, and they don't degrade equally. The user on a high-end phone with broadband barely notices. The user on a low-end Android device on a 3G network in a tier-3 city gets the worst of it. In a health platform, that user is often the one who needs the service most.
The breaking point isn't about finding failure for its own sake. It's about knowing exactly where your system starts punishing your most vulnerable users, so you can build a roadmap with real data instead of comfortable assumptions. Without that number, every platform decision is a guess. With it, you know what to fix first and why.
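For anyone who wants to run the same exercise, here's a rough sketch of a stepped load test, not the actual MediBuddy setup. The endpoint, the latency budget and the step sizes are placeholders; a real version would also split results by device and network profile, which is where the most vulnerable users show up.

```python
# Sketch of a stepped load test that looks for the breaking point rather than
# stopping at the load you hope to pass. URL, budget and steps are placeholders.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "https://example.com/health"   # hypothetical endpoint
P95_BUDGET_MS = 800                  # hypothetical latency budget

def one_request(_):
    start = time.perf_counter()
    try:
        urllib.request.urlopen(URL, timeout=10).read()
    except Exception:
        return None                  # errors count as failures, not latencies
    return (time.perf_counter() - start) * 1000

def p95(samples):
    ok = sorted(s for s in samples if s is not None)
    if not ok:
        return float("inf")
    return ok[min(len(ok) - 1, int(len(ok) * 0.95))]

for multiplier in (1, 3, 5, 10):     # don't stop at 3X: keep going until it breaks
    concurrency = 50 * multiplier    # 50 = hypothetical baseline concurrency
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(one_request, range(concurrency * 4)))
    value = p95(latencies)
    print(f"{multiplier}X -> p95 {value:.0f} ms")
    if value > P95_BUDGET_MS:
        print(f"breaking point: budget exceeded at {multiplier}X")
        break
```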
My team called me a borderline psycho.
I didn't have a name for what I was doing. I just knew that celebrating 3X without knowing where 10X breaks is guesswork dressed as confidence.
Stripe calls it practicing your worst day every day.
I was doing it at Medibuddy by instinct, without knowing it had a name, without the cultural backing, while my team pushed back.
The principle doesn't require Stripe's infrastructure. It requires the decision to care about the foundation before the incident tells you to.
If your team has ever called you difficult for asking the uncomfortable question, you weren't being difficult. You were doing the job nobody celebrates until the system breaks without it.
The thing most teams are missing
Evals are test cases. Skills files are documentation. Agent loops are CI pipelines.
Nobody wants to hear this because it means the AI transformation project is actually a culture and discipline project wearing a technology hat.
If your team couldn't write tests before AI, they can't write evals now. If they didn't write documentation before AI, skills files will be ignored. If they didn't build feedback loops before AI, the agent loop will generate failures faster than anyone can review them.
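If "evals are test cases" sounds abstract, this is roughly all it means in practice. A minimal sketch, assuming a call_model wrapper around whatever model or agent you run; the assertions are the part your team either already knows how to write, or doesn't.

```python
# Sketch of an eval written as an ordinary test case (pytest style).
# call_model is a hypothetical wrapper around whatever model/agent you use;
# the structure is the same one your unit tests already have.
import pytest

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your model API")

CASES = [
    # (input prompt, a property the output must satisfy)
    ("Summarise this refund policy in one sentence.", lambda out: len(out.split(".")) <= 2),
    ("Return the user's plan name as JSON.",          lambda out: out.strip().startswith("{")),
]

@pytest.mark.parametrize("prompt,check", CASES)
def test_model_output(prompt, check):
    out = call_model(prompt)
    assert check(out), f"eval failed for: {prompt!r}"
```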
The model is not the risk. The system around the model is the risk. Most teams are buying models and skipping systems.
This is where the headcount conversation becomes dangerous.
This reminds me of a conversation I had with a senior leader at a previous company. Half-joking, half-serious, they looked at me and said: "Since you are already using AI, leveraging it and delivering faster, you can probably cut the team by 50% and still deliver the same output, right?"
It's the kind of comment that sounds like a compliment. It isn't.
It assumes AI is a headcount equation. Pick up the tool, drop the headcount. Nobody asked what the tool runs on.
My answer: same team, same timeline. But 50% better quality, maybe 100%. That is what AI actually unlocks when the foundation is already there.
Amazon had 21,000 agents and no harness. The agents found every gap in the system. Stripe had the harness first. The agents plugged into it cleanly.
AI didn't create the gaps. Speed found them. AI just made the finding public.
Whether it's my friend's board meeting or yours
Two numbers. That's what actually matters to bring.
Change failure rate before and after AI tools. If it's rising, you don't have a quality contract yet. You have an adoption OKR.
Time for a regression to surface. How long between a broken deploy and someone knowing about it? If that number is measured in days rather than minutes, your harness isn't built.
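Neither number needs a platform. A minimal sketch, assuming you can export deploy and incident records with timestamps (the field names here are made up); run it on the months before and after the AI rollout and compare.

```python
# Sketch: the two numbers from plain deploy/incident records.
# Field names and how incidents link to deploys are illustrative; use whatever
# your CI and incident tooling actually export.
from datetime import datetime, timedelta

deploys = [
    {"id": "d1", "at": datetime(2026, 3, 1, 10), "caused_incident": False},
    {"id": "d2", "at": datetime(2026, 3, 2, 15), "caused_incident": True},
    {"id": "d3", "at": datetime(2026, 3, 3, 9),  "caused_incident": False},
]
incidents = [
    {"deploy_id": "d2", "detected_at": datetime(2026, 3, 4, 11)},
]

# 1. Change failure rate: share of deploys that needed remediation.
cfr = sum(d["caused_incident"] for d in deploys) / len(deploys)

# 2. Time for a regression to surface: broken deploy -> someone knows.
deploy_times = {d["id"]: d["at"] for d in deploys}
lags = [i["detected_at"] - deploy_times[i["deploy_id"]] for i in incidents]
mean_lag = sum(lags, timedelta()) / len(lags)

print(f"change failure rate: {cfr:.0%}")                 # 33%
print(f"mean time to detect a regression: {mean_lag}")   # 1 day, 20:00:00
```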
If you don't have those numbers, that's the answer. Not about AI. About whether your foundation exists at all.
But here's what the numbers won't tell you. Numbers are a lagging signal. The culture that produces them is the leading one.
Amazon's engineers knew. 1,500 of them said so in writing. The culture didn't hear them because it was optimising for a different signal. Adoption rate, velocity, the 4.5x slide.
The engineering leaders who will navigate this decade aren't the ones who adopt AI fastest. They're the ones who build teams where an engineer can raise a concern without being dismissed. Where a slow test suite is treated as a system problem, not a productivity complaint. Where maintaining something well is as celebrated as shipping something new.
Speed without a culture of ownership, feedback and accountability doesn't compound. It just breaks faster.
Build the harness. Build the culture that maintains it. Then bring the number.
The boring engineering you did before AI arrived? That's the moat now. Stripe proved it. Amazon proved it differently.
In Part 1, AI Agents Don't Fail at Code. They Fail at Learning, I wrote about how AI agents fail not at writing code but at maintaining it — and how I realised I had never measured maintainability precisely either, for AI or for my own team.
In Part 3, I'll write about what happened when I tried to build with AI myself. Burned $100. Blamed the model. Took a break to move out of FOMO and anxiety. Came back with one question nobody is asking: if AI mimics the person in front of it — what happens when that person has nothing left to teach it?
Further reading
- Amazon Kiro AI AWS Outages — Timeline of Amazon's AI mandate and resulting incidents
- Amazon AI Code Review Outages and Senior Approval — The internal petition and what followed
- Amazon.com March 2026 outage — Six hours of checkout failure
- Stripe Engineering: Minions — How Stripe's one-shot coding agents work
- Stripe's engineering culture (Pragmatic Engineer) — The 6B tests/day infrastructure
- Stripe Sessions 2024 — Building a culture of system reliability — Rahul Patil on reliability as a mindset
- Extreme Programming Explained — Kent Beck, 1999



