Microsoft Almost Lost OpenAI Because of 173 Agents. Here's What Every Engineering Team Should Steal From This Disaster


When the story of how Microsoft nearly torched a trillion dollars of value broke on Habr last week, my Slack lit up. A former Azure Core engineer described walking into his first planning meeting at the Overlake / Azure Boost team and discovering that 122 engineers were seriously planning to port half of Windows onto a fingernail-sized ARM SoC with 4 KB of dual-port FPGA memory. The same organization had 173 management agents running on every Azure node — and not a single human in the entire company could explain why all 173 of them existed.

I want to be clear: we at Gerus-lab are not here to dunk on Microsoft. We ship infrastructure for a living, and we have made our share of dumb calls. But this story is the cleanest, most expensive case study I have read in years on a disease that quietly kills engineering organizations of every size. And the worst part? We see the exact same disease at startups with 8 engineers. The numbers are smaller. The mechanism is identical.

So let me tell you what we actually take away from this — and what we now do differently on every project we touch.

The "173 agents" problem is not a Microsoft problem

It is very easy to read that Habr piece and feel smug. "Lol, Microsoft, of course they have 173 agents nobody understands." That feeling is the trap. Because here is the uncomfortable truth: the moment your codebase crosses a certain complexity threshold, nobody on the team can fully describe what is running in production anymore. Not your senior engineer. Not your CTO. Not the person who wrote the deploy script three years ago and left.

We at Gerus-lab were called in last year to audit a fintech backend in Berlin. The team was 14 people. Their production cluster was running 41 sidecars and background workers across the main service. When we asked why, the answers split into three categories:

"It was there when I joined."
"We added it for compliance, I think? Ask Marek." (Marek had been gone for 18 months.)
"It does something with the queue. Don't touch it, it'll break things."

That is a 14-person company with a 41-agent problem. Microsoft had the same disease on a team roughly nine times that size, with 173 agents instead of 41. And the result was the same: an organization sprinting toward a cliff with the confidence of a drunk LLM, as the Habr author so beautifully put it.

If you want to read more about how we approach these audits, we wrote up our methodology here: How Gerus-lab runs an infrastructure audit.

Why "let's port it" is almost always the wrong answer

The single most damning detail in the Microsoft story is this: when the lead architect was asked whether they really planned to port Windows kernel features (COM, WMI, perf counters, NTFS hooks, ETW) onto a tiny Linux SoC, his answer was "we could at least ask a couple of junior devs to look into it."

Stop. Read that again.

This is the moment every engineering org dies a little. Somebody senior, who should know better, treats an obviously impossible piece of work as a junior task because admitting it is impossible would mean admitting the previous three years of architectural decisions were wrong. The political cost of saying "we built this on the wrong foundation" is so high that it is cheaper, career-wise, to assign two juniors to a death march and hope they fail quietly.

We at Gerus-lab have a rule we stole from aviation incident reports and have used on every backend rewrite we've shipped: the moment two senior engineers independently say "this cannot work," we stop the meeting and test that claim on real hardware that day. Not next sprint. Not after the planning offsite. That day. One afternoon of prototyping is always, always cheaper than six months of polite consensus around an impossible plan.

We wrote about how we apply this principle to AI-pipeline projects in particular here: Why we kill 30% of features before the second sprint.

The real lesson: complexity is a liability, not a moat

Big tech has spent a decade telling us that complexity is a competitive advantage. "Look how many microservices we have. Look at our internal platform. Look at the org chart." The Azure story is the loudest possible counter-argument. Microsoft had infinite money, infinite engineers, and the best kernel talent on the planet — and they still managed to build a node management stack that maxed out at a few dozen VMs on a 400W Xeon, when the underlying hypervisor could do 1024.

That is not a resource problem. That is a complexity tax compounding for years until it ate the actual product.

When we at Gerus-lab build something for a client, we measure ourselves by a metric we call "explainability budget": can a brand-new engineer, on day one, with no help, draw the request flow on a whiteboard in under 15 minutes? If the answer is no, the system is too complex. We delete things until the answer is yes. Sometimes that means killing features the client paid us to build. We do it anyway, and we explain why.

This sounds aggressive. It is. It is also the reason our clients still talk to us three years after the project ends.

Three concrete things we now do differently after reading the Habr piece

The "agent census." On every audit, the first deliverable is a literal spreadsheet of every process, sidecar, daemon, cron job, and lambda touching production. Each row has to answer three questions: what does it do, who owns it, and what breaks if we kill it tomorrow? If we cannot fill in all three columns, the row gets a red flag. At the average mid-stage startup, 40 to 60 percent of rows get flagged on the first pass. That is the 173-agent problem in miniature.
The "junior dev test." Before we approve any architectural decision, we ask: if we handed this plan to a smart junior with no context, would they immediately spot a contradiction? In the Microsoft case, a junior who knew the SoC had 4 KB of FPGA memory and then heard "we will port WMI to it" would have raised their hand. Nobody in the room did. That tells you the room had a culture problem, not a technical problem.
The "kill switch meeting." Every six weeks on our active projects we hold a 60-minute meeting whose only agenda item is: what should we delete? No new features. No roadmap. Just deletion candidates. It is the most useful hour we spend, and it is the one meeting clients try to skip every single time. We do not let them.
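
To make the agent census from the first item concrete, here is a minimal sketch in Python of the three-column check and the red-flag rule. It is an illustration under stated assumptions, not the tooling we actually use: the `CensusRow` class, its field names, and every service name in the sample data are hypothetical, and a real census would be filled in by interviewing the team rather than hard-coded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CensusRow:
    """One census row: a process, sidecar, daemon, cron job, or lambda."""
    name: str
    purpose: Optional[str] = None        # what does it do?
    owner: Optional[str] = None          # who owns it?
    blast_radius: Optional[str] = None   # what breaks if we kill it tomorrow?

    def is_flagged(self) -> bool:
        """Red flag when any of the three columns is missing or blank."""
        answers = (self.purpose, self.owner, self.blast_radius)
        return any(a is None or not a.strip() for a in answers)


def flagged_rows(rows: List[CensusRow]) -> List[CensusRow]:
    """Return the rows that nobody could fully explain."""
    return [r for r in rows if r.is_flagged()]


def flagged_percent(rows: List[CensusRow]) -> float:
    """Share of flagged rows, as a percentage of the whole census."""
    if not rows:
        return 0.0
    return 100.0 * len(flagged_rows(rows)) / len(rows)


# Hypothetical sample data, invented purely for illustration.
census = [
    CensusRow("api-gateway", "routes external requests",
              "platform team", "all inbound traffic stops"),
    CensusRow("queue-janitor", "does something with the queue"),
    CensusRow("compliance-exporter"),
    CensusRow("nightly-report-cron", "emails the daily revenue report",
              "data team", "finance loses its morning report"),
]

for row in flagged_rows(census):
    print(f"RED FLAG: {row.name}")
print(f"{flagged_percent(census):.0f}% of rows flagged")
```

A partially answered row such as `queue-janitor` is flagged just like a fully unknown one. That is deliberate: "it does something with the queue" with no owner is exactly the kind of answer the census exists to surface.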

If this resonates and you want us to come run one of these on your own infrastructure, our contact form is here: Talk to Gerus-lab. We do paid audits, but the first 30-minute call is free and we will tell you upfront if you do not need us.

"But we are too small for this to apply to us"

Every founder we talk to says this. Every single one. And they are wrong in the exact same way.

The complexity disease does not care how big you are. It cares about the ratio of decisions made under political pressure to decisions made under technical pressure. Microsoft was at maybe 95/5, political to technical, by the time the Overlake project happened. We have audited 6-person startups that were already at 70/30. What drives that ratio is not company size but whether your engineers feel safe saying "this is stupid" out loud.

If you want a brutally honest signal of where you are on that spectrum, try this: in your next architecture review, count how many times somebody says "it would be embarrassing to walk this back now." If the count is above zero, you are sicker than you think.

We've collected a few of these self-diagnostic checklists in one place — the Gerus-lab engineering health checklist — and we have given it away to every team that has asked. No email gate, no signup. Just the checklist. Because the cost of the next 173-agent disaster is a lot higher than the cost of teaching people to spot it early.

What we are taking away from all of this

The Microsoft story will get reduced to a meme. "Lol Azure has 173 agents." The actual lesson is much harder and much more useful: the difference between an engineering org that ships and an engineering org that dies is whether anyone in the room is allowed to be the kid pointing at the emperor.

We at Gerus-lab try, every day, to be that kid. We are not always right. We are sometimes annoying. But not once in five years have we walked away from a project where the team thanked us less for killing the wrong feature than for shipping the right one.

If your team is staring at a roadmap that feels heavier every quarter, and you are starting to suspect that nobody can actually explain what half your services do anymore, that is the warning sign. That is the moment to do something about it. Not next quarter. Not after the next funding round. Now.

Come talk to us. The first 30 minutes are free, the audit pays for itself in the first deletion, and we promise we will not try to port Windows onto your FPGA.

→ gerus-lab.com

If this piece resonated, the worst thing you can do is nod and close the tab. The best thing you can do is open your own infrastructure dashboard right now, count the number of running services, and ask yourself how many of them you could explain to a stranger in under a minute. That number is the start of your own 173-agent story.