Or, why AI systems fail in surprisingly familiar ways.
With the introduction of AI-assisted coding tools and agents, many people hoped we’d solve all the problems of human teams. Without the people, politics, and personalities, software delivery would be easy.
This vision is appealing because we’ve long blamed people for the failures in the system. But the evidence from multi-agent AI provides an awkward truth. The same classes of failure still happen in the same places, for reasons that have nothing to do with humanity.
So, why do large batches of work still fail once you’ve eliminated laziness and stupidity?
Project Village
Traditional thinking on software projects is like a strange village. It has been cut off from the rest of the world for thousands of years and has developed some interesting beliefs about the laws of nature. They carpet their walls, which makes it hard to vacuum. They make their roofs out of glass and struggle with indoor temperatures. And they make clothes from hair, because nobody is allergic to their own hair.
Perhaps the strangest thing about the village is that they believe gravity is an act of human willpower.
All the animals in the valley are birds, which seem to cling to the branches of bushes or else glide away into the sky. The village also has no fruit trees, which means nobody has ever had a Newtonian epiphany. Without evidence to the contrary, the collective delusion has persisted for many generations.
If you visited the village, you’d hear such philosophical musings as “The earth keeps those who keep themselves,” and “I will, therefore I stand.” Every Wednesday they gather in the town hall to remind each other that “If you stand firm in mind, the ground will rise to meet you.”
To tackle the problem of carpet cleaning, the village invented a cast iron humanoid robot that vacuum-cleans the walls. To stop the robot floating away, they built suction devices into its heels. One day, a robot was being moved between houses when its power failed.
To their amazement, the robot didn’t float away. The village philosophers gathered to discuss this strange event. The only logical conclusion they could reach was that the robot had developed human willpower, which allowed it to remain earthbound. How else could it stay on the ground?
It’s Not Willpower, It’s Gravity
Outside of the village, we all know that gravity works on robots, just like it works on people, birds, and rocks. That doesn’t prevent us from making a similar mistake when it comes to building software.
Large software projects frequently fail, and the people in charge blame human factors. Blaming people for these failures because they’re slow, lazy, dim-witted, and require rest is like the village believing gravity is a manifestation of willpower.
The reality is that software delivery has a form of gravity that increases with batch size. This is why swarms of AI agents fail in all the same ways teams do when you give them large and complex tasks.
In The Organizational Physics of Multi-Agent AI, Jeremy McEntire describes an experiment in which different arrangements of AI agents built the same multi-service backend system. The common belief is that co-ordinating a team of AI agents lets AI handle more complex tasks. In the experiment, the cost of co-ordination outweighed the benefit of dividing the work between multiple agents.
The failures of multi-agent work look familiar. They are the same as the failures of large projects, except there are no humans to blame. This strongly suggests the challenges are inherent to complex work and cannot be attributed to human factors alone.
It strikes me that we have been blaming human factors for bad systems for too long and we need to acknowledge they’re not the reason for past IT system failures.
A Babbling Equilibrium
“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean, neither more nor less.” (Lewis Carroll, Through the Looking-Glass)
When I say “rock”, some of you will think of geology (from the medieval Latin rocca) while others will think of music (from the Old English roccian). A select few may think of a seaside hard-sugar treat, or a tool used to hold unspun fibers.
For something as simple as a word, I can narrow your interpretation by putting the word into a sentence. When more complex communication takes place, it’s rare to have perfect alignment on the meaning we intend to communicate. Precision falls away from perfect alignment and, at worst, degrades to a babbling equilibrium, where the message carries no information at all.
This misalignment can be understood through the pre-DevOps problem of having a development team measured for throughput and an ops team measured for reliability. When you communicate the same information to these two silos, interpretation will vary drastically.
This challenge in transferring information remains when a human instructs an AI agent to do work. Hence the counter-arguments of “you’re prompting it wrong.” What surprises more people is that it remains when AI agents send messages to other AI agents. This is why the multi-agent configurations performed worse than single-agent setups.
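One way to picture why extra handoffs hurt is a toy model in which every handoff preserves only a fraction of the original intent. This is a rough sketch and not McEntire’s model: the function name `intent_preserved` and the 0.9 fidelity figure are assumptions chosen purely for illustration.

```python
def intent_preserved(fidelity_per_handoff: float, handoffs: int) -> float:
    """Fraction of the original intent left after a chain of handoffs.

    Assumes (hypothetically) that each handoff independently keeps the
    same fraction of meaning, so the losses multiply together.
    """
    if not 0.0 <= fidelity_per_handoff <= 1.0:
        raise ValueError("fidelity_per_handoff must be between 0 and 1")
    if handoffs < 0:
        raise ValueError("handoffs must not be negative")
    return fidelity_per_handoff ** handoffs


if __name__ == "__main__":
    # 0.9 is an illustrative figure, not a measured one.
    for handoffs in (0, 1, 3, 5, 10):
        print(handoffs, round(intent_preserved(0.9, handoffs), 3))
```

Even with generous per-handoff fidelity, the preserved fraction shrinks with every extra hop, which is the intuition behind keeping chains of agents, or teams, short.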
When you understand why agent swarms compound the problem rather than solving it, the answer starts to look familiar. These aren’t new problems. Long before AI agents, human development teams were failing for exactly the same structural reasons. The teams that recovered did so by tackling the structure, not the people.
Fixing the System of Work
In my past roles, I was often asked to join a company to “fix the development team.” They had reached a stage where every deployment turned into a high-severity incident, every change resulted in highly visible bugs, and the business had lost faith in the ability of the team to deliver working software.
The teams usually had basic tools in place, like version control and automated builds. What was missing was the rest of the deployment pipeline, and this was the root cause of all the problems. Here’s how I’d fix it.
Deployment automation made production releases repeatable and reliable, removing the most painful and wide-reaching failures of these teams. This led to more frequent deployments, which in turn reduced the size of work batches. Breaking work into smaller steps is a good way to improve communication success.
Test automation increased our confidence that the software was deployable. It wasn’t uncommon to find teams with no automated tests, so adding characterizing tests around the most important features reduced the number of embarrassing software versions the team produced, such as the kind where nobody can even sign in.
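A characterizing test pins down what the software does today, before anyone decides what it should do. The sketch below is a minimal illustration: `sign_in` is a hypothetical stand-in for an existing, untested legacy function, and the user data is made up.

```python
# Hypothetical legacy function: stands in for existing, untested behaviour.
def sign_in(users: dict[str, str], username: str, password: str) -> str:
    stored = users.get(username.strip().lower())
    if stored is None:
        return "unknown-user"
    return "ok" if stored == password else "bad-password"


def test_sign_in_characterization() -> None:
    """Record current behaviour so later changes can't silently break it."""
    users = {"ada": "s3cret"}
    assert sign_in(users, "ada", "s3cret") == "ok"
    # Observed quirk: usernames are trimmed and lower-cased before lookup.
    assert sign_in(users, "  ADA ", "s3cret") == "ok"
    assert sign_in(users, "ada", "wrong") == "bad-password"
    assert sign_in(users, "bob", "s3cret") == "unknown-user"


if __name__ == "__main__":
    test_sign_in_characterization()
    print("sign-in behaviour unchanged")
```

Run on every build, a handful of tests like this is often enough to catch the “nobody can sign in” class of release before it ships.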
Monitoring and alerting helped the team understand how the system ran in production. As we fine-tuned the tools, the team became the first people to know when there was a problem, or the early signs a problem was emerging. By prioritizing work that kept the software healthy, we improved the relationship with the business, including executives who dealt with complaint escalations, and we made customers happier with the software.
The result of these three changes isn’t just improvements to deployments, software quality, and observability. Having a complete pipeline lowers your batch size. This keeps complexity low and solves the communication and co-ordination problems that become insurmountable in large batches of work. These are the kinds of problems that make it impossible for AI agents to deliver working software, just as they prevented teams from doing it.
Batches Have Gravity
There are fundamental laws of software delivery that give batches a gravitational force, one that neither human teams nor AI agents can manage once it grows large enough. The larger the batches, the heavier everything gets, increasing the amount of energy needed to move things.
Rather than searching for bigger tractors to move giant objects, organizations with modern software engineering practices make everything fit in a lightweight backpack. Small batches are the secret sauce behind Continuous Delivery, and the DORA research increases our confidence in this approach.
Just as Fred Brooks observed in “The Mythical Man-Month”, adding people to a late software project makes it later. McEntire’s research suggests this applies equally to situations where you simply increase the number of AI agents tackling the work.
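Brooks’s back-of-the-envelope argument is that when every worker has to co-ordinate with every other worker, the number of communication paths grows as n(n-1)/2. The sketch below simply evaluates that formula. Applying it to AI agents is an illustrative assumption rather than McEntire’s measured result, and the function name is made up.

```python
def communication_channels(workers: int) -> int:
    """Pairwise communication paths among `workers` people or agents."""
    if workers < 0:
        raise ValueError("workers must not be negative")
    # Every pair of workers is one potential channel: n(n-1)/2.
    return workers * (workers - 1) // 2


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:>2} workers -> {communication_channels(n):>3} channels")
```

Doubling the workers roughly quadruples the channels, so the co-ordination overhead grows much faster than the capacity you add.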
Continuous Delivery remains the best way to deliver software, no matter who or what is writing the code.