
Xuan Li

Posted on • Originally published at komos.ai

Why most AI agents fail in production

Most AI agents work in demos. They break in production.

MIT's 2025 State of AI in Business report (Fortune summary) found that 95 percent of enterprise AI pilots show no measurable business return. The same study found something more telling. Pilots built by buying tools from a vendor succeed about 67 percent of the time. Pilots built internally succeed at about a third of that rate, roughly one in five.

That gap matches what we see every week. A team builds something that works on a sample of five cases. The first week of real volume, it falls apart.

The failures are not random. After watching this happen across recruiting, healthcare back-office, legal review, financial reconciliation, supply chain, and more, two patterns explain almost all of them.

Reliability and governance

Most prototypes are built to demo. Demos ask one question. Can the agent finish the task once?

Production asks different questions. Can it finish the task one thousand times in a row, on slightly different inputs, while no one is watching? When something goes wrong, can someone reconstruct what the agent saw, decided, and did? Can a human approve consequential steps before they happen, not after? When the auditor asks what we ran last quarter, is there a paper trail?

Most prototypes do not answer any of these. They were not built to.

AI agents make non-deterministic decisions. Without a step-by-step audit log, you cannot debug a bad run. Without explicit approval gates on actions that cannot be undone, the first time the agent runs against real data is the first time anyone notices the difference between reading a file and wiring money. Without access controls, secrets management, and retention policies, you do not meet the bar that any real enterprise asks for.
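A minimal sketch of what these two pieces might look like. The names here (`AuditLog`, `execute_step`, the `IRREVERSIBLE` set) are illustrative, not from any specific framework; the point is that every step is recorded and anything that cannot be undone is gated behind a human decision:

```python
import time
import uuid

class AuditLog:
    """Append-only log of what the agent saw, decided, and did."""
    def __init__(self):
        self.entries = []

    def record(self, run_id, step, payload):
        self.entries.append({
            "run_id": run_id,
            "step": step,
            "timestamp": time.time(),
            "payload": payload,
        })

# Actions that cannot be rolled back get an approval gate (illustrative set).
IRREVERSIBLE = {"wire_transfer", "delete_record", "send_email"}

def execute_step(run_id, action, args, log, approve):
    """Log every step; gate irreversible actions behind human approval."""
    log.record(run_id, "decided", {"action": action, "args": args})
    if action in IRREVERSIBLE and not approve(action, args):
        log.record(run_id, "blocked", {"action": action})
        return {"status": "blocked"}
    # ... actually perform the action here ...
    log.record(run_id, "executed", {"action": action})
    return {"status": "ok"}

# Usage: a denied approval still leaves a reconstructable trail.
log = AuditLog()
run_id = str(uuid.uuid4())
result = execute_step(run_id, "wire_transfer", {"amount": 500},
                      log, approve=lambda action, args: False)
```

The log makes a bad run debuggable after the fact; the gate makes the difference between reading a file and wiring money explicit before the fact.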

When something has to give, reliability has to win. No one consciously chooses fast and broken. But every prototype has implicit speed defaults. Short timeouts. No retries. No verification step. Those defaults survive into production unless someone redesigns them. A run that takes thirty seconds and is right is always worth more than a run that takes five seconds and is wrong ten percent of the time.
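Redesigning those defaults is not exotic. A sketch of what "slow and right" looks like in code, assuming a non-deterministic `step` (for example, a model call) and an independent `verify` check, both hypothetical names:

```python
import time

def run_with_verification(step, verify, max_attempts=3, backoff_s=1.0):
    """Retry a non-deterministic step and verify its output before
    accepting it. Slower than one best-effort call, but a run that is
    right beats a run that is fast and wrong."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            result = step()
            if verify(result):  # an independent check, not the model's own claim
                return result
            last_error = ValueError(f"verification failed: {result!r}")
        except Exception as e:  # e.g. a timeout from the model call
            last_error = e
        time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    raise RuntimeError("step did not produce a verified result") from last_error

# Usage: a flaky step that only produces a good answer on the second try.
attempts = {"n": 0}
def flaky_step():
    attempts["n"] += 1
    return "ok" if attempts["n"] >= 2 else "garbage"

result = run_with_verification(flaky_step, verify=lambda r: r == "ok",
                               backoff_s=0.01)
```

The prototype version of this function is one line: `result = step()`. Everything else is the reliability work that the demo never needed.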

The teams that succeed treat reliability and governance as the architecture, not as features they bolt on at the end. They are not building "an agent". They are building a small operations system that happens to use a model. The model is the easy part.

Standard, not bespoke

It has never been easier to ship a prototype. A founder, an analyst, a recruiter, anyone can vibe code a workflow with Claude or ChatGPT in a weekend. That is a real productivity win at the prototype stage.

The moment that workflow is supposed to be how the company does something, the bespoke version becomes the problem.

Five people on a team need to do the same thing. They each end up with a slightly different flow. One person's version skips a step. Another adds an unauthorized vendor. A third silently uses an old API. They all work in isolation. None of them is the process.

There is no single source of truth. The flow lives in someone's chat history, or a doc, or a script in a GitHub Gist. When the manager asks how we do something, there are five answers, each one slightly out of date.

When the upstream API or the policy changes, the change has to be made in five places. Usually it gets made in two and the rest drift. A month later, half the team is operating on the old version and no one knows.

You cannot deploy one hundred slightly different copies of a workflow and call that production. Production means one canonical version. Versioned, shared, observable, updated in one place. Vibe coded flows in one hundred chat sessions are the opposite of that.

What to do about it

Treat reliability as a v1 requirement, not a v2 cleanup. Ship audit logs from day one. Build the approval gate before you build the action. Define guardrails for destructive operations before the agent has the credentials to execute them. The cost of doing this up front is hours. The cost of retrofitting after a bad run is months and trust.

Pick speed or reliability for the right work, but pick consciously. Some workflows benefit from a fast best effort agent. Most enterprise workflows do not. If the work touches financial data, customer records, or anything a regulator cares about, default to reliability and accept the latency cost.

Centralize the workflow definition. Treat each automation like code. One canonical version, in a system everyone can see, with version history. Vibe coding is great for prototyping the first version. It is a disaster as the operating version. The transition from one to the other is the work.
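What "treat each automation like code" can mean in practice, as a minimal sketch. The registry and workflow names here are invented for illustration; the real version might be a Git repo, a workflow engine, or a database, but the shape is the same: one canonical definition, published in one place, with history:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorkflowVersion:
    version: int
    steps: tuple        # the ordered, explicit steps — this IS the process
    changelog: str

class WorkflowRegistry:
    """One canonical definition per workflow, with version history,
    instead of five drifting copies in chat logs and gists."""
    def __init__(self):
        self._history = {}

    def publish(self, name, steps, changelog):
        versions = self._history.setdefault(name, [])
        versions.append(WorkflowVersion(len(versions) + 1, tuple(steps), changelog))

    def current(self, name):
        return self._history[name][-1]  # everyone runs the same version

    def history(self, name):
        return list(self._history[name])

# Usage: an upstream API change is made once, and everyone picks it up.
registry = WorkflowRegistry()
registry.publish("vendor-onboarding",
                 ["collect_docs", "call_api_v1", "human_review"], "initial")
registry.publish("vendor-onboarding",
                 ["collect_docs", "call_api_v2", "human_review"],
                 "upstream API moved to v2")
```

When the policy or the API changes, the change happens in one place, and the version history records who changed what and why.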

Make governance someone's job, not no one's. One person owns access controls, audit, and exception handling for the agents you deploy. This is unglamorous and necessary.

Be honest about the production bar. "It worked in the demo" is a fact about the demo, not about the agent. Run the work on real volume, with real data, watch what happens to the failure modes, and fix them. That phase often takes longer than building the agent in the first place. That is normal.

What this really means

The bottleneck is not the model. The current models are more than capable. The bottleneck is the audit trail, the guardrails, the source of truth, the governance, the operational discipline.

That is good news. It means it is all addressable. The teams investing in those layers compound much further than the teams chasing the next benchmark.

If you are building an AI workflow that runs unattended on real work, take reliability seriously before you take speed seriously. Standardize the process before you scale it.

The teams that do are the ones we see succeed.
