Everyone's Building AI Agents. Almost None of Them Work.
I spent last weekend doing something mildly depressing: auditing every AI agent our team has shipped over the past eight months.
Out of eleven, two work reliably. Three work sometimes. The rest are either dead, zombie-running in a cron job that nobody checks, or actively doing subtle damage that we only caught because a customer complained.
We are not uniquely bad at this. I've talked to enough engineers at other companies to know this ratio is pretty standard. The dirty secret of the agent gold rush is that most of what gets called an "agent" is a fragile prompt chain held together with hope and try/except blocks.
What We Actually Built vs. What We Thought We Were Building
The pitch for agents is irresistible: autonomous systems that reason, plan, use tools, and get stuff done without hand-holding. And that pitch lands because the demos work. You watch a model spin up a browser, navigate to a site, extract data, write a summary, send an email — and you think: we need ten of these.
So you build them. And they work great. In the demo.
Then they hit production. Where the HTML structure changes. Where rate limits exist. Where an API returns a 503 instead of a 200 and the agent confidently continues anyway, hallucinating the response it expected. Where the task that took 40 tokens in testing balloons to 3,000 because the context got polluted three steps ago and now the model is confused about which task it's even doing.
The gap between "it works in a notebook" and "it works at 2am on a Tuesday when nobody's watching" is enormous. With agents, that gap is a canyon.
The Reliability Problem Is Structural
Here's what I think most teams get wrong: they treat agent reliability as a prompt engineering problem.
It's not. It's a systems problem.
When you chain five LLM calls together, each with 95% reliability, your end-to-end success rate is around 77%. Add three more steps and you're at 66%. This is basic probability, but people keep being surprised by it when their eight-step agent starts failing a third of the time.
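The arithmetic is worth making concrete, because the compounding is what surprises people:

```python
# Per-step reliability compounds multiplicatively across a chain:
# if any step fails, the whole run fails.
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step in a sequential chain succeeds."""
    return per_step ** steps

print(round(end_to_end_success(0.95, 5), 2))  # 0.77
print(round(end_to_end_success(0.95, 8), 2))  # 0.66
```

Five steps at 95% each already loses nearly a quarter of runs; eight steps loses a third. No prompt tweak changes this curve — only fewer steps or per-step error recovery does.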
The fix isn't better prompts. It's the same stuff that makes any distributed system reliable — circuit breakers, retries with backoff, idempotency, explicit state machines, observable failure modes. Basically, the boring infrastructure work that nobody wants to do when they're excited about the cool AI part.
Our two agents that actually work? Both have explicit state machines. Both log every intermediate step to a database, not just the final output. Both have humans in the loop for anything consequential. They're not impressive demos. They're reliable tools.
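As a sketch of what "explicit state machine" means here — the state names and transitions below are invented for illustration, not our actual agents:

```python
import time
from enum import Enum, auto

class State(Enum):
    FETCH = auto()
    EXTRACT = auto()
    REVIEW = auto()   # human-in-the-loop gate for anything consequential
    DONE = auto()
    FAILED = auto()

# Legal transitions are spelled out, so an improvising model can't
# silently skip the human-review gate.
TRANSITIONS = {
    State.FETCH: {State.EXTRACT, State.FAILED},
    State.EXTRACT: {State.REVIEW, State.FAILED},
    State.REVIEW: {State.DONE, State.FAILED},
}

def advance(current: State, proposed: State, log: list) -> State:
    """Move to the next state, recording every intermediate step."""
    if proposed not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {proposed.name}")
    # In production this row goes to a database, not an in-memory list.
    log.append({"ts": time.time(), "from": current.name, "to": proposed.name})
    return proposed
```

The point isn't the ten lines of code — it's that every step the agent takes leaves a row you can query at 2am.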
The Tool Use Problem Is Worse Than People Admit
Most agents fail at the boundary between the LLM and the real world. Tool use.
Not because models are bad at calling tools — they've gotten genuinely good at that. The problem is that real-world tools are messy, stateful, and full of edge cases that nobody documented.
I watched one of our agents spend 20 minutes in a loop because a file upload API returned a 200 with an error message in the JSON body instead of a proper 4xx. The model saw 200, assumed success, tried to proceed, got confused, retried, saw 200 again. Classic success-that-isn't.
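The defense is mechanical: never let the agent see a raw tool response. Validate it first, treating the status code as necessary but not sufficient. The `"error"` envelope key below is an assumption — adapt it to whatever the API you're wrapping actually returns:

```python
class ToolCallError(Exception):
    """Raised when a tool call failed, even if HTTP said 200."""

def check_response(status: int, body: dict) -> dict:
    """Validate a tool response before the agent acts on it."""
    if status >= 400:
        raise ToolCallError(f"HTTP {status}")
    if body.get("error"):  # the 200-that-isn't case
        raise ToolCallError(f"API error inside a 200 body: {body['error']}")
    return body
```

An exception here stops the loop immediately, instead of letting the model retry the same "successful" call for twenty minutes.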
We've also had agents that handle the happy path beautifully and completely fall apart the moment something unexpected happens — a permission error, a missing field, a paginated response where it expected a single object. Models will improvise. Sometimes brilliantly. Often catastrophically.
You have to handle every failure mode explicitly. Which means you need to know what every failure mode is. Which means you need to actually understand the APIs and systems your agent is touching. There's no shortcut where the model just figures it out.
The Context Window Is a Liar
Long context windows were supposed to solve agent memory problems. 128k tokens, 1M tokens — surely the model can just... remember everything?
It can hold everything. That's different from remembering it.
Attention dilutes. Relevant information that was injected early in a long context gets treated differently than the same information at the end. We've had agents that "forgot" their own instructions halfway through a long task because the instruction was buried under thousands of tokens of intermediate output.
The practical fix is aggressive context management — summarize intermediate steps, prune irrelevant history, keep the working context small and focused. Which is, again, tedious systems work rather than magic.
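In the simplest form, that looks something like the sketch below: instructions and recent steps stay verbatim, older history collapses into a summary. The stub string here stands in for a real summarization call:

```python
def prune_context(instructions: str, steps: list, keep_recent: int = 3) -> list:
    """Keep the working context small: instructions + a summary stub
    for old steps + the last few steps verbatim."""
    old, recent = steps[:-keep_recent], steps[-keep_recent:]
    context = [instructions]
    if old:
        # In practice, replace this stub with an actual LLM summary.
        context.append(f"[summary of {len(old)} earlier steps omitted]")
    context.extend(recent)
    return context
```

The instructions never get buried under intermediate output, because they're re-asserted at the top of every pruned context.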
So What Actually Works?
After eight months of this, here's my honest read:
Narrow, well-scoped tasks. Not "research this topic and write a report." More like "extract these five specific fields from this document format and return them as JSON." The more constrained the problem, the more reliable the agent.
Humans at decision points. The best agents I've seen aren't autonomous, they're assisted. They handle the boring parts, surface decisions to humans, get confirmation, continue. This isn't a failure of the vision — it's actually the right design for most use cases.
Observable everything. If you can't see what your agent is doing step by step, you can't debug it, you can't trust it, you can't improve it. Log aggressively. Build dashboards. Treat agent monitoring like production service monitoring, because that's what it is.
Boring infrastructure. State machines. Retry logic. Explicit error handling. Timeouts. The same stuff you'd build for any unreliable external service.
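The boring infrastructure really is boring. A retry wrapper with exponential backoff and jitter — the same one you'd put in front of any flaky external service — is a representative piece of it:

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 4, base_delay: float = 0.5):
    """Retry a flaky tool call with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure, don't hide it
            # 0.5s, 1s, 2s, ... with jitter so parallel retries don't synchronize
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Wrap every tool call in something like this, put a timeout inside `fn`, and a whole class of transient 503s stops looking like agent failures.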
The agents that work aren't the ones with the most impressive capabilities. They're the ones built by engineers who stopped being impressed by the capabilities and started treating reliability as a first-class requirement.
The Hype Will Settle Down
We're in the part of the cycle where everyone's announcing agent products and the bar for what counts as "working" is pretty low. That'll change. Users have short patience for tools that fail mysteriously, and businesses have even shorter patience for autonomous systems that do damage without clear accountability.
The teams that come out ahead won't be the ones who shipped the most agents. They'll be the ones who figured out which problems actually benefit from agent architecture, built them right, and have the receipts to show they work.
That's a smaller set than the current hype suggests. But it's a real one.
Start there.