The demos are gorgeous. An AI agent receives a vague task, spawns sub-agents, calls a dozen APIs, recovers from errors gracefully, and delivers exactly what you needed. Watching it work feels like the future.
Then you actually try to ship it.
Three weeks later you're debugging why the agent decided to delete a customer record instead of updating it, why it's in an infinite retry loop at 3 AM burning $400 in API costs, and why it confidently told a user their order would arrive "tomorrow" when there's no order at all.
I've spent the better part of the last year building AI-powered systems for real internal tooling — not prototypes, not demos. And here's what I've learned: agents are almost never the right answer. Tools are.
The Agent Fantasy
The appeal of agents is obvious. You describe what you want in plain English, and the system figures out the rest. No brittle pipelines, no hand-written logic for every edge case. The model is the orchestrator.
But that's also the problem. LLMs are phenomenal at generating plausible-sounding text. They're much worse at reliably executing multi-step processes where every step needs to be correct. You're essentially betting your production system on the model making a series of good decisions in sequence — and errors compound.
Miss one detail in step 3 of a 10-step chain? The remaining 7 steps might execute flawlessly... against the wrong data.
What Actually Works
Tools — discrete, single-purpose functions that the model can call — are a different story.
The difference feels subtle but it's not. A tool does one thing: search this database, send this email, parse this date string. The model decides which tool to call and what to pass it, but the tool itself is deterministic. You own the logic. You can test it. You can audit it.
Agents delegate control. Tools keep control with you.
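To make that concrete, here's a minimal sketch of the pattern (the tool and its dispatch table are illustrative, not from any particular framework): the model's only job is to pick a tool name and arguments; the function itself is deterministic code you can unit-test.

```python
from datetime import datetime

# A "tool" is just a deterministic, testable function. The model picks
# the tool name and arguments; the logic stays in code you own.
def parse_date_string(raw: str) -> str:
    """Parse a date string into ISO format, or raise on garbage input."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {raw!r}")

# Dispatch table: the model returns a tool name + args, and we look it
# up here. An unknown name is an error, never an improvisation.
TOOLS = {"parse_date_string": parse_date_string}

def execute_tool_call(name: str, args: dict):
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**args)

print(execute_tool_call("parse_date_string", {"raw": "March 5, 2026"}))
# -> 2026-03-05
```

The key property: every line of this is testable without a model in the loop.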
When I rebuilt a customer support system last fall, the first version was agent-based. A single system prompt told the model to "handle customer inquiries" and gave it access to a CRM, ticketing system, and email API. It worked beautifully in testing. In production, it occasionally decided to proactively close tickets that were still open, apologize for issues that didn't exist, and once — memorably — escalated a routine shipping question to the VP of Operations.
Version two gave the model exactly three tools: get_order_status, create_ticket, and send_canned_response. The model picks which one fits. Humans handle everything else. It's been running for four months without a single incident.
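In sketch form, version two looks something like this. The three tool names come from the system above; the bodies here are stand-ins for the real CRM, ticketing, and email calls:

```python
# Stand-in implementations for the three whitelisted tools.
def get_order_status(order_id: str) -> str:
    return f"order {order_id}: shipped"        # real version: CRM lookup

def create_ticket(subject: str) -> str:
    return f"ticket created: {subject}"        # real version: ticketing API

def send_canned_response(template_id: str) -> str:
    return f"sent template {template_id}"      # real version: email API

ALLOWED = {
    "get_order_status": get_order_status,
    "create_ticket": create_ticket,
    "send_canned_response": send_canned_response,
}

def handle(tool_choice: dict) -> str:
    """The model only gets to pick from the whitelist; anything else
    falls through to a human instead of the model improvising."""
    fn = ALLOWED.get(tool_choice.get("tool"))
    if fn is None:
        return "escalated to human queue"
    return fn(**tool_choice.get("args", {}))

print(handle({"tool": "get_order_status", "args": {"order_id": "A-17"}}))
print(handle({"tool": "close_all_tickets", "args": {}}))  # not whitelisted
```

The whitelist plus human fallback is the whole safety story: the model can't take an action that isn't on the list.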
The Benchmark Problem
Part of why agents get oversold is how they're evaluated. Benchmarks show an agent completing complex multi-step tasks with impressive accuracy. 85% on WebArena! 70% on SWE-bench!
Those numbers sound great until you do the arithmetic: 85% accuracy means 15 out of every 100 tasks go wrong — roughly 1 in 7. In a benchmark, a wrong answer is a data point. In production, a wrong action might mean a charge to a customer's card, a deleted file, or a sent email you can't unsend.
The failure modes in agentic systems aren't just wrong answers — they're wrong actions. That's a completely different risk profile, and most teams aren't thinking carefully enough about it.
Where Agents Actually Make Sense
I'm not saying agents are useless. I use them constantly. Just not in production user-facing systems.
For internal research tasks? Great. An agent that pulls competitor data, summarizes analyst reports, and drafts a briefing doc — where a human reviews the output before anything happens — is genuinely useful. The human is the last gate.
For development workflows? Also solid. Cursor, Copilot, whatever you're using — these are effectively agents operating in low-stakes environments where you review every suggestion before it runs. The cost of a bad suggestion is a few seconds of your time.
The pattern that works: agents for drafts, tools for actions.
Generate content with agents. Execute decisions with tools. Never let an agent take an irreversible action without a human checkpoint.
The Framework Trap
LangChain, AutoGen, CrewAI, LlamaIndex — there are now roughly 40 frameworks promising to make it easy to build autonomous AI systems. Most of them make it very easy to build impressive demos.
Production is where the complexity hits. These frameworks abstract away a lot of the underlying mechanics, which sounds helpful until you're debugging why your agent is making weird decisions at step 8 of a complex pipeline. The abstraction that saved you two hours in setup costs you two days in debugging.
I've mostly moved to building directly on model APIs with minimal framework scaffolding. More boilerplate upfront, but I actually understand what's happening. When something breaks — and it will — I can find it.
What I'd Tell My Past Self
Start with a simple decision tree. Seriously. Map out what your system needs to do, identify where natural language understanding adds real value (parsing intent, extracting information, generating responses), and keep everything else as boring, testable code.
Add model calls only where they genuinely improve on a deterministic approach. Don't reach for an agent because it feels more impressive or more "AI-native." Users don't care about your architecture. They care whether the thing works.
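A sketch of what "decision tree first" means in practice. The one model-shaped step is intent classification — stubbed here with keyword matching, where a real system would make a single LLM call — and everything downstream is plain, testable branching:

```python
# Stub for the one step where a model genuinely helps: turning free
# text into an intent label. In production this would be an LLM call.
def classify_intent(message: str) -> str:
    text = message.lower()
    if "refund" in text:
        return "refund"
    if "order" in text or "shipping" in text:
        return "order_status"
    return "other"

# Everything downstream of the intent is boring, deterministic code.
def route(message: str) -> str:
    intent = classify_intent(message)
    if intent == "order_status":
        return "lookup order, reply with status"
    if intent == "refund":
        return "create ticket, notify billing"
    return "escalate to human"

print(route("Where is my order?"))
print(route("I demand a refund"))
```

Swapping the stub for a real model call changes one function; the rest of the tree stays auditable.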
Build in reversibility. Any action your system can take should have an undo path or a confirmation step. If it can't be undone, require human approval. Full stop.
Log everything. Model inputs, outputs, tool calls, costs. You'll need it when something goes wrong.
The future is probably more agentic than what I'm describing. Models are getting better at multi-step reasoning, and reliability is improving. Maybe in two years the calculus changes.
But right now, in March 2026, if you're shipping production AI systems that real users depend on: keep it simple, keep humans in the loop, and build tools not agents.
The demos can wait.