Over the past year, AI agents have gone from research experiments to one of the hottest topics in tech. Social media is full of demos showing agent...
This is one of the most honest breakdowns of agentic AI I've read. The gap between demo and production is where most teams quit.
I've built two production AI agents that operate across multiple business verticals — one handling business operations, one handling security operations. Everything you described here is real. The planning problem, the tool fragility, the memory architecture, the infinite loops. I've hit all of it.
A few things I learned the hard way:
Tool-level permissions solve the reliability problem better than prompt engineering ever will. Every tool gets explicit read/write/execute scopes per user. The agent physically cannot perform an action the user's tier doesn't allow. That eliminates an entire class of failures.
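A minimal sketch of what per-user, per-tool scopes might look like. This is an illustration only, not the commenter's implementation: `Scope`, `Tool`, `User`, `authorize`, and `call_tool` are hypothetical names, and a real system would load grants from a policy store rather than an in-memory dict.

```python
from dataclasses import dataclass, field
from enum import Flag, auto
from typing import Callable


class Scope(Flag):
    """Hypothetical permission bits; combine with | (e.g. READ | WRITE)."""
    NONE = 0
    READ = auto()
    WRITE = auto()
    EXECUTE = auto()


@dataclass
class Tool:
    name: str
    required: Scope  # every scope listed here must be held by the caller


@dataclass
class User:
    name: str
    # Grants keyed by tool name; a missing key means no access at all.
    grants: dict[str, Scope] = field(default_factory=dict)


def authorize(user: User, tool: Tool) -> bool:
    """Return True only if the user holds every scope the tool requires."""
    held = user.grants.get(tool.name, Scope.NONE)
    return (held & tool.required) == tool.required


def call_tool(user: User, tool: Tool, action: Callable[[], object]) -> object:
    """Run the tool only after the check passes; the model never decides this."""
    if not authorize(user, tool):
        raise PermissionError(f"{user.name} lacks {tool.required} on {tool.name}")
    return action()
```

The check sits in the tool dispatcher rather than in the prompt, so a bad plan from the model fails with a `PermissionError` instead of performing the action.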
Human approval gates on destructive operations are non-negotiable. The agent can plan, recommend, and stage anything — but delete, send, or deploy requires a human confirmation. This one design decision prevented every "duplicate email blast" scenario you described.
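One way such a gate could be sketched, assuming the destructive verbs are the three named above. `StagedAction`, `ApprovalRequired`, and `execute` are hypothetical names for illustration; how the approval flag gets set (a UI, a chat confirmation, a ticket) is left out.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: these verbs are treated as irreversible and need a human.
DESTRUCTIVE_VERBS = {"delete", "send", "deploy"}


class ApprovalRequired(Exception):
    """Raised when a destructive action is executed without confirmation."""


@dataclass
class StagedAction:
    verb: str
    target: str
    approved: bool = False  # flipped to True only by a human confirmation


def execute(action: StagedAction, run: Callable[[StagedAction], object]) -> object:
    """Run the action, refusing destructive verbs that were never approved."""
    if action.verb in DESTRUCTIVE_VERBS and not action.approved:
        raise ApprovalRequired(f"{action.verb} {action.target} needs confirmation")
    return run(action)
```

The agent can create and stage as many `StagedAction` objects as it likes; only `execute` touches the outside world, and it refuses unapproved destructive verbs.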
The memory problem is real but overstated. Most agents don't need to remember everything. They need to remember the right things at the right time. Scoped context per task with a retrieval layer for historical data beats stuffing the full history into the prompt window every time.
The teams that win won't have the smartest models. They'll have the strictest guardrails.
This is such a valuable perspective. Thank you for taking the time to share it.
I especially loved your point about tool-level permissions. It's interesting how so many conversations around agents still revolve around prompts and model choice, while the real reliability gains often come from what looks like "boring" systems engineering: permissions, scopes, approval workflows, and guardrails. The fact that explicit read/write/execute boundaries eliminated an entire class of failures for you says a lot about where the industry actually needs to focus.
I also agree with your take on memory. The more I researched this piece, the more I realized that the challenge isn't teaching agents to remember everything. It's teaching them what deserves to be remembered in the first place. Context without prioritization quickly becomes noise.
And your final line really stuck with me: the teams that win won't have the smartest models, they'll have the strictest guardrails. I genuinely think that's one of the biggest lessons we're learning as we move from impressive demos to production systems. Thanks again for adding this. Comments like these make the discussion far more valuable than the article alone.
Appreciate you saying that. And you nailed it — "boring systems engineering" is exactly the right framing. The industry has a fascination with making agents smarter, but the production breakthroughs almost always come from making them more constrained. Smarter models with no guardrails just fail more creatively.
On the memory point — one thing I've found useful is treating agent memory like a security scope, not a knowledge graph. Instead of "remember everything and retrieve what's relevant," we define what the agent is allowed to retain per session, per task, per user. It turns memory from an open-ended retrieval problem into a policy problem. Much easier to debug, much harder to leak context across boundaries.
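A rough sketch of memory-as-policy, under the assumption that the policy is an allowlist of field names per task and that storage is partitioned by user and task. `ScopedMemory` and its methods are hypothetical, and a production version would also need expiry and persistence.

```python
class ScopedMemory:
    """Stores only fields the policy allows, partitioned by (user, task)."""

    def __init__(self, allowed_fields: dict[str, set[str]]):
        # Policy: task name -> field names the agent may retain for that task.
        self._allowed = allowed_fields
        self._store: dict[tuple[str, str], dict[str, object]] = {}

    def remember(self, user: str, task: str, field: str, value: object) -> bool:
        """Store the value if policy allows it; return whether it was kept."""
        if field not in self._allowed.get(task, set()):
            return False  # dropped by policy, never written
        self._store.setdefault((user, task), {})[field] = value
        return True

    def recall(self, user: str, task: str) -> dict[str, object]:
        """Return a copy of what was retained for this user and task only."""
        return dict(self._store.get((user, task), {}))
```

Because recall is keyed by both user and task, there is no code path that returns one user's context to another, which is the kind of leak a policy is easier to rule out than a similarity search.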
The other thing nobody talks about: approval gates. Not just "human in the loop" as a checkbox, but actual confirmation workflows before any irreversible action — sending, submitting, spending, deleting. Once you treat agent autonomy as something that has to be earned per-action rather than granted globally, the failure modes shrink dramatically.
Would love to hear if you've seen teams handle the state management side well. That's the next frontier I keep running into — agents that can plan multi-step workflows but lose coherence when one step fails midway.
I really appreciate you sharing these insights from actual production deployments. As someone who approached this topic from a research and technical writing perspective, hearing what people have learned in the trenches adds a completely different dimension to the discussion.
I especially hadn't fully appreciated how much reliability can come from seemingly "ordinary" engineering decisions like permissions and approval workflows rather than increasingly sophisticated prompts. It's a good reminder that building trustworthy systems is often less about chasing intelligence and more about designing sensible constraints.
Thanks again for taking the time to contribute. Conversations like this are one of the reasons I'm enjoying being part of the Dev.to community.
That means a lot, honestly. The best technical writing does exactly what your post did — it gives practitioners a framework to articulate what they're already experiencing but haven't had the language for. Looking forward to your next piece.
Thank you! That's probably one of the nicest compliments a technical writer can receive. I really appreciate you sharing your real-world experience here. It added so much depth to the conversation.
The "buys office supplies twice" failure is the one I'd put at the top 🙂, and it's interesting because the fix isn't an AI problem at all. The second an agent takes actions with side effects, you're back in distributed-systems territory. Retries plus non-idempotent operations equal duplicates, and you solve it the boring old way: idempotency keys, I would say, plus dedupe on the tool side and compensating actions for rollback. Most of the reliability work is making tools safe to call twice, not making the model smarter about calling them once. The SWE-bench point lands too: demo success and unattended-run success are just completely different distributions.
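The idempotency-key idea can be sketched as follows. This is a toy in-memory version under stated assumptions: `OrderTool`, `place_order`, and `key_for` are hypothetical names, and a real tool would keep the key table in durable storage shared by all workers.

```python
import hashlib
import json


class OrderTool:
    """Toy ordering tool that dedupes on an idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}  # idempotency key -> first result
        self.orders_placed = 0

    def place_order(self, idempotency_key: str, item: str, quantity: int) -> dict:
        # A retry with the same key returns the stored result, with no new order.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.orders_placed += 1
        result = {"order_id": self.orders_placed, "item": item, "quantity": quantity}
        self._results[idempotency_key] = result
        return result


def key_for(task_id: str, step: int, args: dict) -> str:
    """Derive a stable key from the task, the step, and the call arguments."""
    payload = json.dumps({"task": task_id, "step": step, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Deriving the key from the task and step, rather than generating a fresh one per attempt, is what makes a retried step collapse into the original call.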
That's a great point and I think it highlights something the AI discussion often overlooks. Once an agent starts interacting with real systems, many of the hardest problems stop being purely AI problems and start looking a lot like traditional software engineering and distributed systems challenges.
I especially like your observation about idempotency. It's easy to focus on making the model smarter, but in many cases reliability comes from designing tools and workflows that remain safe even when the model makes mistakes, retries requests, or behaves unpredictably.
And I completely agree on the demo-versus-production gap. A successful demo proves an agent can complete a task once. A production system has to survive failures, retries, unexpected states and edge cases thousands of times. That's a very different challenge altogether.
Thanks for adding this perspective. It connects the AI conversation back to some timeless engineering principles.
You said it best. I also liked the way you described testing. Since AI is probabilistic, it really is hard to imagine how to measure its performance. This also led me to a better understanding of machine learning. AI is not thinking; it is only guessing based on its training data. But still, developers are uncomfortable letting AI run all the work automations. It's like betting it will find the correct path every single time in a chaotic work environment. If it takes the wrong path, how can we be sure AI will automate its way back to the right one?
I really like this perspective, especially your point about AI finding its way back after taking the wrong path.
I think that's where a lot of the discomfort comes from. Most developers aren't worried about whether AI can get things right occasionally; we've all seen impressive demos. The real question is what happens when it gets things wrong in a messy, unpredictable environment. Can it recognize the mistake? Can it recover gracefully? Or does it confidently continue down the wrong path?
Maybe that's why reliability and guardrails have become such an important part of the conversation around AI agents. The challenge isn't just teaching systems how to act. It's deciding under what conditions we can trust them to act on our behalf.
But the loop issue is actually caused by the model's measurement, so the root cause doesn't lie with the agent. I think an effective way to handle this kind of problem is to add loop repetition detection to the agent.
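A minimal sketch of one form of loop repetition detection, assuming a "loop" means the agent issuing the exact same tool call with the same arguments several times. `LoopDetector` and the `max_repeats` threshold are hypothetical; near-duplicate calls would need a fuzzier signature.

```python
import json
from collections import Counter


class LoopDetector:
    """Flags a tool call once it has been repeated too many times in a run."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def record(self, tool: str, args: dict) -> bool:
        """Count this call; return True once it has occurred max_repeats times."""
        # Sorting keys makes the signature independent of argument order.
        signature = (tool, json.dumps(args, sort_keys=True))
        self._seen[signature] += 1
        return self._seen[signature] >= self.max_repeats
```

When `record` returns True, the orchestrator can stop the run, or hand control to a human, instead of letting the agent keep spending tokens on the same call.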
Agents are hard because they do not just generate text. They make decisions and call tools.
That means you need permissions, memory, evals, logs, retries, human review, and guardrails. Without that, the “agent” is just an LLM with too much access.
The tool-calling fragility section is where most teams hit the wall first, in my experience.
The gap between "it works in the demo" and "it works reliably in production" usually comes down to two things you've identified: output parsing and error cascade. What I'd add is that the fragility compounds with chain length. A single tool call that fails 10% of the time is annoying. A workflow that calls that tool 5 times, each with a 10% failure rate, completes the whole chain less than 60% of the time (0.9^5 ≈ 0.59, assuming the failures are independent). Teams that don't instrument individual tool call success rates never see this until they're deep in production debugging.
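The compounding arithmetic, plus the kind of per-tool counter that would surface it, can be sketched like this. The calculation assumes failures are independent across calls, which real workflows may not satisfy; `chain_success_rate` and `ToolStats` are hypothetical names.

```python
from collections import defaultdict
from typing import Optional


def chain_success_rate(per_call_success: float, calls: int) -> float:
    """Probability that every call in a chain succeeds, assuming independence."""
    return per_call_success ** calls


class ToolStats:
    """Tracks observed success rates per tool so compounding is visible early."""

    def __init__(self) -> None:
        self._ok: dict[str, int] = defaultdict(int)
        self._total: dict[str, int] = defaultdict(int)

    def record(self, tool: str, ok: bool) -> None:
        self._total[tool] += 1
        if ok:
            self._ok[tool] += 1

    def success_rate(self, tool: str) -> Optional[float]:
        """Observed success fraction, or None if the tool was never called."""
        total = self._total[tool]
        return self._ok[tool] / total if total else None


# Five calls at 90% success each: 0.9 ** 5 is about 0.59.
print(round(chain_success_rate(0.9, 5), 4))
```

Feeding each tool's observed `success_rate` into `chain_success_rate` gives a quick estimate of how often a multi-step workflow should be expected to finish cleanly.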
The planning degradation problem (the Arizona State study) maps cleanly onto something I've watched happen: agents perform well on the happy path you tested but fall apart on the first slightly novel state. The underlying issue is that LLMs are trained to complete patterns, not to recognize when they've hit a genuinely novel situation that requires a different plan. They'll confidently proceed with a stale assumption rather than surface the ambiguity.
The architecture answer — constrain the planning surface, verify state at checkpoints, make the human the fallback for high-stakes decisions — isn't exciting to demo but it's what actually ships.
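A small sketch of that pattern, under the assumption that each step can declare a precondition on the current state and a flag marking it as high-stakes. `Step`, `run_plan`, and the `ask_human` callback are hypothetical names, not an existing framework's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

State = dict


@dataclass
class Step:
    name: str
    precondition: Callable[[State], bool]  # checkpoint: is the state as expected?
    action: Callable[[State], State]
    high_stakes: bool = False  # if True, a human must confirm before it runs


def run_plan(
    steps: list[Step],
    state: State,
    ask_human: Callable[[str, State], bool],
) -> tuple[State, Optional[str]]:
    """Run steps in order; return the final state and the step it stopped at.

    The second element is None when every step ran to completion.
    """
    for step in steps:
        # Verify state at the checkpoint instead of proceeding on a stale assumption.
        if not step.precondition(state):
            return state, step.name
        # High-stakes steps fall back to a human decision.
        if step.high_stakes and not ask_human(step.name, state):
            return state, step.name
        state = step.action(state)
    return state, None
```

Returning the name of the step where the run stopped gives the caller a place to resume from, or to escalate, rather than a silent partial result.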