The Night Everything Broke
Two hours. That's all it took to lose months of project context — not to a system crash or a rogue developer,...
In a sort of morbid way, I look forward to seeing how long it takes the bros to realize the emperor is naked. There are uses for generative AI, but they're not nearly as ubiquitous as we're led to believe.
Also - I went through a phase of letting AI refine my writing. It bit me hard and I've since reduced it to helping me form an outline, and once I have the general thread organized I write everything myself.
The writing experience you described is actually a perfect parallel to what I saw with the backlog. Both cases: vague instruction, high-stakes output, no guardrails. "Refine my writing" is almost as ambiguous as "organize my backlog."
Using it for outlines only is exactly the narrow-scope approach that works. You've essentially built your own human-in-the-loop: AI handles the structure, you handle everything that actually matters. That's not a limitation; that's the right architecture.
And just to say the scary part out loud: I don't think that's what most people are thinking about. All I see is "here, bot, write more," whether it's prose or code.
And both professions are going to pay for that.
That framing is exactly right, and it's the part that's hard to say without sounding alarmist. "Here, bot, write more" is the default mode because volume is measurable and judgment isn't. You can track words per hour. You can't track the slow erosion of knowing when not to write, or when code is solving the wrong problem entirely.
The professions that survive this will probably be the ones where the cost of getting it wrong is immediate and undeniable, like surgery or structural engineering. Writing and coding have softer feedback loops. You don't always know the damage until much later.
writing one feels worse tbh. tickets break and you see it. with text it looks better at first and you don't really notice when it's solving the wrong thing
The four failure patterns are spot on, and I'd argue the Coordination Problem is the most underestimated of the four. I run about 10 scheduled AI agents across a portfolio of projects — content generation, site auditing, SEO monitoring, community engagement. Each one individually works fine when scoped to a single task. But the moment they start touching the same data or need to coordinate outputs, things get interesting fast.
The pattern that actually works for me: each agent writes to its own log file, a separate review agent reads all the logs once a week, and a human (me) makes every decision that involves changing production data. It's not glamorous. It would never make a demo reel. But nothing has been silently deleted in months.
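Stripped down, that pattern looks roughly like this. This is a minimal sketch with made-up agent names and a throwaway log directory, not my actual code:

```python
import json
import tempfile
import time
from pathlib import Path

# Fresh directory for this sketch; in practice this would be a persistent
# path that only these agents are allowed to write into.
LOG_DIR = Path(tempfile.mkdtemp())

def log_action(agent_name: str, action: dict) -> None:
    """Each agent appends to its own log file and never touches another's."""
    entry = {"ts": time.time(), "agent": agent_name, **action}
    with open(LOG_DIR / f"{agent_name}.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")

def weekly_review() -> list:
    """The review agent reads every log and flags anything that touched
    production; a human then decides what to do with each flagged entry."""
    flagged = []
    for log_file in sorted(LOG_DIR.glob("*.jsonl")):
        for line in log_file.read_text().splitlines():
            entry = json.loads(line)
            if entry.get("target") == "production":
                flagged.append(entry)
    return flagged

log_action("seo_monitor", {"type": "report", "target": "staging"})
log_action("site_auditor", {"type": "fix", "target": "production"})
print(len(weekly_review()))  # → 1
```

The point is the isolation: no agent reads or writes another agent's state, and nothing reaches production without a human looking at the flagged list first.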
Your "infrastructure before capability" principle should be tattooed on every team building with agents right now. I spent more time building guardrails — deploy safety checks, minimum page count thresholds before syncing to production, content validation pipelines — than building the actual generation logic. And that's the only reason the system hasn't eaten itself.
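One of those guardrails, the minimum page count threshold, is almost embarrassingly simple. A sketch of the idea (the threshold value is illustrative):

```python
def safe_to_sync(generated_pages: list[str], min_pages: int = 10) -> bool:
    """Deploy safety check: a crashed or truncated generation run produces
    too few pages, so a low count blocks the production sync entirely
    instead of quietly shipping a gutted site."""
    return len(generated_pages) >= min_pages

full_run = [f"page-{i}.html" for i in range(25)]
crashed_run = ["page-0.html", "page-1.html"]
print(safe_to_sync(full_run), safe_to_sync(crashed_run))  # → True False
```

A few lines of deterministic checking, but it is the difference between a bad run being a non-event and a bad run being an outage.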
The naming point from the comments is also underrated. "Agent" implies judgment. What most of us are actually building is closer to "scheduled automation with an LLM in the loop." Less exciting, but way more honest about what the system can and can't do.
The architecture you're describing (each agent writes to its own log, a review agent reads all logs weekly, a human makes every production decision) is exactly what "infrastructure before capability" looks like in practice. Not glamorous, not demo-reel worthy, but nothing silently deleted in months. That's the metric that actually matters.
The coordination problem being the most underestimated of the four tracks with what I've seen too. Individual agents are manageable. The moment they share state or need to sequence outputs, you've introduced distributed systems complexity on top of LLM non-determinism. Most teams don't realize they've built that until something breaks in a way that's genuinely hard to reproduce.
"Scheduled automation with an LLM in the loop": I want to use this. "Agent" carries too much intent. It implies the system has judgment, context, accountability. What most of us are actually running is closer to a very capable cron job that sometimes surprises you. The naming matters because it sets the expectation, and wrong expectations are what lead to giving it production access it shouldn't have.
The guardrails taking more time than the generation logic is the part nobody talks about in the demos. That ratio, more time on safety than on capability, is probably the most honest signal that a system is production-ready.
The naming distinction you're drawing is really sharp — "scheduled automation with LLM in the loop" is a much more honest description of what most of us are actually running. The word "agent" does set expectations that lead to giving these systems too much autonomy too fast.
Completely agree on the guardrails-to-generation ratio. I spend probably 60% of my time on safety checks, dedup logic, and blast radius limits — not on the LLM prompts themselves. The generation is almost the easy part. The hard part is making sure a crashed run doesn't silently corrupt state or double-post something irreversible.
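The blast radius limit plus dedup combination is less code than it sounds. Roughly something like this sketch (the names and the limit of 5 are made up for illustration):

```python
class BlastRadiusExceeded(Exception):
    """Raised when a batch would mutate more records than allowed."""

def apply_changes(changes: list, seen_ids: set, max_mutations: int = 5) -> list:
    """Dedup against already-processed IDs, then refuse to run if the
    batch would mutate more records than the configured blast radius."""
    fresh = [c for c in changes if c["id"] not in seen_ids]  # dedup step
    if len(fresh) > max_mutations:
        # Fail closed: a human reviews oversized batches instead of the
        # agent quietly mutating half the database.
        raise BlastRadiusExceeded(
            f"{len(fresh)} changes exceeds limit of {max_mutations}"
        )
    for c in fresh:
        seen_ids.add(c["id"])
    return fresh

seen = {"t1"}  # t1 was already processed in an earlier run
batch = [{"id": "t1"}, {"id": "t2"}, {"id": "t3"}]
applied = apply_changes(batch, seen)
print([c["id"] for c in applied])  # → ['t2', 't3']
```

The design choice that matters is failing closed: when the batch is suspiciously large, nothing runs until a person approves it.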
This is exactly what I meant. Thanks for bringing real numbers to it. 🙏
60% on safety checks: that's the kind of ratio most people don't realize until they've actually built something at scale. The LLM part is fun, so it gets all the attention. But the boring stuff (dedup logic, blast radius limits, state management) is what actually determines if something is production-ready.
And you're absolutely right about the crash risk: silent corruption or irreversible double-posting is the nightmare scenario. That's why "agent" is such a dangerous label. It makes people skip the guardrails because they think the model is smarter than it is.
Really appreciate you sharing this. Real-world experience > theory, every time. 🙌
100% agree on the "agent" label being dangerous. I've seen it firsthand — when you call something an agent, people assume it handles edge cases intelligently. But the reality is most of the reliability comes from boring deterministic checks, not the LLM.
The 60% safety ratio honestly surprised me too when I first measured it. You expect the AI part to dominate the codebase, but it's really the validation layer, retry logic, and idempotency checks that make the difference between a demo and something you trust to run at 3am unsupervised.
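An idempotency check is a good example of how small that unglamorous code can be. A rough sketch of the idea (in-memory set for illustration; real storage would be durable so it survives a crash):

```python
import hashlib
import json

posted = set()  # in production this would be a database, not process memory

def idempotency_key(payload: dict) -> str:
    """Deterministic key so a retried run recognizes its own earlier work."""
    canonical = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def post_once(payload: dict) -> bool:
    """Returns True only on the first attempt; a crash-and-retry loop
    calling this twice cannot double-post the same content."""
    key = idempotency_key(payload)
    if key in posted:
        return False  # already posted, skip silently
    posted.add(key)
    return True

msg = {"channel": "blog", "body": "weekly update"}
first = post_once(msg)
second = post_once(msg)  # simulated retry after a crash
print(first, second)  # → True False
```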
Great framing on "production-ready" vs "impressive demo" — that's the real divide in this space right now.
This right here. 🙌
"Something you trust to run at 3am unsupervised": that's the real definition of production-ready. Not benchmarks, not demos. Just quiet, boring, reliable automation that doesn't need a babysitter.
The fact that you measured the 60% ratio and it surprised you too tells me most people are probably running at 80-90% without even realizing it. The AI hype hides the real cost.
Really appreciate this thread. Conversations like this are more valuable than any benchmark. 🚀
Partially agree, partially disagree — and I think the nuance matters.
You're right that most "agentic AI" today is overhyped wrapper layers around LLM calls that barely qualify as agents. The demo-to-production gap is enormous. Most fail at the first unexpected edge case.
But here's where I push back: the problem isn't that agentic AI is fundamentally overhyped — it's that most teams are building agents that only generate text. The real unlock comes when agents take real actions.
We build AnveVoice (anvevoice.app) — a voice AI agent that takes actual DOM actions on websites. Clicks buttons. Fills forms. Navigates pages. Not simulated, not sandboxed — real operations on live sites. The engineering challenge is genuinely hard (sub-700ms latency across 50+ languages while maintaining safety guardrails), but the value proposition is clear and measurable.
The hype is real for text-generation agents repackaged as "agentic." The potential is also real for agents that actually execute in the real world. The industry just needs to stop confusing the two.
That's a fair and important distinction, and honestly one I should have drawn more clearly in the article.
Text-generation wrapped in an agent loop ≠ actual agentic behavior. You're right that the real unlock is when agents take irreversible real-world actions.
But that's also exactly what scares me. The higher the stakes of the action (clicking buttons, submitting forms, navigating live sites), the more catastrophic the failure mode when it goes wrong.
AnveVoice sounds like it's doing this right, though. Sub-700ms with safety guardrails is not a wrapper; that's real infrastructure. How are you handling edge cases where the agent misidentifies the target element on an unfamiliar site?
Agent-based development really has both benefits and potential issues, but unfortunately not many people talk about this. In my opinion, about 70–80% of people are fascinated and just blindly follow the trends. Recently, I did my own research where I described the possible problems of working with agentic AI and explained why AI won’t be able to replace software engineers.
If you’re interested: Will AI Replace Software Developers?
The 70-80% blindly following trends observation feels right, and it's not even always blind enthusiasm; sometimes it's just FOMO dressed up as strategy. Teams adopt agentic tools because everyone else seems to be, not because they've thought through what problem it actually solves for them.
The "AI won't replace software engineers" angle is one I mostly agree with, though I'd frame it slightly differently: the engineers who understand where agents fail will have a significant advantage over those who only know how to use them when they work. That gap is going to matter more over time.
Will check out your article.
This part of agentic AI cuts the hype out. To be honest, I was just exploring: I gave an AI agent (Cursor) access to my small project and asked it to analyze it. We assume the agent will look at every line of the codebase and learn it, but no agent actually does that. There's a context limit, which means when you say "analyze this project," the agent only creates a blueprint of your project structure and uses that.
The problem here: someone like me who's lazy added validation schemas and type definitions inside the same file, and the agent assumed that file only contained type definitions and models.
That's exactly what happened to me, and I only recognized it when the agent started suggesting I add validation schemas. I was left wondering what it actually does when it says "I've analyzed your entire project!"
That day I learned one thing: handing your entire project to AI agents is useless. Instead, we should share specific files and work on modules one by one. It feels slow, but it's actually the best way to use AI agents and avoid unwanted changes and DB conflicts.
I really appreciate you sharing this amazing article and clearing up lots of doubts about the hype.
The context window blindspot is exactly it: agents don't 'read' your project, they skim the structure and fill the rest with assumptions. Module-by-module is slow, but it's the only way that actually works reliably right now.
What I've started doing: treat the agent like a new junior dev. You wouldn't hand a junior your entire codebase on day one. You'd give them one file, one task, one definition of done.
Same principle. Different tool.
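Mechanically, the "one file, one task" handoff can even be enforced in code. A hypothetical sketch (the function name and the character budget are made up, and the budget stands in for the model's real context limit):

```python
from pathlib import Path

def build_scoped_context(module_dir: str, task: str, max_chars: int = 12_000) -> str:
    """One module, one task: concatenate only this module's files into the
    prompt, and refuse outright if the module won't fit in the budget."""
    parts = [f"# Task: {task}"]
    total = len(parts[0])
    for path in sorted(Path(module_dir).rglob("*.py")):
        text = path.read_text()
        total += len(text)
        if total > max_chars:
            # Fail loudly instead of letting the agent silently skim
            # a truncated view and "analyze the entire project."
            raise ValueError(f"{module_dir} exceeds context budget; split further")
        parts.append(f"## File: {path.name}\n{text}")
    return "\n\n".join(parts)
```

The failure mode it prevents is exactly the one described above: when the module doesn't fit, you find out immediately, rather than discovering later that the agent was working from a blueprint instead of the code.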
Really appreciate the honesty here. Most "agentic AI" demos are glorified prompt chains that fall apart the moment they hit a real user environment. The gap between a polished demo and production reliability is massive.
That said, I think the problem isn't that agentic AI is impossible — it's that most implementations are trying to do too much autonomously without proper guardrails.
We've been building AnveVoice (anvevoice.app) — a voice AI that takes real DOM actions on websites (clicking buttons, filling forms, navigating pages). The key insight was constraining the agent to a well-defined action space with sub-700ms latency, rather than trying to be a general-purpose autonomous agent.
The overhyped version: "AI that does everything for you."
The version that actually works: "AI that does specific things reliably within tight constraints."
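To make "tight constraints" concrete: a constrained action space can be as simple as an explicit allowlist that everything else bounces off. This is a toy sketch of the idea, not how AnveVoice is actually implemented:

```python
from typing import Callable

# The agent may only ever request actions from this table; anything else
# is rejected before it gets anywhere near the page.
ALLOWED_ACTIONS: dict[str, Callable[[str], str]] = {
    "click": lambda target: f"clicked {target}",
    "fill": lambda target: f"filled {target}",
    "navigate": lambda target: f"navigated to {target}",
}

def dispatch(action: str, target: str) -> str:
    """Execute only known actions; unknown intents fail loudly instead of
    being improvised by the model."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action {action!r} not in allowed action space")
    return ALLOWED_ACTIONS[action](target)

print(dispatch("click", "#submit"))  # → clicked #submit
```

The constraint is the feature: the model proposes, but only the allowlist disposes.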
Great post — this is exactly the kind of honest conversation the industry needs.
That last line deserves to be quoted everywhere: 'AI that does specific things reliably within tight constraints' is the most honest definition of working agentic AI I've seen.
The constraint-first approach is exactly what the industry keeps skipping. Everyone wants to build the general-purpose agent because that's what gets the funding and the press. Nobody announces 'we built a very reliable narrow agent' — even though that's actually harder and more valuable.
The fact that AnveVoice constrained the action space first and got sub-700ms latency as a result proves the point. Constraints aren't a limitation of the vision; they're the engineering discipline that makes the vision real.
This comment should be required reading for every team currently in the 'why is our agent failing in production' phase.
AI continues to develop rapidly, but it does not seem possible for us to use it effectively in many areas or to obtain truly reliable and accurate results. Despite this, the sector keeps expanding as if it were a bubble, constantly exaggerated and overhyped. The fact that many developers cannot clearly foresee the future of the field will likely cause them to eventually hit a wall.
Fair points. I think the hype is definitely outpacing real world reliability right now. But at the same time, a lot of foundational tech (LLMs, tool use, etc.) is still maturing. Maybe instead of a wall, we’ll see a consolidation phase where only practical, high-reliability use cases survive. The bubble part I agree with too many demos, too few production-grade systems.
The 47-deleted-tickets story is the most honest agentic AI failure description I have read. The pattern where the agent acts confidently with zero confirmation prompts is exactly the gap most frameworks still ignore.
The "zero confirmation prompts" gap is the one that still surprises me when I look at how most frameworks are designed. The default behavior is action, not verification. You have to deliberately build in the pause; it doesn't come included.
What makes it worse is that confidence without confirmation is literally the selling point in most demos. "Watch it just handle everything" is the pitch. The problem only becomes visible when "everything" includes decisions that should have had a human in the loop, and by then the tickets are already gone.
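Building in the pause can be a thin wrapper around every destructive verb. A minimal sketch (the verb list is hypothetical, and `confirm` is injectable so dry runs and tests don't block on stdin):

```python
DESTRUCTIVE = {"delete", "reassign", "close"}

def execute(action: str, items: list, confirm=input) -> bool:
    """Gate every destructive verb behind an explicit human yes;
    the default answer is refusal, not action."""
    if action in DESTRUCTIVE:
        answer = confirm(
            f"Agent wants to {action} {len(items)} items. Proceed? [y/N] "
        )
        if answer.strip().lower() != "y":
            return False  # nothing happens without an explicit yes
    # ... perform the action here ...
    return True

# A declined confirmation: 47 tickets survive.
result = execute("delete", list(range(47)), confirm=lambda prompt: "n")
print(result)  # → False
```

Ten lines, and it would have been the difference between a scary prompt and a destroyed backlog.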
The frameworks that do handle this well tend to be the ones built by teams who got burned first. It's almost a rite of passage at this point which is a terrible way for an industry to learn, but here we are.
I agree with this a lot. People keep treating polished agent demos like proof that the whole thing works in production, and to me that’s just not true. The tech is real, but a lot of the framing around it feels way ahead of the actual reliability.
"Framing ahead of actual reliability": that's the most precise way I've seen the problem described. The tech earns trust slowly through working systems. The framing earns attention fast through polished demos. They're running at completely different speeds, and the gap between them is where most projects get destroyed.
The demo problem is almost self-reinforcing: every successful demo raises expectations, which leads to more ambitious deployments, which leads to more failures, which somehow leads to more demos promising it'll be different this time.
I don't think it's overhyped; I think it's misused, and that's a different problem entirely.
Your ticket story is a perfect example of what happens when people hand autonomous control to a tool they don't understand yet. The agent didn't fail because agentic AI is broken. It failed because nobody defined the boundaries, the trust level, or what "done" actually means in that context.
There's so much coming out that people are grabbing everything that looks good without knowing what they actually need.
By the time they've tried all the candy in the store, they've got a toothache and nothing to show for it.
I've been building with agents for a while now. The ones that work aren't the ones doing everything, they're the ones doing one thing well, with clear handoff points back to a human. That's not a limitation, that's the design.
The problem was never the tools. It's that most people don't know how to evaluate them, scope them, or know when to stop.
So they run the demo, get burned in production, and call the whole thing overhyped.
The hype is real. But so is the technology. The gap in between is a people and process problem, not a tech problem.
"Misused vs. overhyped" is a fair distinction, and honestly I agree with most of what you've written. The ones doing one thing well, with clear handoff points: that's exactly the working version I described in the article.
But here's where I'd push back: the hype isn't separate from the misuse. The hype is causing the misuse. When every conference talk, every pitch deck, every LinkedIn post is selling "autonomous digital employees," people don't misuse the tool by accident. They're using it exactly the way it was sold to them.
If the marketing said "narrow, scoped agents with human checkpoints," the misuse rate would be lower. The gap between demo and production isn't just a people problem. It's a framing problem that the industry is actively creating and profiting from.
"The hype is real. But so is the technology." I'd add: and the hype is actively making it harder for the technology to succeed. That's what makes it worth calling out.
The "Black Box" of Autonomy: When Action Outpaces Intent
This is a sobering and necessary reality check. Your experience with the backlog isn't just a "bug" in a tool; it’s a fundamental collision between deterministic expectations and non-deterministic agency. We’ve spent decades building software that does exactly what we tell it to, but we are now shifting into an era where software does what it thinks we meant.
The Illusion of "Organization"
What strikes me most is your point about "organize" being a subjective command. In a system built on logical flows, "organize" isn't a function—it's a philosophy. When we hand that over to an agent without a shared mental model, we aren't delegating a task; we are delegating contextual judgment.
The agent treats your backlog like a static dataset to be optimized, whereas a human treats it like a living history of intent. When the agent deleted those 47 tickets, it wasn't "failing" its internal logic; it was maximizing a "deduplication" stream while being completely blind to the "historical context" stream.
The Orchestration Debt
The "Coordination Problem" you mentioned is the hidden tax of agentic design. We’re moving from writing code to managing digital ecosystems.
Ten agents interacting isn't just a "multi-agent system"; it's a high-entropy environment where the probability of an emergent, unintended "hallucinated workflow" increases with every new connection.
As you noted, the infrastructure required to make this safe (audit trails, rollbacks, state management) often outweighs the complexity of the task itself.
The Paradox of the "Decade of the Agent"
It’s fascinating that the industry's response to these failures is to extend the timeline rather than narrow the scope. We’re seeing a shift from Global Autonomy (the agent that manages everything) to Local Intelligence (the agent that does one specific thing with a human-in-the-loop).
The real value, as you've hinted, isn't in the autonomy, but in the augmentation. An agent that labels an issue is a tool; an agent that reassigns your team is a liability.
A Final Reflection
I wonder if the "Peak of Inflated Expectations" is actually a failure of our own language. We call them "Agents," which implies a surrogate with judgment. Perhaps if we called them "Probabilistic Automations," we would be much more careful about giving them the "Delete" key.
Thanks for sharing the "un-demoed" reality. It’s a vital reminder that the most sophisticated system in the room is still the one sitting in the chair.
The "Black Box of Autonomy" framing is exactly right, and the distinction between "does what we tell it" and "does what it thinks we meant" is the one most teams don't realize they've crossed until something breaks. The backlog incident was precisely that collision. I said "organize." The agent built a mental model of what organized means. Those two things never overlapped.
The "Illusion of Organization" point about contextual judgment delegation is the sharpest thing I've read on this. We frame it as task delegation, but we're actually offloading judgment to a system with no shared context, no history, no skin in the game. The agent didn't fail to organize; it succeeded at optimizing a static dataset. The failure was mine for pretending those were the same thing.
"Orchestration Debt" is a term I'm going to use going forward. A high-entropy environment where hallucinated workflows increase with every new connection: that's exactly what nobody's pitch deck models.
The Global Autonomy vs. Local Intelligence distinction is where I think the industry actually needs to land. Not agents that replace workflows, but agents that handle the one thing a human shouldn't have to touch, with a human still holding the context above it.
"Probabilistic Automations": honestly, if that term had been in use from the start, the expectations gap wouldn't be nearly as wide. The name "agent" carries too much intent. It implies someone who represents you. "Probabilistic automation" implies a tool that needs supervision. Same technology, completely different mental model going in.
Thank you for this; it's genuinely one of the best comments I've received on anything I've written.
You made an excellent point because it is so hard to keep up with the latest technology in AI these days.
That's exactly the trap: the pace of AI releases is so fast that even practitioners can't keep up, let alone evaluate what's actually production-ready versus what's just impressive in a demo.
I think the real skill right now isn't knowing every new agent framework. It's knowing which ones are worth your time, and that only comes from actually building and breaking things in production.
That is true
Exactly ❤️🔥
And yet you're Vibe Posting; that's ironically delightful.
The irony is intentional, actually. I used AI to help structure the argument, not to write it. The backlog incident, the 18 months of building, the failure patterns: all mine. AI just helped me organize them more clearly.
Which is exactly the distinction the article is making. AI as a thinking partner with a human driving? That works. AI as an autonomous agent acting without supervision? That's what deleted 47 tickets.
Vibe posting with assistance is still vibe posting. 😄
And I'm guilty as charged as well. Since 2024 😅
dev.to/ker2x/the-great-web-ai-ensh...
Haha, at least we're honest about it. That puts us ahead of a $50 billion company that forgot to mention their base model. 😄
Will check out your article!