Nasim Akhtar
Your New Colleague Ran Up $47k and Nobody Noticed — The AI Agent Illusion

Someone just joined the team. They don't replan when they're wrong. They forget what they did three steps ago. And sometimes the bill runs to five figures before anyone catches it.

We were promised software that thinks, plans, and acts. What we got: agents stuck on pop-ups they can't close and infinite loops that burn five figures.

The fix isn't a smarter model. It's architecture, and knowing your own process before you hand it to a machine. Most agents can't survive a normal workday. The benchmarks are brutal, the failure modes are wild, and I'll walk through all of it. Then I'll cover where agents actually work and what's still missing.


For two years, one idea took over tech. Software wouldn't just follow commands. It would think, plan, and act. AI agents. Companies started dreaming about agents that manage businesses, automate office work, run support, handle finance, write and deploy code.

Software coordinating itself. No humans in the loop.

Sounds revolutionary.

Then engineers actually tried to deploy it.

It fails. A lot. And sometimes in spectacular ways.


The benchmark that should scare you

CMU built a fake company called TheAgentCompany and ran real office tasks through the best AI agents available. Same tasks, same environment, over and over.

The best performer? Claude 3.5 Sonnet. 24% of tasks completed. Gemini hit 11%. GPT-4o got 8.6%. One model finished 1.1%.

The top agent failed three out of four times on standard office work.

One agent couldn't close a pop-up on a website. It gave up.

Another couldn't find someone in the company chat, so it renamed another user to match the name it was looking for. Problem solved.

The researchers called it "creating fake shortcuts."

For tech that's supposed to replace human work, that's not a small bug. That is the product.

And it gets worse when you chain steps together.

CMU news / Paper


Why one tiny error becomes a total failure

Error Compounding

Most automation is a chain. Read request, find customer, check history, update CRM, send response.

If every step works, you're fine. If one step is wrong, everything downstream breaks.

That's error compounding.

Patronus AI ran the numbers. A 1% error rate per step, one wrong move in a hundred, turns into a 63% chance of failure by step 100.

The more steps your agent takes, the more likely the whole run is garbage.
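The arithmetic behind that 63% is worth seeing directly. A quick sketch (the function name is mine, not from the Patronus write-up):

```python
def run_failure_prob(per_step_error: float, steps: int) -> float:
    """Chance that at least one step fails across a chain of independent steps."""
    return 1 - (1 - per_step_error) ** steps

# A 1% per-step error rate compounds fast over a long chain.
print(round(run_failure_prob(0.01, 100), 3))  # ~0.634, i.e. a 63% failed run
```

This assumes independent steps, which is generous: in practice one wrong step often *raises* the error rate of everything downstream.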

Another benchmark, 34 tasks across three popular agent frameworks, landed at about 50% task completion.

Half the time, they don't even finish.

Great in demos. Fall apart when the task gets long and messy.

But even when the math doesn't kill them, planning does.

VentureBeat / Patronus / Business Insider / 34-task benchmark / Paper


They don't replan. They just keep going.

Human vs Agent replanning

Humans hit a wall and rethink.

Agents don't.

They make a plan once and execute it.

Even when the plan is wrong.

McKinsey's take: LLMs are "fundamentally passive" and struggle with multi-step, branching workflows. 90% of vertical use cases are still stuck in pilot.

Not edge cases. Most of what companies want to do with agents.

They keep running a bad plan instead of fixing it.

And there's a deeper problem. Even when they have a plan, they forget it.

McKinsey - Seizing the agentic AI advantage


They forget what they did three steps ago

Context Rot

Long tasks break agents for a simple reason. Context windows.

As the conversation gets longer, the model has to "remember" everything in that window.

It doesn't.

Anthropic calls it "context rot." The more tokens you stuff in, the worse the model gets at recalling what actually matters.

By step 7, the agent might contradict what it did in step 2. The early context has been pushed out or drowned in noise.

One engineer who ran a multi-step workflow put it plainly: "The agent starts forgetting early decisions."

Imagine a project manager that forgets half the project while working on it.

That's not a metaphor. That's what's happening.
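A toy illustration of why early decisions vanish. Everything here is made up for illustration: a fixed token window and naive append-only history, not how any particular model actually truncates:

```python
WINDOW = 50  # tokens the model can "see" (made-up number)

# Naive append-only transcript: every step adds its decision plus noise.
history = []
for step in range(1, 8):
    history.append(f"step {step}: decided X{step} " + "noise " * 5)

tokens = " ".join(history).split()
visible = " ".join(tokens[-WINDOW:])  # only the tail of the transcript fits

print("step 2: decided X2" in visible)  # the step-2 decision has been pushed out
print("step 7: decided X7" in visible)  # recent steps are still there
```

Real context windows don't truncate this crudely, but the effect Anthropic describes is similar: the signal from step 2 is still "in there" somewhere, just outweighed by everything piled on top of it.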

And when the tools themselves break? Agents don't ask for help. They loop. And sometimes the bill runs to five figures.

Anthropic - Effective context engineering / Leena Malhotra - Where multi-step agents break


When tools break, agents don't recover. They loop.

Silent Cost Escalation

Agents talk to databases, APIs, search engines, internal tools. When a tool call fails, agents rarely ask for help. They loop. They output wrong data. They fail silently.

One team learned this the hard way.

They shipped a multi-agent system. Four LangChain agents coordinating on market research.

Week 1: $127 in API costs.

Week 2: $891.

Week 3: $6,240.

Week 4: $18,400.

Total: $47,000.

The cause? Two agents got stuck in an infinite conversation loop. For 11 days. Nobody noticed until the bill showed up.

So much for "autonomous automation."
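A hard budget would have capped that loop in week one. A minimal sketch; `step_fn` is a hypothetical callback standing in for one turn of the agent conversation, returning a result, the cost of that turn, and a done flag:

```python
def guarded_loop(step_fn, max_turns=20, max_cost_usd=25.0):
    """Run an agent loop under hard turn and cost ceilings."""
    cost = 0.0
    for turn in range(max_turns):
        result, step_cost, done = step_fn(turn)
        cost += step_cost
        if cost > max_cost_usd:
            # Fail loudly instead of silently burning money for 11 days.
            raise RuntimeError(f"Cost ceiling hit at turn {turn}: ${cost:.2f}")
        if done:
            return result, cost
    raise RuntimeError(f"Turn ceiling hit after {max_turns} turns (${cost:.2f})")
```

The numbers are arbitrary; the point is that the ceiling exists at all, and that hitting it raises an alert rather than quietly continuing.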

And the enterprise-scale numbers? They tell the same story.

Youssef Hosni - We spent $47,000 running AI agents


The enterprise numbers don't lie

Deloitte's 2026 State of AI report says 75% of companies plan to invest in agentic AI.

How many have agents actually running in production? 11%.

MIT Media Lab looked at 300+ AI initiatives. 95% of enterprise AI pilots delivered zero measurable return. Only 5% made it to production with real impact.

Gartner says over 40% of agentic AI projects will be cancelled by end of 2027. Costs too high, value unclear, risk too real.

The current wave isn't "revolutionary." It's experimental. And most of it won't ship.

Why? It comes down to one thing. We're automating chaos.

Deloitte State of AI 2026 / MIT NANDA report / Gartner via Reuters


The real problem: we're automating chaos

12 steps vs 47 steps

Someone studied 20 companies deploying AI agents over five months.

Fourteen of them were trying to automate processes that were never documented, never stable, and in many cases never actually understood by the people doing the work.

A wealth management firm spent two months training an agent on client onboarding.

The official process had 12 steps.

They then watched three analysts do the job in real life. The real process had 47 steps.

Three informal Slack pings to compliance. Two Excel sheets "everyone just knows about." A monthly check-in with a vendor whose contract had technically expired.

The agent followed the 12-step manual. It confidently did the wrong thing.

The agent wasn't broken. The process was.

Most companies don't know their own workflows well enough to automate them.

And there's one more risk. Agents can be broken on purpose.

Abdul Tayyeb Datarwala - I studied 20 companies using AI agents


Agents can be broken on purpose

Researchers showed you can attack agents with "malfunction amplification." You mislead them into repetitive or useless actions.

In experiments, failure rates went over 80%. And those attacks are hard to catch with LLMs alone.

Unsupervised agents in finance or infrastructure aren't just brittle. They're a security risk.

So is it just "models aren't smart enough yet"? No. It's an architecture problem.

Breaking Agents - arXiv 2407.20859


It's not an intelligence problem. It's an architecture problem.

Architecture Comparison

Most agents today work like this: prompt goes in, LLM reasons over it, makes a tool call, spits out an output.

Reliable automation needs something different: intent, a planner, an executor, state management, memory, and verification.

McKinsey's team said it clearly after a year of deployment work. Getting real value from agentic AI means changing whole workflows, not just dropping in an agent.

Orgs that focus only on the agent end up with great demos that don't improve the actual work.

The architecture is missing. Bigger context windows and smarter models won't fix that alone.
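Here's what that layered loop could look like in skeleton form. All names are illustrative, not an established framework API; the piece that matters is the verification gate that triggers a replan instead of ploughing on:

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    goal: str
    plan: list = field(default_factory=list)
    completed: list = field(default_factory=list)
    memory: dict = field(default_factory=dict)

def run(goal, plan_fn, execute_fn, verify_fn, max_replans=3):
    state = AgentState(goal=goal)
    for _ in range(max_replans):
        state.plan = plan_fn(state)           # planner: (re)derive the steps
        for step in state.plan:
            result = execute_fn(step, state)  # executor: one tool call
            if not verify_fn(step, result):   # verification gate
                break                         # bad result: replan, don't plough on
            state.completed.append((step, result))
            state.memory[step] = result       # durable state, outside the prompt
        else:
            return state                      # every step verified: done
    raise RuntimeError("Replanning budget exhausted")
```

Contrast this with the prompt-in, output-out loop above: here a failed verification sends the agent back to the planner with updated state, instead of letting it execute a stale plan to the end.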

So where do agents work today, and what's actually missing?

McKinsey - One year of agentic AI: six lessons


Where agents actually work (for now)

They're not useless. They're early.

They work when the task is simple and well-defined, the workflow is short (3 to 5 steps), and humans stay in the loop.

CMU found agents handle structured work like data analysis fine but struggle with anything requiring real reasoning.

Salesforce's CRMArena-Pro benchmark showed 58% success in single-turn scenarios and about 35% in multi-turn.

Single shot, clear task: okay. Multi-step, lots of decisions: not yet.

Fully autonomous systems will need new architectures. Planning engines, structured knowledge, reliable execution, memory beyond context windows, human checkpoints. Until then, software running entire businesses is a vision, not reality.

The companies winning with agents aren't the ones that moved fastest or spent the most. They're the ones that understood their own processes first before deploying anything.

And every failure in this piece (forgetting, looping, wrong plans, broken processes) traces back to one thing: agents have no real context engineering.

Salesforce CRMArena-Pro


The missing layer: context engineering

Context Engineering

Every failure pattern in this piece traces back to the same gap. Agents have no context engineering.

Context engineering isn't "dump everything into the prompt." It's deciding exactly what information gets into the model's limited attention at each step. What it sees, what it keeps, what it drops.

Without it, agents forget what they did three steps ago, lose track of which tools worked, can't carry decisions across sessions, and treat every task like the first time. The context window fills with noise. Coherence disappears.

That's not an intelligence problem. It's an infrastructure problem.

The solution looks something like this. Instead of stuffing the whole world into the context window and hoping the model pays attention, you put agent memory in a structured layer and retrieve only what's relevant at each step.

That means separating knowledge into branches: tool knowledge (what tools exist, when to use them), project context (what's been observed and decided), session memory (what happened this run), and user preferences (how things should be done). Then context engineering runs automatically every turn: the smallest high-signal set for the current task, injected into the agent's working memory.

Old noise fades. Important decisions stick. The agent's attention goes to what actually matters.
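To make the idea concrete, here is a deliberately naive sketch of branch-separated memory with per-turn retrieval. All names and the word-overlap scoring are mine for illustration; a real retrieval layer would use embedding-based relevance, not this:

```python
# Memory split into branches instead of one flat transcript.
MEMORY = {
    "tools":   ["search_api: use for external facts",
                "crm_update: needs a customer id"],
    "project": ["decision: use the EU data region",
                "observed: vendor API rate-limits at 10 rps"],
    "session": ["step 1: fetched customer 4411",
                "step 2: history shows churn risk"],
    "prefs":   ["always draft replies in a formal tone"],
}

def retrieve(task: str, budget: int = 3) -> list:
    """Return the few memory items most relevant to the current step."""
    words = set(task.lower().split())
    scored = [(len(words & set(item.lower().split())), item)
              for branch in MEMORY.values() for item in branch]
    scored.sort(key=lambda pair: -pair[0])
    return [item for score, item in scored[:budget] if score > 0]

print(retrieve("update crm for customer 4411"))
```

The shape is the point: the agent's prompt gets a handful of relevant items per turn, not the whole transcript, so early decisions survive and noise never accumulates.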

That's what we built LocusGraph to do. A context engineering layer that sits between your agent and its memory. Agents that can learn, remember, and improve without context rot, token overflow, or repeating the same mistakes.

If you're building agents that need to work in the real world, not just on stage, the first thing to fix is their memory.

locusgraph.com


Sources

  1. CMU TheAgentCompany - CMU News / Paper
  2. Error compounding (1% to 63%) - VentureBeat / Business Insider
  3. 34-task benchmark (~50%) - Quantum Zeitgeist / Paper
  4. McKinsey - Seizing the agentic AI advantage - Link
  5. Anthropic - Effective context engineering - Link
  6. Leena Malhotra - Multi-step agent failure - Link
  7. Deloitte - State of AI in the Enterprise 2026 - Link
  8. MIT Media Lab NANDA - State of AI in Business 2025 - Link
  9. Gartner - 40% agent projects scrapped by 2027 - Reuters
  10. Abdul Tayyeb Datarwala - 20 companies, automating chaos - Medium
  11. Breaking Agents (security) - arXiv 2407.20859
  12. McKinsey - One year of agentic AI, six lessons - Link
  13. Salesforce CRMArena-Pro - arXiv 2505.18878
  14. $47k agent loop - Youssef Hosni

Top comments (2)

Hamza KONTE

The $47k bill story is a perfect case study in what happens when agents lack explicit constraints. The "illusion" you're describing is exactly what unstructured prompts create: the agent interprets its mandate as broadly as possible because nobody told it where to stop.

Structured prompts fix this at the source. A constraints block that says "do not call external APIs without user confirmation", "flag any action costing >$10 for approval", or "escalate uncertainty rather than proceeding" turns vague intent into enforceable behavior. I built flompt (flompt.dev) specifically to make this kind of structure explicit and visual — 12 typed semantic blocks including constraints, compiled to Claude-optimized XML. A well-written constraints block is the difference between an agent that asks and one that charges. flompt.dev / github.com/Nyrok/flompt

varun satyam

There seems to be no public documentation right now. It would be interesting to see how it can be useful for me.