This isn't engineering. It's prayer.
AI agent development right now is a muddy hell.
The prompt that worked perfectly yesterday lies to your face today.
The framework labeled "cutting edge" will be called "legacy" next week, and its maintainers will disappear the month after.
And the more you stare at logs while debugging, the more you realize you're wasting your life guessing the LLM's "feelings" instead of checking your own logic.
Do you remember the web development scene around 2005, when developers screamed "Die, IE, die!"?
The chaos here makes that look cute.
For the lucky Gen Z devs whose biggest trauma is a React hydration error, let me explain the vibe:
- Your model is a moody teenager. It might run your logic, or it might decide to write a poem about strawberries instead. You don't know until you pay the API bill.
- Debugging is a gacha game. You pull the lever (run the prompt), spend $0.10, and hope you get a 5-star response (valid JSON). Usually, you get a 1-star trash item (a hallucination).
- Documentation is folklore. The only "docs" are a 6-hour-old tweet from the founder, a Discord channel where everyone is just screaming "LFG", or an indecipherable arXiv paper written by someone who has never left the lab.
The Definition of "Agent" Died Three Times
On November 30, 2022, ChatGPT appeared, and we all happily jumped into the fire.
2023: The Year of Pointless Optimization
We called them "ReAct." We felt like wizards tuning prompt chains.
We spent hundreds of hours fine-tuning chunk sizes for Vector DBs. We debated "RecursiveRetrieval" vs "ParentDocumentRetrieval" as if it were religious dogma.
We thought we were building assets. We were just building sandcastles that would be washed away by the next model release.
2024: The Year the Credit Card Died
Cline and Manus arrived. The definition changed from "answering a question" to "completing a task."
Great, right? No.
The agent would get stuck in a loop, fixing the same line of code for 4 hours while you slept.
You woke up not to a finished feature, but to a $500 OpenAI bill and a git repo in a detached HEAD state.
We realized "autonomy" often just meant "autonomous money burning."
2025: The Year "Legacy" Became a Daily Event
Computer Use and Claude Code pulled the trigger. Agents now control the browser and terminal directly.
RAG pipelines we built in 2023? Dead.
The custom tool definitions we agonized over in 2024? Dead.
I personally led a heavy RAG project. I spent months optimizing retrieval key-value stores. In one morning update, a model with a massive context window and direct filesystem access turned my life's work into "legacy bloat."
I watched my code rot in real-time. It wasn't just obsolete; it was embarrassing.
Now, top-runner agents rewrite the rules daily. The definition of an agent isn't something you learn anymore. It's something that gets forced upon you while you're trying to fix yesterday's breaking change.
The Reality on the Development Floor
While the industry burns outside, inside the office, we are drowning.
The Client's "Just make it like ChatGPT"
Clients see Sam Altman's tweet and say, "Can we just add this?"
They don't know that the "this" they want requires a complete re-architecture of the agentic loop we spent 3 months building.
The benchmark we targeted 6 months ago is now considered "dumb." You built a calculator; they now want a mathematician.
The "SOTA" Curse
Is the internal tool you built with LangChain in early 2024 still running?
Was that complex CrewAI multi-agent swarm actually better than a single well-prompted Claude 3.5 Sonnet call?
Be honest.
Most of our "cutting-edge" projects from 6 months ago are now just... awkward. Like a flip phone in the iPhone era.
The 95% Failure Rate
MIT researchers report that 95% of corporate generative-AI pilots fail.
We know why. It's not because the tech is bad. It's because by the time you deploy, the tech is gone.
We are building bridges while the river keeps moving 5 miles downstream every week.
The result is always the same: "Abandon" or "Rebuild." There is no "Maintain."
Prompts Bloat and No One Touches Them
There is also a technical hell.
The funny thing about prompts is that they rot faster than milk in the sun.
You add an edge case here, an XML tag there, and a "please don't hallucinate" prayer at the bottom.
Suddenly, your sleek system instruction looks like a conspiracy theorist's manifesto.
And the LLM? It starts acting like a tired intern on a Friday afternoon. You ask for JSON, it hands you a poem. You ask it to check the docs, it hallucinates a library that doesn't exist. It ignores your all-caps instructions because it got fascinated by a typo in line 400.
But the technical reality is serious. Context Rot.
Research shows the "Lost in the Middle" phenomenon is real. As token count increases, retrieval accuracy drops. The Inverse Scaling Law kicks in.
A 200k context window is meaningless if the effective attention span is only 8k. It is a fundamental trade-off between capacity and precision that no amount of prompt engineering can fully solve.
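You don't have to take the claim on faith; teams measure this with "needle in a haystack" probes. Here is a minimal sketch of one, assuming you supply your own `call_llm` function for whichever provider you use (the filler text, needle string, and function names are all illustrative, not any library's API):

```python
# Minimal "needle in a haystack" probe for context rot.
# `call_llm` is a placeholder: plug in whichever provider you actually use.

FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret deploy token is ALPHA-7."

def build_context(depth: float, total_chars: int = 2000) -> str:
    """Bury NEEDLE at `depth` (0.0 = start, 1.0 = end) inside filler text."""
    body = (FILLER * (total_chars // len(FILLER) + 1))[:total_chars]
    cut = int(len(body) * depth)
    return body[:cut] + NEEDLE + body[cut:]

def probe(call_llm, depths=(0.0, 0.25, 0.5, 0.75, 1.0)) -> dict:
    """Ask the model to retrieve the needle at each depth; record hit/miss."""
    results = {}
    for d in depths:
        prompt = build_context(d) + "\n\nWhat is the secret deploy token?"
        results[d] = "ALPHA-7" in call_llm(prompt)
    return results
```

Scale `total_chars` up toward your real context sizes and plot hits by depth. If the middle depths start failing first, you are watching "Lost in the Middle" happen to your own stack.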
And then? No one touches that prompt anymore. It becomes "The Sacred Text." It is legacy code, but worse—it's legacy code that costs money every time you run it.
Frameworks: A Bullet Train to the Graveyard
Even more troublesome is framework dependency.
Can you port an agent written in LangChain to Mastra?
No. You're not just porting code; you're rewriting your entire mental model of how an agent thinks. The prompt structure, tool definitions, and memory handling are all tightly coupled to the framework's opinion.
When you `uv add` an agent framework, you aren't just adding a dependency. You are buying a ticket on a high-speed train with no brakes.
You are committing to die with that framework.
And frameworks do die. Or worse, they "evolve."
How many of you felt a piece of your soul wither away during the migration from LangChain 0.1 to 0.2? We know exactly what "breaking change" means here. It means "rewrite everything or stay on the vulnerable version forever."
Accumulated assets—prompts, tool designs, agent configurations—lose their value the moment the framework dies.
But is that struggle just a developer's problem?
If you switch frameworks, the agent's behavior changes. If you rewrite the prompt, the tone of the response changes. Features that were usable yesterday disappear today due to "spec changes."
From the user's perspective, that might be a "deterioration" rather than an "improvement." Familiar operations suddenly change, and expected responses no longer return.
We developers call it "paying off technical debt." But for users, it's simply "it got harder to use."
Framework convenience, model convenience, development team convenience—everything is passed on to the user experience.
This is the moment when the hell of agent development spreads to the user.
Naturally, we know this risk, so we hesitate to make any move at all.
The risk of churn here is far greater than when building REST APIs on a web framework.
Choosing a framework is no longer a software decision. It's choosing which coffin is most comfortable to lie in.
Security? You Mean "RCE as a Service"?
And then there is the production environment. Multi-user. Security.
Let's be honest about what we are building. We are essentially building Remote Code Execution as a Service.
We are giving a probabilistic token generator—which we know hallucinates—the ability to execute shell commands, read files, and access the network.
In any other context, this would be a security nightmare. A critical vulnerability designed by us.
But in 2025, if you don't allow this, your agent is "dumb." If you restrict permissions, it can't do the job.
So you are forced to choose:
- Build a Crippled Agent: Safely useless.
- Build a Security Time Bomb: Use standard frameworks that run agents inside your app process. One prompt injection, and your server is gone.
- Become a Platform Team: Build custom sandboxes, gVisor clusters, and ephemeral VMs just to run a single logical loop.
"Just use a sandbox."
Easier said than done. Are you ready to maintain a complex infrastructure layer just to safely let your agent read a text file?
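Even before you reach for gVisor or ephemeral VMs, most teams end up writing at least a cheap gate in front of the shell. This is a sketch of that idea, not a sandbox: the allowlist, forbidden tokens, and `ToolCallDenied` name are all illustrative, and a determined injection can still find a path around string checks.

```python
# A minimal allowlist gate for agent tool calls. NOT a sandbox; just a
# first line of defense before a command ever reaches a shell.
import shlex

ALLOWED_BINARIES = {"ls", "cat", "grep"}   # read-only tools only
FORBIDDEN_TOKENS = {";", "&&", "|", ">", "<", "`", "$("}

class ToolCallDenied(Exception):
    pass

def vet_command(raw: str) -> list[str]:
    """Reject shell metacharacters and unlisted binaries before execution."""
    if any(tok in raw for tok in FORBIDDEN_TOKENS):
        raise ToolCallDenied(f"shell metacharacter in: {raw!r}")
    argv = shlex.split(raw)
    if not argv or argv[0] not in ALLOWED_BINARIES:
        raise ToolCallDenied(f"binary not allowlisted: {raw!r}")
    return argv  # safe to hand to subprocess.run(argv, shell=False)
```

Note what this buys you: the "crippled agent" option from the list above, with about twenty lines of maintenance. Anything stronger really does mean becoming a platform team.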
Is There a Way Out of Hell?
If you've read this far and your eye is twitching, you are not alone.
Welcome to the club. The coffee is stale, the Jira tickets are endless, and the hell is just getting started.
The problem isn't the technology. It's that we haven't agreed on the boundaries.
Remember why IE6 was a nightmare? Because Microsoft tried to bundle the browser with the web.
If you used their browser, you had to use their non-standard tags.
That is exactly what is happening today. Every LLM provider and framework author is trying to build their own "IE6" ecosystem.
We need to define the "HTML" of agents. We need Separation of Concerns before we all go insane:
- Model ≠ Agent: Your logic shouldn't break just because you switched from GPT-5 to Claude 4.5.
- Framework ≠ Agent: Your prompts and assets are yours. They shouldn't die just because LangChain released v2.0.
- App ≠ Agent: The agent is the worker. The app is just the office. Don't weld the worker to the desk.
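The first boundary can be drawn in a few lines today, without waiting for anyone's standard. A sketch, assuming nothing but the stdlib; `Model`, `agent_step`, and `EchoModel` are names I made up, and the stand-in adapter is where your real OpenAI or Anthropic client would go:

```python
# One way to draw the Model ≠ Agent line: the agent loop depends only on a
# tiny protocol, never on a vendor SDK.
from typing import Protocol

class Model(Protocol):
    def complete(self, prompt: str) -> str: ...

def agent_step(model: Model, goal: str, history: list[str]) -> str:
    """Agent logic owns the prompt and the loop, not the vendor."""
    prompt = f"Goal: {goal}\n" + "\n".join(history) + "\nNext action:"
    return model.complete(prompt)

# Swapping providers means swapping one adapter, not rewriting the agent:
class EchoModel:                      # stand-in for a real provider adapter
    def complete(self, prompt: str) -> str:
        return "echo: " + prompt.splitlines()[-1]
```

Switch GPT-5 for Claude 4.5 and only the adapter changes; `agent_step` never hears about it.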
We did this for the Web. HTML for structure, CSS for style, JS for logic.
Because we drew those lines, the ecosystem exploded.
Until we draw these lines for Agents, we are just building legacy code on a treadmill. And that is why we are in hell.
Where Are We Heading?
Honestly? The hell will continue until we stop accepting it.
It took 10 years for the Web to get standard HTML5. We are in year 3 of LLMs. We are barely at the starting line.
But history is clear on one thing: Lock-in always loses.
IE died. Flash died. Proprietary walled gardens eventually get demolished by Open Source bulldozers.
The winner isn't decided by Sam Altman or Dario Amodei. It is decided by us.
Every time you choose a framework, you are casting a vote.
Every time you contribute to an open standard, you are building the exit door from this hell.
The ecosystem is shaped by what we use, not what they sell.
So, what can we grassroots developers do?
Actually, we hold all the cards.
Build, Choose, or Speak Up.
If you Build: Isolate.
Write code that doesn't care which LLM it's talking to. Write prompts that don't care which framework runs them. Don't wait for a standard; be the one who creates the boundary.
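One concrete isolation tactic: keep prompts as plain data, rendered with the stdlib, so they outlive any framework migration. The file layout and names below are illustrative; the point is that no framework import appears anywhere near your prompt assets:

```python
# Prompts as plain data, framework-free. In practice this template would
# live in something like prompts/review.txt, versioned in git.
from string import Template

REVIEW_PROMPT = Template(
    "You are a code reviewer.\n"
    "Review the following diff and respond in JSON:\n"
    "$diff"
)

def render(diff: str) -> str:
    """Framework-agnostic: any runner can consume this string."""
    return REVIEW_PROMPT.substitute(diff=diff)
```

When the next LangChain-to-whatever migration hits, files like this are the assets you carry across; everything welded to a framework's prompt classes stays behind in the coffin.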
If you Choose: Resist.
Convenience is the bait. Lock-in is the hook. Don't bite.
Choose tools that let you leave. If a framework demands 100% of your soul, give it 0%.
If you Speak Up: Unite.
Your struggle is not unique. Share your "hell." The only thing proprietary vendors fear is a united developer community saying "No."
It is December 2025. I am still in hell.
But looking around, I see I am not burning alone.
The "Top Runners" are done with their releases for the year. They are probably heading to Hawaii.
We, on the other hand, have hallucinations to debug and breaking changes to fix.
Enjoy the holidays. Try to disconnect.
We have a massive amount of "legacy" code to rewrite in 2026.
See you in hell.
Your turn: What’s your AI agent hell story? Share it in the comments.