Isaac Hagoel

Stop Building AI Agents! Start Building Real AI Software Instead

Every true revolution has its detours. For AI, the promise is real: something fundamental has shifted. But as the latest Gartner AI Hype Cycle shows, our first big bet, the age of autonomous AI agents, turned out to be the wrong turn, at least for now.

Why did we believe so strongly? The vision was intoxicating: describe a goal, hand your agent some tools, and let it handle the mess. No more manual wiring, no more business logic - the agent would figure it out, freeing us from tedious problem solving. For a while, it felt within reach; releases like o1, Claude-3.5-Sonnet, and GPT-4.1 were genuine leaps over what came before, unlocking new, reliable real-world use-cases. But then, momentum started to stall. Each new model (o3, GPT-4.5, Grok4, GPT-5) was hyped as the breakthrough, the moment when agents would finally work. Benchmarks inched higher, expectations soared.

But real-world builders know: the fundamentals barely changed. Agents still hallucinate, lose context, and require endless handholding. When the long-anticipated GPT-5 landed and the leap still didn’t arrive, the industry finally had to admit: the breakthrough wasn’t coming. Now, we’re firmly in the “trough of disillusionment” - the part of the cycle where the dreams get recalibrated and real progress starts.
(Figure: The Gartner Hype Cycle)


Let me put it bluntly: if you still believe fully autonomous agents are here, show me a single agent working well in production outside of coding copilots. And even those coding agents are not really autonomous. They rely on constant back-and-forth with the user, who steers them and fixes their mistakes.

The revolution isn’t cancelled. It’s just moved. The way forward is not autonomy at all costs, but tight, explicit integrations where LLMs are used as powerful, controlled components, not left to run the show. The slope of enlightenment is right in front of us, but it starts with dropping the agent fantasy.

I’ve spent the past eight months in the trenches, shipping AI features in production, fighting with agent frameworks, and watching the same problems crop up again and again. Here’s the hard truth: agents sound good on conference slides, but in the real world they break, drift, and stall unless you babysit every step.

So what’s the alternative? If you want reliability, iteration speed, and real business value, it’s time to treat the LLM as plumbing, not the pilot.

The Allure of Agents & the Hype Machine

The agent gold rush wasn’t driven by developers alone. Three main actors shaped the frenzy, each with different bets and incentives. Framework makers, racing to build the glue and protocols for “autonomous” orchestration, were gambling on foundation model progress unlocking true autonomy. Hardware vendors and cloud platforms, from GPU makers to AI infrastructure startups, stood to profit from any paradigm that made AI workloads heavier and more ubiquitous. And then, of course, the foundation model companies themselves: OpenAI, Anthropic, Google, xAI, who fueled the optimism but quietly hedged their own bets, as we’ll see later.

This ecosystem of overlapping interests turned the agent vision from a technical hypothesis into an industry narrative. That narrative was powerful, intoxicating, and, for a time, felt inevitable.

The frameworks themselves grew more ambitious with each passing quarter. CrewAI, Swarms, LangGraph, and a host of other libraries and protocols sketched out a future where countless agents could collaborate in harmony: delegating, calling each other, and weaving together complex workflows. Vendors rolled out “research agents,” “autonomous computer use agents,” and more, hoping to lead the new stack. Demo videos showed swarms of agents reasoning their way through multi-step challenges, and slide decks promised a world where “set it and forget it” would finally apply to enterprise software. The sheer volume of articles, guides, and open source projects made it feel like this new order was just around the corner. In reality, none of these vendor-driven agents took off in a meaningful way. The user base rejected them.

Piston Engines, Paradigm Shifts, Benchmarks and AI Models

The story of today’s large language models closely parallels the golden age of piston engines in aviation. For decades, engineers genuinely believed that with bigger and more powerful piston engines and ever-better propeller designs, airplanes would keep getting faster - maybe even break the sound barrier. At first, every increment of horsepower seemed to open up new possibilities, but soon each leap delivered less and less. Eventually, the fundamental limits of the piston engine and propeller system became clear: pushing further would take a new paradigm. As historian Edward Constant puts it, “huge increases in engine horsepower were yielding smaller and smaller increases in speed” (grahamhoyland.com), and as late as the 1930s, many in the field still believed a breakthrough was just ahead (Britannica).

Yet, piston-engine planes were world-changing in their era - no one would call them a failure. They delivered most of what aviation needed for decades, just as LLMs now deliver astonishing, practical results across industries. We’re now at a similar point with LLMs. The transformer paradigm produced incredible breakthroughs, but simply scaling these models (even with reasoning) isn’t unlocking robust, general-purpose autonomy.

But Benchmarks...

Recent releases do bring new benchmark highs, but the leap forward for real-world use just isn’t materializing. Much of this is because benchmarks measure “in-distribution” skills: tasks that are close to what the models have already seen. LLMs shine here. But step out of that comfort zone toward unfamiliar, more complex, or genuinely multi-step work and the cracks show: hallucinations, brittle context, unreliable reasoning.

Researchers also noticed this benchmark disconnect. As this paper puts it:

The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability.

Another recent paper points to systemic limitations in the current benchmarking paradigm.


None of this diminishes what we have in front of us. The improvements we still see in speed, cost and marginal capabilities are real and worth celebrating. But the nature of our tools is now clear and accepting their boundaries is what will finally allow us to build the next generation of AI software, rather than waiting for paradigm shifts that might take many years to arrive.

How Agents Fail in Practice

What is an “AI agent”? In the context of LLMs, an agent is a system that uses an LLM to autonomously plan and execute multi-step tasks: breaking goals into actions, deciding what to do next, choosing tools or APIs, and iterating—ideally without human intervention (Anthropic).

Why do they fail in the real world?

  • Loss of control and premature exit: Agents often “think” they’ve finished before all parts of a task are truly complete. They may exit prematurely, get stuck in loops, or miss obvious next steps - especially as tasks get longer, more complex, or open-ended.
  • Strict complexity limits: As the number of instructions, tools, or task history grows, performance and reliability degrade sharply.
  • Hallucinated actions and broken chains: Agents frequently invent tool calls, take invalid actions, or misinterpret what’s needed—resulting in broken workflows or failure to deliver usable results.
  • Cascading errors: A mistake in one step can cause a chain reaction, with no robust recovery. The system can veer off course, miss the goal, or require manual reset.
  • Poor context management: Agents can’t reliably hold or use all relevant context as tasks grow longer, causing confusion, forgotten requirements, or inconsistent decisions.
  • Opacity and lack of debuggability: It’s often unclear why an agent did what it did. Debugging and reproducing failures is notoriously hard.
  • Human-in-the-loop dependence: For any non-trivial task, a human must step in to guide, correct, or “babysit” the agent. True autonomy almost never survives outside of narrow, simple demos.

And about those demos: Most agent demos are meticulously iterated until they perform a single showcase scenario perfectly. It’s easy to make a video of an agent executing a specific, well-groomed task. What’s hard and still unsolved is getting robust, reliable performance in the messy, unpredictable real world.

Remembering What Software Is Actually About

Software engineering is about taming complexity by breaking big problems into smaller, understandable parts. Each part is explicitly defined, with clear interfaces and predictable outcomes. We use modularization and encapsulation to create boundaries, making it possible to test, reason about, and improve each part independently.

Good software makes intervention points clear. You can trace data as it flows through the system, observe where decisions are made, and know exactly where to apply a fix or add new logic. Branching paths and possible system states are explicit—not left to be inferred from opaque behavior.

With strong boundaries and transparency, you maintain control as your system grows. This discipline is what makes it possible to build reliable, scalable software - no matter how ambitious the project.

Workflows, APIs, and the Real Path Forward

If you work on AI applications, you’ve heard "agentic workflows" described as a lesser, “un-evolved” version of agents - maybe even a stopgap until agents get smart enough. The Anthropic team and others have framed agents as the natural next step, capable of solving more open-ended or complex tasks than traditional, stepwise workflows.

But that view is backwards. In real-world scenarios, workflows can be dramatically more powerful, reliable, and extensible than agents. A well-designed workflow gives you granular control and intervention points at every boundary. Each state and branch is mapped, explicit, and testable. Workflows enable visibility, debuggability, and precise engineering at scale.

This pattern isn’t new. For decades, software has relied on orchestrating external APIs and services. From the application’s perspective, these are “magic” - but they always speak in contracts, return well formed responses, and fit precisely into the broader workflow. Why not treat LLMs in exactly the same way? Give the model a narrow, well-scoped job, validate the output, and plug it into your workflow like any other high-powered component.

We’re lucky that API model makers like OpenAI, Anthropic, and Google have quietly given us the tools we need for this approach. Their APIs provide ever-improving structured output modes with type and schema enforcement, temperature and randomness controls, and even regex-based constraints on outputs (OpenAI Structured Outputs). These features let us treat LLMs as reliable, predictable, and tightly controlled components - like any other production API.
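To make that concrete, here is a minimal sketch using the OpenAI Python SDK with a Pydantic model as the enforced output schema. The model name, fields, and prompts are illustrative assumptions, not a recommendation:

```python
# Minimal sketch: the LLM is a typed, schema-enforced component, nothing more.
# Model name, fields, and prompts are illustrative.
from openai import OpenAI
from pydantic import BaseModel


class TripRequest(BaseModel):
    destination: str
    arrival_date: str         # ISO date, e.g. "2024-05-22"
    latest_arrival_time: str  # e.g. "12:00"; empty string if the user gave no deadline
    nights: int


client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # any structured-output-capable model
    messages=[
        {"role": "system", "content": "Extract the trip parameters from the user's request."},
        {"role": "user", "content": "Book me a flight to Berlin on May 22nd, arriving before noon, and a hotel for two nights."},
    ],
    response_format=TripRequest,  # vendor-side schema enforcement
)

trip = completion.choices[0].message.parsed  # a validated TripRequest instance
print(trip.destination, trip.nights)
```

The result is either a validated `TripRequest` or an explicit error - never a half-formed blob of prose you have to parse defensively.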

Agents (single, or a group of them talking to each other however they please) promise more adaptability but fail to deliver. They sacrifice the very things that make production software good.
Our goal shouldn't be chasing autonomy for its own sake, but delivering high-quality, never-before-possible apps and features to our users (who, by the way, couldn't care less about AI).

Workflow vs. Agent: The Flight and Hotel Booking Test

Let’s use the familiar (though contrived) “Book a flight and hotel” scenario—often showcased as the ultimate test for agent frameworks. The examples below are simplified for educational purposes.

Agent Approach

Setup:

  • Tools provided:

    • search_flights(origin, destination, date, arrival_time)
    • book_flight(flight_id, passenger_info)
    • search_hotels(city, checkin_date, checkout_date, near)
    • book_hotel(hotel_id, guest_info)
    • send_email(recipient, content)
    • say_to_user(message)
    • planning tools like add_todo, update_todo, read_todo
  • Prompt:


    A long, detailed instruction set describing the user goal, each tool and its parameters, usage rules, and task-specific edge cases. Typically thousands of tokens just for context.

In theory:

The agent receives the user goal (“Book me a flight to Berlin on May 22nd, arriving before noon, and a hotel for two nights near the conference venue”), then plans and executes—deciding which tools to call, in what order, and how to handle ambiguity, branching or errors.

What actually happens:

  • The agent may exit after booking a flight, skipping the hotel, or vice versa.
  • It often ignores critical constraints (“arrive before noon”), or invents data not in the API responses.
  • It can take destructive actions like booking the wrong flight.
  • With each added tool or requirement, prompt and tool complexity grows, and error rates rise.
  • Debugging or extending the workflow means rewriting prompts, retraining, adding brittle heuristics, or begging the agent to do the right thing.

The A-ha Moment

But wait, when you look closely, booking a trip, like most business tasks, doesn’t actually require AGI-level flexibility or deep autonomy. It’s a highly structured problem, following a predictable set of steps towards a successful completion. The apparent complexity comes from edge cases and details, not from open-ended reasoning. When you break it down, almost every aspect can be handled with clear logic, explicit checks, and a series of well-defined handoffs. The “magic” is in the composition, and the need for intelligence can be limited within clear boundaries.

Workflow/LLM Integration Approach

This approach is about code-first orchestration. LLMs are used only for well-defined, bounded tasks and for focused decision making, never for orchestration.

Example Workflow

  1. Parse user input (only if using a chat interface, which is often unnecessary)
    • Code calls LLM with a schema-enforced prompt:
        {
          "result": {
            "type": "success",
            "parsedParams": {
              "destination": "Berlin",
              "arrival_date": "2024-05-22",
              "arrival_time": "09:30",
              "nights": 2,
              "venue_address": "Berlin Congress Center"
            } 
          }
        }
        // or (via Zod union type or Pydantic)
        {
          "result": {
            "type": "missing_info",
            "feedbackToUser": "What city should I book the hotel in?"
          }
        }
  • Vendor-side schema validation (e.g. OpenAI's structured outputs) guarantees outputs are always structured and complete.
  2. Branch: Request missing info or continue
    • Code inspects the response. If we got "missing_info", prompt the user with the feedback; when the user responds, feed the full exchange into the first LLM again. If "success", move on to search.
  3. Prepare search parameters & call APIs
    • Code builds params for booking APIs (both for exact matches and for alternate/flexible options if needed).
    • Code, not the LLM, calls these APIs in parallel and collects results. No hallucinations possible.
  4. LLM-powered ranking (tightly scoped)
    • Code sends the original query and API results to an LLM.
    • LLM returns a sorted list of candidate options, with justifications, in a strict schema (notice that the output only contains ids, even though we provided the full details as input - no need for the LLM to repeat information we already have):
        [
          {
            "id": "LH1234",
            "rank": 1,
            "explanation": "Arrives before noon, direct flight, good price."
          },
          {
            "id": "LH5678",
            "rank": 2,
            "explanation": "Slightly later arrival, lower price."
          }
        ]
  • IDs are validated against the real API results. Again - no room for hallucinated data.
  5. Present the best options to the user
    • Present the results to the user
    • Branch: if the user is unhappy and provides additional requirements, we go back to the first LLM with the full exchange
    • Once the user selects an option, we book the flight - this can be done by code now or later
  6. Booking a hotel
    • Repeat a similar process, but use the LLM to smartly select a city, a location within the city, and search filters based on input from the user or information we have about them (e.g. past preferences). A condensed code sketch of the full flow follows below.
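Here is that condensed sketch, in Python with Pydantic. The booking APIs and UI helpers (`search_flights_api`, `ask_user`, `present_options`, `book_flight_api`), the prompts, and the model name are hypothetical placeholders; the point is that plain code drives every step and the LLM only fills narrow, schema-validated gaps:

```python
# Condensed sketch of the code-first workflow. search_flights_api, ask_user,
# present_options and book_flight_api are hypothetical placeholders for your
# own services/UI; prompts and model name are illustrative.
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()


class ParsedParams(BaseModel):
    type: Literal["success"]
    destination: str
    arrival_date: str
    arrival_time: str
    nights: int
    venue_address: str


class MissingInfo(BaseModel):
    type: Literal["missing_info"]
    feedback_to_user: str


class ParseResult(BaseModel):
    result: ParsedParams | MissingInfo  # the union shown in step 1


class RankedOption(BaseModel):
    id: str
    rank: int
    explanation: str


class Ranking(BaseModel):
    options: list[RankedOption]


def parse_request(conversation: list[dict]) -> ParseResult:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "system", "content": "Extract flight and hotel parameters."}, *conversation],
        response_format=ParseResult,
    )
    return completion.choices[0].message.parsed


def rank_flights(query: str, flights: list[dict]) -> list[RankedOption]:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {"role": "system", "content": "Rank these flights against the user's constraints. Return ids only."},
            {"role": "user", "content": f"Request: {query}\nOptions: {flights}"},
        ],
        response_format=Ranking,
    )
    valid_ids = {f["id"] for f in flights}
    return [r for r in completion.choices[0].message.parsed.options if r.id in valid_ids]  # drop hallucinated ids


def book_trip(conversation: list[dict]) -> None:
    parsed = parse_request(conversation).result
    if parsed.type == "missing_info":
        answer = ask_user(parsed.feedback_to_user)  # placeholder UI call
        return book_trip(conversation + [{"role": "user", "content": answer}])

    flights = search_flights_api(parsed.destination, parsed.arrival_date)  # plain API call, made by code
    ranked = rank_flights(conversation[-1]["content"], flights)
    choice = present_options(ranked)  # placeholder UI call; the user picks an option
    book_flight_api(choice.id)        # deterministic booking, in code
    # the hotel search and booking follow the same pattern
```

Every branch here is ordinary code you can unit test, log, and step through in a debugger; the two LLM calls are just typed functions.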

Key Points:

  • LLMs do not orchestrate—they handle bounded parsing or ranking jobs, always with explicit, code-enforced contracts.
  • All critical state and business logic stays in code—clear, testable, and maintainable.
  • Every failure point is observable and recoverable.
  • Each LLM call can have an extremely detailed, targeted prompt and make efficient use of its context window.

Contrast with agentic frameworks:

Many “agentic workflow” frameworks (like LangGraph) promote chaining LLM calls, connecting nodes in a graph, and using prompt chaining utilities as the main pattern. Even when the LLM isn’t making every decision, the framework subtly encourages LLM-centric designs that don't look like good old procedural logic. In other words, most current frameworks push you toward agentic complexity. The truth is that if you follow the code-first pattern above, you may find you need very little of the extra scaffolding or abstractions these frameworks offer.

Caveats & Legitimate Exceptions

This isn’t dogma - the programming model I described above allows tightly-scoped "agentic" loops, retries with feedback, and so on.

1. Tightly Scoped Feedback Loops (scoped decision making):

When generating artifacts like SQL queries with an LLM, the code executes the query. If it fails, the error (or even the successful results) is sent back to the LLM for revision or validation, with a limited number of retries. The query results can then be handed off to code or another LLM for processing. These loops are bounded, schema-driven, and transparent. They’re pragmatic error recovery, not open-ended “agentic” wandering.
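As a sketch (assuming hypothetical `generate_sql` and `run_query` helpers that wrap your own LLM call and database layer), the whole "loop" is just a for loop owned by code:

```python
# Bounded, code-controlled feedback loop for LLM-generated SQL.
# generate_sql() and run_query() are hypothetical wrappers around your own
# schema-constrained LLM call and database layer.
MAX_RETRIES = 3


def answer_with_sql(question: str):
    feedback = None
    for _ in range(MAX_RETRIES):
        sql = generate_sql(question, previous_error=feedback)  # narrow, schema-constrained LLM call
        try:
            return run_query(sql)  # code executes the query, never the LLM
        except Exception as err:
            feedback = str(err)    # feed the error back; the retry budget is explicit
    raise RuntimeError(f"Could not produce a valid query after {MAX_RETRIES} attempts")
```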

2. Multi-Step Reasoning Without Agentic Loops:

Some tasks, like analyzing a massive file or gradually tracing logic, seem to call for agents because they can't be done effectively in a single pass. But the new generation of reasoning models (think o3, Deepseek R1) can often handle such complexity internally, in a single, well-structured prompt. The LLM gets everything it needs up front and processes as much as it wants via multiple reasoning steps, until it is ready to return a single output that your code can validate and act on, no agentic looping required.
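In other words, the "multi-step" part happens inside a single call. A rough sketch follows; the model name, the output fields, and the `full_logs` / `relevant_source` inputs are assumptions for illustration:

```python
# One call to a reasoning model: all context up front, one structured answer back.
# full_logs and relevant_source are placeholders for data your code already has.
from typing import Literal

from openai import OpenAI
from pydantic import BaseModel


class TraceVerdict(BaseModel):
    root_cause: str
    affected_files: list[str]
    confidence: Literal["low", "medium", "high"]


client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="o3",  # a reasoning model; its internal steps replace the agent loop
    messages=[
        {"role": "system", "content": "Trace the failure through the attached logs and source code."},
        {"role": "user", "content": full_logs + "\n\n" + relevant_source},  # everything up front
    ],
    response_format=TraceVerdict,
)

verdict = completion.choices[0].message.parsed  # code validates and acts on this, no loop needed
```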

3. Tools / Function Calling:
Yes, in some very complex scenarios you will have to pass tools and let the LLM decide on the exact calls to make before returning the desired output. It's unlikely these scenarios exist in your app (😉), but if you can't avoid it, remember to keep the agent in question as small and focused as possible and to minimise the number of tools you give it.
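If you do end up there, a deliberately tiny setup might look like the sketch below: one model call, one tool, and your code stays in charge of executing it (the tool definition and the `search_flights_api` handler are illustrative placeholders):

```python
# Tightly scoped tool calling: one tool, one call, and code executes the result.
# The tool definition and search_flights_api handler are illustrative placeholders.
import json

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_flights",
        "description": "Search flights for a destination and date.",
        "parameters": {
            "type": "object",
            "properties": {
                "destination": {"type": "string"},
                "date": {"type": "string", "description": "ISO date"},
            },
            "required": ["destination", "date"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Find flights to Berlin on 2024-05-22."}],
    tools=tools,
)

for call in response.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    results = search_flights_api(**args)  # your code runs the tool, not the model
```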


Summary:

Don’t avoid feedback loops or multi-step reasoning. Just keep them bounded, schema-driven, and under the control of explicit code.

Wrapping Up (and What’s Next)

The AI revolution is real, but building robust, production-ready AI software means letting go of the agent fantasy and returning to the fundamentals that made software engineering great. Tightly controlled, code-centric integrations win every time.

There’s still a missing piece: the right tools, patterns, and principles for this new paradigm barely exist outside the heads of a few experts. The current ecosystem is immature, and most frameworks still push us toward agentic complexity. Defining and building a real code-first stack, and a shared set of “AI software engineering” principles, will be the next challenge for our community.

I’ll be sharing more practical patterns, tooling ideas, and principles in upcoming posts. If you’re building in this space, let’s connect. Disagree with me? Seen agents work in the wild? Share your stories in the comments or reach out.


Top comments (1)

Robin Papa

I think you're onto something. Just like you, I believe that LLMs as part of workflow patterns can be useful for interpretation of less deterministic input.

But maybe, at some point, those more deterministic workflows will be used by agents, because they can interpret the data points of the real world faster than we can, perhaps.

Or maybe agents just need more digital twins, who knows. Great write-up nonetheless.