
Arleen Kaur

Posted on • Originally published at linksft.com

Why Your AI Agents Are Failing: The Routing Problem Nobody Is Solving

AI Disclosure: This post was written with AI assistance and has been reviewed and approved for publication by the Linksoft Technologies team.

Everyone's racing to deploy AI agents. Speed creates the illusion of progress, but it doesn't guarantee advantage. The real cost shows up later — in how the system behaves under load.

98% of enterprises are running AI in some form.
83% say cost is a top priority but aren't solving it architecturally.
$500B is the annual gap between AI infrastructure spend and realized revenue.

Read those three numbers together. Almost every enterprise is running AI. Most say cost efficiency is a top priority. And almost none have built the AI agent architecture layer that would actually solve it. That's the defining infrastructure gap of this moment.

The conversation in most strategy decks is still stuck in the wrong place: which model to pick, which vendor to trust, build or buy. Surface-level. Symptom-chasing. Completely missing the structural problem underneath.

Companies running AI at real scale aren't running better models. They're running better systems around models. That's the difference most teams still miss, and it usually shows up in the budget later.

The Instinct That's Costing You

When organizations get serious about AI, the instinct makes sense. Use the most capable model available. It reasons best, handles ambiguity best, writes best. So you build your first agent on GPT-4 or Claude Opus or whatever tops the benchmark table and it works. Impressively, even.

Then you try to scale it. That's where the math gets uncomfortable.

Large frontier models are built for complexity. But most tasks in any real-world AI pipeline aren't complex. They're repetitive, narrow, and structurally simple. When you route everything through a hundred-billion-parameter model, you're paying for capability you don't need, latency you don't want, and token costs that scale linearly with volume.

What your pipeline actually looks like:

Classification: no genuine complexity, but routed to frontier models by default
Extraction: no genuine complexity, but routed to frontier models by default
Routing decisions: no genuine complexity, but routed to frontier models by default
Multi-step synthesis: moderate complexity
Ambiguous reasoning: high complexity, appropriately routed to frontier models

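Google Research's work on Switch Transformers reported up to 7x faster pre-training with the same compute by routing each token to a single expert sub-network instead of through the whole model. That is routing inside one model rather than between models, but it shows the gains from selective compute aren't theoretical. The question is whether your orchestration layer is built to capture them.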

Sequoia Capital's analysis points to a $500B annual revenue gap where infrastructure investment dramatically exceeds realized returns. Getting model routing wrong isn't just an efficiency concern. At scale, it turns into a margin problem.

The Architecture Is the Problem

The default approach produces a flat pipeline: one input, one large model, one output, repeat. No routing. No complexity awareness. Every task treated identically regardless of what it needs.

In a proof of concept this works fine. At scale, the cost problem stops being abstract, and by then the architecture is already too embedded to change easily.

Architecture comparison:

Flat architecture: one frontier model for everything, costs that scale linearly, and budget spikes that stay invisible until scale
Tiered orchestration: model matched to task complexity, 2 to 7 times lower per-task cost on routine work, and routing errors that are observable and correctable
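
To make the comparison concrete, here is a back-of-the-envelope sketch. Every number in it (per-task prices, task mix, monthly volume) is an illustrative assumption rather than a figure from any vendor; the small model is assumed to be 5x cheaper per task, which sits inside the 2 to 7 times range above.

```python
# Illustrative only: all prices, shares, and volumes are made-up assumptions.
FRONTIER_COST_PER_TASK = 0.030  # hypothetical dollars per task, frontier model
SMALL_COST_PER_TASK = 0.006     # hypothetical dollars per task, small model (5x cheaper)


def flat_cost(tasks: int) -> float:
    """Flat architecture: every task goes to the frontier model."""
    return tasks * FRONTIER_COST_PER_TASK


def tiered_cost(tasks: int, routine_share: float) -> float:
    """Tiered architecture: routine tasks go to the small model, the rest to frontier."""
    routine_tasks = tasks * routine_share
    complex_tasks = tasks - routine_tasks
    return (routine_tasks * SMALL_COST_PER_TASK
            + complex_tasks * FRONTIER_COST_PER_TASK)


monthly_tasks = 1_000_000
flat = flat_cost(monthly_tasks)
tiered = tiered_cost(monthly_tasks, routine_share=0.8)
print(f"flat: ${flat:,.0f}  tiered: ${tiered:,.0f}  ratio: {flat / tiered:.1f}x")
```

With these assumed numbers, the flat pipeline costs roughly 2.8 times as much as the tiered one. The ratio depends almost entirely on how much of the workload is routine, which is why measuring the task mix comes before choosing models.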

The pilot looks fine. Production is where things start to break, and that's the trap most scaling teams walk into.

What Is Model Routing in AI and Why Does It Matter?

Model routing is the orchestration layer that decides which AI model handles which task. It sends complex, ambiguous requests to large frontier models and simple, repetitive ones to smaller, faster, cheaper models.

Without it, every task gets routed to the same model regardless of what it actually needs. You pay frontier-model prices for work that a model costing a fraction as much could handle equally well.

At scale, that's not an efficiency gap. It's a margin problem. Model routing is what closes it by matching compute to complexity the same way a hospital matches patient complexity to the right tier of care, rather than routing every case to the senior specialist.
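
In its simplest form, the routing decision is a lookup on the task type. The sketch below is a minimal illustration, not a production design: the model identifiers and the set of "routine" task kinds are hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your stack actually uses.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

# Assumed set of task kinds that count as routine in this sketch.
ROUTINE_KINDS = {"classification", "extraction", "lookup"}


@dataclass
class Task:
    kind: str
    payload: str


def route(task: Task) -> str:
    """Return the identifier of the model that should handle the task."""
    if task.kind in ROUTINE_KINDS:
        return SMALL_MODEL
    # Anything unrecognized is treated as complex, so the failure mode
    # is overspending rather than a low-quality answer.
    return FRONTIER_MODEL
```

Defaulting unknown task kinds to the frontier model is a deliberate choice here: a misroute upward costs money, while a misroute downward costs quality.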

What the Fix Actually Looks Like

Think of it like triage in a hospital. You don't route every patient with a minor injury to your most senior specialist. You have a system that matches people to the right level of care, reserving specialist time for cases where their expertise is genuinely irreplaceable.

Your large model's compute is the specialist's time. The orchestration layer is the triage system. Without it, you have queues, waste, and costs that become unsustainable at scale.

How a tiered router works, step by step:

Step 1 Task intake and feature extraction: near-zero cost
Step 2 Complexity classification: uses a lightweight model or rules layer
Step 3 Routing decision: sends routine tasks to small models and complex tasks to frontier models
Step 4 Bounded, auditable action: full observability
Step 5 Feedback loop and recalibration: makes it a self-improving system
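
The five steps can be sketched in a few dozen lines. This is a toy under stated assumptions: complexity is approximated by word count, the model names are hypothetical, the actual model call in step 4 is omitted (only the audit record is kept), and recalibration is reduced to tightening a single threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RoutingRecord:
    task_id: str
    words: int
    complexity: str
    model: str
    corrected_to: Optional[str] = None  # set later by the feedback step


class TieredRouter:
    def __init__(self, word_threshold: int = 50):
        # Assumed rule: tasks longer than the threshold count as complex.
        self.word_threshold = word_threshold
        self.audit_log = []

    def handle(self, task_id: str, text: str) -> RoutingRecord:
        # Step 1: intake and cheap feature extraction.
        words = len(text.split())
        # Step 2: complexity classification with a rules layer.
        complexity = "complex" if words > self.word_threshold else "routine"
        # Step 3: routing decision.
        model = "frontier-model" if complexity == "complex" else "small-model"
        # Step 4: record the decision so it can be audited later.
        record = RoutingRecord(task_id, words, complexity, model)
        self.audit_log.append(record)
        return record

    def feedback(self, task_id: str, correct_model: str) -> None:
        # Step 5: store the correction and recalibrate the threshold.
        for record in self.audit_log:
            if record.task_id != task_id:
                continue
            record.corrected_to = correct_model
            if record.model == "small-model" and correct_model == "frontier-model":
                # Lower the threshold so tasks of this length escalate next time.
                self.word_threshold = min(self.word_threshold, record.words - 1)
```

A real classifier would use richer features than length, but the shape stays the same: every decision is logged, and every correction feeds back into the rule that made the decision.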

"The key isn't just about choosing the cheapest option, but about finding the right recipe of tools and services that aligns with your workload patterns."
-- Google Cloud

How to Design Efficient AI Agent Architectures for Enterprises

Efficient enterprise AI agent architecture is built in tiers:

Tier 1 Lightweight model: Handles narrow, high-volume, structurally simple tasks
Tier 2 Mid-tier model: Handles moderate reasoning and mixed-complexity requests
Tier 3 Frontier model: Reserved for genuinely complex or high-stakes cases only

Each tier has defined cost, latency, and quality thresholds. On top of this sits an observability layer that tracks which tasks are going where, at what cost, and with what outcomes, so routing decisions can be continuously calibrated rather than set once and forgotten.
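
One way to make those thresholds explicit is to store them as data next to each tier and check observed values against them. The sketch below assumes hypothetical model names and threshold values; none of the numbers come from a real deployment.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    name: str
    model: str           # hypothetical model identifier
    max_cost_usd: float  # per-task cost ceiling
    max_latency_ms: int  # per-task latency ceiling
    min_quality: float   # minimum acceptable quality score, 0 to 1


# Illustrative values only.
TIERS = [
    Tier("tier-1", "lightweight-model", 0.002, 300, 0.90),
    Tier("tier-2", "mid-tier-model", 0.010, 1500, 0.92),
    Tier("tier-3", "frontier-model", 0.050, 8000, 0.95),
]


def violations(tier: Tier, cost_usd: float, latency_ms: int,
               quality: float) -> List[str]:
    """Return which of the tier's thresholds an observed task breached."""
    breached = []
    if cost_usd > tier.max_cost_usd:
        breached.append("cost")
    if latency_ms > tier.max_latency_ms:
        breached.append("latency")
    if quality < tier.min_quality:
        breached.append("quality")
    return breached
```

Feeding the output of `violations` into the observability layer gives the calibration loop something concrete to act on: a tier that keeps breaching its quality floor is a sign that tasks are being routed too low.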

The organizations that reduce AI agent orchestration costs at scale aren't running better models. They're running better systems around models, with architecture that matches spend to need at every step.

Why Most Teams Haven't Built This Yet

There are really two reasons, and neither has anything to do with a lack of skill.

Reason 1 The early pain isn't visible.
When you're running a proof of concept, the cost difference between a large model and a small one feels abstract. It only becomes obvious at scale, when the budget impact is undeniable and the system is already too embedded to change easily.

Reason 2 Tiered orchestration is genuinely harder to build.
A single model pointed at a task is simple. An orchestration layer that correctly classifies tasks, routes them, handles edge cases, and maintains consistency across multiple models is a serious systems problem. It's the kind that takes six to eighteen months to build properly.

What looks fine early versus what breaks at scale, across five barriers:

Invisible cost: budget spikes during scaling
Single-model setup: high cost per task
Orchestration complexity: unavoidable and expensive to retrofit
Infrastructure gap: fragile, unscalable systems
Late realization: expensive re-architecture with embedded dependencies

The Agent Reality Check

Let's be direct: the hype cycle has significantly outpaced the deployment reality. Most of what organizations have built and called "agents" are, on close inspection, sophisticated chatbots with tool access bolted on. They fail in three specific, predictable ways, and all three are architectural problems, not model quality problems.

This is precisely why now is the right moment to pivot. The infrastructure, including Kubernetes, LangGraph, sandboxed execution environments, and proper observability tooling, exists and is maturing. Companies that start building now will be early or mid-stage adopters, not laggards doing emergency re-architecture two years from now.

The three agent failure modes:

Failure 01 Hallucination at decision points: agents hallucinate most where confidence should be lowest
Failure 02 State collapse across steps: a misread variable in step three produces a wrong output in step seven, with no observable state management
Failure 03 The observability gap nobody owns: feedback loops exist on paper but never close in production

NVIDIA defines agentic systems as "autonomous, long-running agents that reason, plan and act across complex, multi-step workflows," a definition that highlights how far most current implementations still have to go. This isn't a reason to pull back but a signal to treat this like a real systems problem.

What You Should Actually Be Tracking

Tracking the right metrics requires an AI oversight framework that connects routing decisions to business outcomes, not just benchmark scores.

Most AI business cases get approved on model performance benchmarks, which is the wrong number to optimize for. The real cost, including container orchestration, workflow state management, sandboxed execution, observability tooling, and routing model maintenance, rarely makes it into the same deck. So the ROI gap isn't surprising. The real cost was never fully accounted for in the first place.

McKinsey estimates generative AI could add $2.6T to $4.4T annually to the global economy, with total productivity impact reaching $7.9T. The cost of getting system design wrong will scale right alongside the opportunity, not independently of it.

Three metrics worth tracking instead of benchmark scores:

Cost per automated task: should decline as volume grows. Flat or rising cost signals wrong-tier routing
Routing accuracy rate: target above 92% of tasks correctly classified by complexity. Mis-routing routine tasks to frontier models is where budget leaks
Escalation override rate: target below 8% of auto-routed decisions manually corrected. A high rate signals the routing model needs recalibration, not more reviewers
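
All three metrics fall out of a per-task log. The sketch below assumes a hypothetical log record in which `correct_tier` comes from a labelled review sample; how that label is produced is outside the scope of the example. The 92% and 8% defaults are the targets stated above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskLog:
    cost_usd: float
    routed_tier: str
    correct_tier: str  # assumed to come from a labelled review sample
    overridden: bool   # True if a human manually corrected the routing


def routing_metrics(logs: List[TaskLog]) -> Dict[str, float]:
    """Compute cost per task, routing accuracy, and override rate."""
    if not logs:
        raise ValueError("no task logs to summarize")
    n = len(logs)
    return {
        "cost_per_task": sum(log.cost_usd for log in logs) / n,
        "routing_accuracy": sum(log.routed_tier == log.correct_tier for log in logs) / n,
        "override_rate": sum(log.overridden for log in logs) / n,
    }


def needs_recalibration(metrics: Dict[str, float],
                        min_accuracy: float = 0.92,
                        max_override_rate: float = 0.08) -> bool:
    """Flag the routing model when either target is missed."""
    return (metrics["routing_accuracy"] < min_accuracy
            or metrics["override_rate"] > max_override_rate)
```

Cost per task is the one metric without a fixed target: it is meant to be compared against its own history, so it is worth computing per reporting period rather than once.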

Q&A: What Engineering and Architecture Teams Actually Ask

What's the difference between model routing and prompt routing?
Prompt routing selects between different prompts or instructions for the same model. Model routing selects between different models entirely based on task complexity. The distinction matters at scale: prompt routing doesn't reduce compute costs because you're still running the same model. Model routing does, by matching task complexity to appropriately sized infrastructure.
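
The difference is easy to see side by side. In this minimal sketch, the prompt templates and model names are hypothetical placeholders.

```python
# Hypothetical prompt templates and model names, for illustration only.
PROMPTS = {
    "summarize": "Summarize the following text:\n{text}",
    "extract": "Extract the invoice number from:\n{text}",
}
MODELS = {"routine": "small-model", "complex": "frontier-model"}


def prompt_route(intent: str, text: str) -> dict:
    """Prompt routing: the prompt varies with intent, the model never changes."""
    return {"model": "frontier-model",
            "prompt": PROMPTS[intent].format(text=text)}


def model_route(complexity: str, intent: str, text: str) -> dict:
    """Model routing: the model varies with task complexity."""
    return {"model": MODELS[complexity],
            "prompt": PROMPTS[intent].format(text=text)}
```

The two calls can produce an identical prompt for the same extraction task, but only `model_route` can send it to the cheaper model.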

How do you classify task complexity reliably enough to route it?
Start with a lightweight classification model, often a fine-tuned smaller model trained on your own task distribution. The classification step itself costs almost nothing relative to the savings from correct routing. Track misroutes (tasks sent to the wrong tier) the same way you'd track model errors: as a calibration signal, not a failure.

What happens when a task is misclassified and routed to the wrong tier?
A task routed down (sent to a smaller model than it needs) produces a lower-quality output, detectable via output scoring or human review flags. A task routed up (sent to a larger model than needed) just costs more than necessary. Build fallback logic: if the lower-tier model's confidence score falls below a threshold, escalate automatically.
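
A minimal sketch of that fallback follows. It assumes each model is a callable returning an output and a confidence score between 0 and 1; how the confidence is obtained is left open, and the 0.8 threshold is an arbitrary placeholder. The two `fake_*` functions are stand-ins so the example runs without any external API.

```python
from typing import Callable, Tuple

# A model here is any callable returning (output, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def run_with_escalation(task: str, small: Model, frontier: Model,
                        min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try the small model first; escalate if its confidence is too low."""
    output, confidence = small(task)
    if confidence >= min_confidence:
        return output, "small"
    # Low confidence: discard the small model's output and escalate.
    output, _ = frontier(task)
    return output, "frontier"


def fake_small(task: str) -> Tuple[str, float]:
    # Stand-in: pretends to be confident only on short tasks.
    confidence = 0.95 if len(task) < 40 else 0.40
    return "small answer", confidence


def fake_frontier(task: str) -> Tuple[str, float]:
    # Stand-in for a frontier model call.
    return "frontier answer", 0.99
```

Escalated tasks pay for both calls, so the escalation rate itself is worth logging: if it climbs, the classifier upstream is sending too much work to the lower tier.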

Does tiered routing work for LLM-based agents, or just classification tasks?
It works for both. For agents, the routing decision happens at the task-dispatch layer before any tool calls are made. Simple deterministic sub-tasks like formatting, extraction, and lookup go to lightweight models. Multi-step reasoning chains or ambiguous open-ended tasks go to frontier models. The orchestration layer manages the handoff.

How long does it realistically take to build a proper routing layer?
Six to eighteen months for a production-grade system, depending on the number of task types, the variance in your data distribution, and how mature your observability infrastructure is. The first version is always simpler. The hard part is continuous calibration: keeping routing decisions accurate as your task mix shifts over time.

Three Verdicts, One Principle

01 Single-model stacks are not production architectures.
Routing every task to the same frontier model has no cost-efficiency mechanism, no complexity awareness, and no path to economic viability at scale. Without an AI oversight framework to govern routing decisions, better models only delay the budget problem. They don't solve it.

02 Routing is required and it can't be an afterthought.
Bolted on after the fact, tiered orchestration requires re-architecting systems already embedded in production. The organizations building it now are the ones who won't be explaining budget overruns to their CFO eighteen months from now.

03 The infrastructure is where the advantage actually sits.
Kubernetes, LangGraph, sandboxed execution, observability tooling, feedback-integrated recalibration. These aren't operational add-ons. The organizations with structural AI advantages aren't running the most powerful models. They're the ones who figured out that the game is about using the right model for each task and built the systems to make that happen.

"Enterprises that build intelligent orchestration into their AI systems early will run dramatically more automations per dollar of cloud spend. The competitive advantage in agentic AI is not a better model. It is a better system."

That's not an AI strategy. It's a systems design strategy, applied to AI. And that distinction is where most of the real value is going to be created.

Everything else works right up until it hits a budget ceiling.

About the Author:
Arleen Kaur writes about enterprise AI, system architecture, and the gap between AI pilots and production systems at Linksoft Technologies, a custom software development company.

Sources referenced:
Sequoia Capital -- $500B AI infrastructure revenue gap analysis
McKinsey -- Generative AI economic impact ($2.6T to $4.4T annually)
NVIDIA -- Agentic AI system definition
