DEV Community

ModelIndex


AI Agents Don’t Scale Like Chatbots

Originally published on Medium:
https://medium.com/@ravi.myakala/ai-agents-dont-scale-like-chatbots-2434e4fbe321

Most LLM cost estimates use something like:

cost = requests * avg_tokens * price_per_token

That works for chat systems.
It breaks for AI agents.

In multi-step agent systems, cost isn’t driven primarily by request volume — it’s driven by execution depth.


Chat Workloads (Linear Scaling)

A typical chat interaction looks like:

User request
   ↓
LLM
   ↓
Response
cost ≈ requests * tokens_per_request


If traffic doubles, cost doubles.
Predictable. Linear.
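The linear model is a one-liner. In this sketch the request counts and per-token price are made-up illustrations, not real pricing:

```python
def chat_cost(requests: int, avg_tokens: int, price_per_token: float) -> float:
    """Linear chat cost: doubling requests doubles cost."""
    return requests * avg_tokens * price_per_token

# Illustrative numbers only (not real pricing):
base = chat_cost(1_000, 500, 0.000002)
doubled = chat_cost(2_000, 500, 0.000002)
assert doubled == 2 * base  # linear scaling with traffic
```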


Agent Workloads (Internal Multiplication)

Now compare that with a tool-using agent:

User task
   ↓
Reasoning step
   ↓
Tool call
   ↓
Reflection
   ↓
Another tool call
   ↓
More reasoning
   ↓
Final output

Chat vs Agent Cost Structure

A single task can trigger multiple LLM invocations.
This internal expansion is the structural difference.


The Real Agent Cost Model

Instead of:

cost ≈ requests * tokens


Agent systems look more like:

cost ≈ (
    tasks
    * execution_depth
    * tokens_per_step
    * retry_multiplier
    * burst_factor
    * price_per_token
)

Where:

  • execution_depth = number of reasoning/tool steps per task
  • retry_multiplier = amplification from tool failures
  • burst_factor = volatility from uneven task complexity

The dominant driver becomes execution depth, not traffic.
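A minimal sketch of this model (every number below is an illustrative placeholder):

```python
def agent_cost(tasks: int, execution_depth: int, tokens_per_step: int,
               retry_multiplier: float, burst_factor: float,
               price_per_token: float) -> float:
    """Agent cost: depth, retries, and bursts multiply inside each task."""
    return (tasks * execution_depth * tokens_per_step
            * retry_multiplier * burst_factor * price_per_token)

# Illustrative: doubling depth doubles cost even at constant traffic.
shallow = agent_cost(1_000, 3, 800, 1.2, 1.5, 0.000002)
deep = agent_cost(1_000, 6, 800, 1.2, 1.5, 0.000002)
assert deep == 2 * shallow
```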

Why Teams Underestimate Agent Cost

Common failure points:

  1. Execution Depth Creep
    Workflows evolve from 3 steps to 6–8 steps over time.

  2. Retry Amplification
    Tool failures add extra reasoning cycles.

  3. Context Accumulation
    Memory grows across steps.

  4. Burst Volatility
    Some tasks expand far deeper than others.

By the time telemetry shows cost drift, the architecture is already deployed.

A Canonical Agent Scenario

I modeled a canonical multi-step AI agent workload with:

  • Controlled execution depth
  • Tool retries
  • Context accumulation
  • Burst volatility

Full structural breakdown here:
👉 https://www.modelindex.io/scenarios/ai-agent

The goal isn’t benchmarking models — it’s understanding structural cost behavior before deployment.

Key Takeaway

Chat systems scale with traffic.
Agent systems scale with internal execution depth.
If you’re modeling cost for multi-step workflows, execution depth is the variable you should track first.

Would love to hear how others are forecasting agent cost in production.

Top comments (6)

Osama Alghanmi

Great article - you've identified the core architectural problem. The "reasoning → tool call → reflection → another tool call" loop creates exactly the cost explosion you describe.

The solution is shifting from reactive agents to composed workflows (plan-then-execute). Here's how it changes the cost structure:

Reactive vs Composed Cost Model

Reactive Agents (what you described):

cost = tasks × execution_depth × tokens_per_step × retry_multiplier

Execution depth is unbounded - the LLM decides when to stop, leading to the "depth creep" you mentioned.

Composed Workflows:

cost = planning_cost + Σ(generation_steps) + verification_cost

Execution depth is fixed by the plan - typically 3-6 deterministic steps.

The Pattern

Instead of:

User asks → LLM reasons → Tool call → LLM reflects → Another tool call → ...

You do:

User asks → [1] LLM creates plan → [2] Execute plan steps (parallel if possible) → [3] Verify results

Phase 1: Planning (Strong model)

  • Decompose request into discrete units
  • Fixed 1 LLM call

Phase 2: Execution (Routed by complexity)

  • Simple work: Fast/cheap model sequentially
  • Complex work: Multiple models in parallel

Phase 3: Verification

  • Validate output
  • Retry failed units only
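A minimal sketch of the three-phase shape, assuming `plan`, `execute_step`, and `verify` are supplied callables (placeholders for illustration, not an actual library API):

```python
from concurrent.futures import ThreadPoolExecutor

def run_composed(task, plan, execute_step, verify, max_unit_retries=1):
    """Plan once, execute units in parallel, verify, retry failed units only."""
    units = plan(task)                      # Phase 1: one planning LLM call
    with ThreadPoolExecutor() as pool:      # Phase 2: stateless, parallel execution
        results = list(pool.map(execute_step, units))
    for i, (unit, result) in enumerate(zip(units, results)):
        attempts = 0
        while not verify(unit, result) and attempts < max_unit_retries:
            result = execute_step(unit)     # Phase 3: retry only the failed unit
            attempts += 1
        results[i] = result
    return results
```

Because the plan fixes the unit count upfront, the worst-case LLM call count is bounded by `1 + len(units) * (1 + max_unit_retries)` rather than being decided step-by-step by the model.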

Why This Changes the Cost Equation

| Factor | Reactive | Composed |
| --- | --- | --- |
| Execution depth | Unbounded (LLM decides) | Fixed by plan |
| Tool calls | Variable per task | Deterministic |
| Reflection loops | Yes - major cost driver | No |
| Parallelization | Hard (stateful) | Easy (stateless) |
| Cost predictability | Low | High |

Results We've Seen

| Approach | Time | Complex Task Cost | Tool Calls |
| --- | --- | --- | --- |
| Reactive (Claude) | ~6 min | ~$2.27 | 18 |
| Composed Multi-Provider | ~5 min | ~$0.50-0.75 | 6 |

The key wins:

  1. Bounded steps - Plan → Execute → Verify (fixed 3 phases)
  2. No reflection loops - Eliminates the "let me think about this" recursion
  3. Parallel execution - Route work to multiple providers simultaneously
  4. Deterministic routing - Complexity classifier picks strategy upfront

The Trade-off

Composed workflows work when you can pre-define the execution graph. The LLM plans, then the system executes - no iterative reasoning during execution.

For truly open-ended exploration (where the next step depends on the result of the previous in unpredictable ways), reactive agents are still needed. But you pay exactly the cost you described.

The industry is moving toward this hybrid: plan with strong models, execute with cheap models, parallelize where possible.

Would love to compare notes on cost forecasting - we're tracking execution depth as the primary metric rather than request volume.

This is done using the deepagents library from LangGraph, of course.

ModelIndex

Really good breakdown.

What you’re describing is essentially moving execution depth from emergent to bounded.

Reactive agents: depth is LLM-driven.
Composed workflows: depth is architecture-driven.

That alone changes cost predictability dramatically.

I like the retry scoping point as well — retrying units vs chains removes a major amplification vector.

Curious how you’re modeling plan complexity distribution in forecasting — fixed constant or variable?

Osama Alghanmi • Edited

Good catch — it's variable, not fixed.

We classify complexity upfront before planning: simple, medium, complex. Each tier has a different expected step count (roughly 2, 4, 6). The classifier runs on the raw user request before any LLM call — it's a lightweight model, so the overhead is negligible.

So the forecasting model is:

cost = classify(request) → tier → expected_steps × avg_tokens_per_step × model_rate

The constant assumption only holds within a tier. Across tiers, the distribution is roughly what you'd expect — most requests fall in the middle, with long tails on both ends.

The number that surprised us: the classifier is right about 80% of the time, but the 20% misclassifications are almost always underestimates (complex tasks classified as medium). So we apply a small upward correction factor specifically to the medium tier.

Still running evals to improve classification accuracy. I have to mention that my use case or workflow is pretty specific, which makes it easier for me to measure things. Basically, the decomposition step we have breaks things into the building blocks we designed (we call them orbitals; they are part of our programming language Almadar). So, this decomposition step can be measured by ensuring it breaks things down correctly into the right number of "Orbitals" or "Building blocks" that represent an application (our use case is to generate a JSON-like schema that converts to code).

ModelIndex

This is super interesting, especially the asymmetric misclassification toward underestimation. That’s exactly where cost forecasting usually breaks.

You’ve basically moved uncertainty from execution depth to classification accuracy.

Reactive: depth uncertainty.
Composed: tier uncertainty.

The orbital decomposition gives you measurable structural units — which makes forecasting far more deterministic than open-ended agent workflows.

Have you tried modeling expected cost as:

Σ P(tier) × cost(tier)

plus a bias correction for misclassification? Might make the risk more explicit.
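Sketched in Python (the tier probabilities, per-task costs, and bias term are invented placeholders, not numbers from this thread):

```python
# Hypothetical tier mix and per-tier costs; the bias term folds in
# the expected extra cost from underestimating misclassified tasks.
P_TIER = {"simple": 0.25, "medium": 0.55, "complex": 0.20}
COST_TIER = {"simple": 0.002, "medium": 0.007, "complex": 0.015}
BIAS_CORRECTION = 0.001

# Expected per-task cost: sum of P(tier) * cost(tier), plus bias correction.
expected_cost = sum(P_TIER[t] * COST_TIER[t] for t in P_TIER) + BIAS_CORRECTION
```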

Really appreciate you sharing this.

Osama Alghanmi

It's my pleasure, thank you for sharing your knowledge with us.
I will definitely try this cost modeling approach. It makes sense. It will be part of our evals.

ModelIndex

Glad it’s useful — and your classification-based approach is a strong way to bound depth risk.

It’s interesting how the uncertainty just moves layers rather than disappearing.

Would love to hear how the evals turn out, especially around misclassification bias.