DEV Community

ModelIndex


AI Agents Don’t Scale Like Chatbots

Originally published on Medium:
https://medium.com/@ravi.myakala/ai-agents-dont-scale-like-chatbots-2434e4fbe321

Most LLM cost estimates use something like:

cost = requests * avg_tokens * price_per_token

That works for chat systems.
It breaks for AI agents.

In multi-step agent systems, cost isn’t driven primarily by request volume — it’s driven by execution depth.


Chat Workloads (Linear Scaling)

A typical chat interaction looks like:

User request
   ↓
LLM
   ↓
Response
cost ≈ requests * tokens_per_request


If traffic doubles, cost doubles.
Predictable. Linear.
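The linear model is a one-liner. In this sketch the request counts and per-token price are made-up illustrations, not real pricing:

```python
def chat_cost(requests: int, avg_tokens: int, price_per_token: float) -> float:
    """Linear chat cost: doubling requests doubles cost."""
    return requests * avg_tokens * price_per_token

# Illustrative numbers only (not real pricing):
base = chat_cost(1_000, 500, 0.000002)
doubled = chat_cost(2_000, 500, 0.000002)
assert doubled == 2 * base  # linear scaling with traffic
```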


Agent Workloads (Internal Multiplication)

Now compare that with a tool-using agent:

User task
   ↓
Reasoning step
   ↓
Tool call
   ↓
Reflection
   ↓
Another tool call
   ↓
More reasoning
   ↓
Final output

Chat vs Agent Cost Structure

A single task can trigger multiple LLM invocations.
This internal expansion is the structural difference.


The Real Agent Cost Model

Instead of:

cost ≈ requests * tokens


Agent systems look more like:

cost ≈ (
    tasks
    * execution_depth
    * tokens_per_step
    * retry_multiplier
    * burst_factor
    * price_per_token
)

Where:

  • execution_depth = number of reasoning/tool steps per task
  • retry_multiplier = amplification from tool failures
  • burst_factor = volatility from uneven task complexity

The dominant driver becomes execution depth, not traffic.
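A minimal sketch of this model (every number below is an illustrative placeholder):

```python
def agent_cost(tasks: int, execution_depth: int, tokens_per_step: int,
               retry_multiplier: float, burst_factor: float,
               price_per_token: float) -> float:
    """Agent cost: depth, retries, and bursts multiply inside each task."""
    return (tasks * execution_depth * tokens_per_step
            * retry_multiplier * burst_factor * price_per_token)

# Illustrative: doubling depth doubles cost even at constant traffic.
shallow = agent_cost(1_000, 3, 800, 1.2, 1.5, 0.000002)
deep = agent_cost(1_000, 6, 800, 1.2, 1.5, 0.000002)
assert deep == 2 * shallow
```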

Why Teams Underestimate Agent Cost

Common failure points:

  1. Execution Depth Creep
    Workflows evolve from 3 steps to 6–8 steps over time.

  2. Retry Amplification
    Tool failures add extra reasoning cycles.

  3. Context Accumulation
    Memory grows across steps.

  4. Burst Volatility
    Some tasks expand far deeper than others.

By the time telemetry shows cost drift, the architecture is already deployed.

A Canonical Agent Scenario

I modeled a canonical multi-step AI agent workload with:

  • Controlled execution depth
  • Tool retries
  • Context accumulation
  • Burst volatility

Full structural breakdown here:
👉 https://www.modelindex.io/scenarios/ai-agent

The goal isn’t benchmarking models — it’s understanding structural cost behavior before deployment.

Key Takeaway

Chat systems scale with traffic.
Agent systems scale with internal execution depth.
If you’re modeling cost for multi-step workflows, execution depth is the variable you should track first.

Would love to hear how others are forecasting agent cost in production.

Top comments (6)

Osama Alghanmi

Great article - you've identified the core architectural problem. The "reasoning → tool call → reflection → another tool call" loop creates exactly the cost explosion you describe.

The solution is shifting from reactive agents to composed workflows (plan-then-execute). Here's how it changes the cost structure:

Reactive vs Composed Cost Model

Reactive Agents (what you described):

cost = tasks × execution_depth × tokens_per_step × retry_multiplier

Execution depth is unbounded - the LLM decides when to stop, leading to the "depth creep" you mentioned.

Composed Workflows:

cost = planning_cost + Σ(generation_steps) + verification_cost

Execution depth is fixed by the plan - typically 3-6 deterministic steps.

The Pattern

Instead of:

User asks → LLM reasons → Tool call → LLM reflects → Another tool call → ...

You do:

User asks → [1] LLM creates plan → [2] Execute plan steps (parallel if possible) → [3] Verify results

Phase 1: Planning (Strong model)

  • Decompose request into discrete units
  • Fixed 1 LLM call

Phase 2: Execution (Routed by complexity)

  • Simple work: Fast/cheap model sequentially
  • Complex work: Multiple models in parallel

Phase 3: Verification

  • Validate output
  • Retry failed units only
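A minimal sketch of the three-phase shape, assuming `plan`, `execute_step`, and `verify` are supplied callables (placeholders for illustration, not an actual library API):

```python
from concurrent.futures import ThreadPoolExecutor

def run_composed(task, plan, execute_step, verify, max_unit_retries=1):
    """Plan once, execute units in parallel, verify, retry failed units only."""
    units = plan(task)                      # Phase 1: one planning LLM call
    with ThreadPoolExecutor() as pool:      # Phase 2: stateless, parallel execution
        results = list(pool.map(execute_step, units))
    for i, (unit, result) in enumerate(zip(units, results)):
        attempts = 0
        while not verify(unit, result) and attempts < max_unit_retries:
            result = execute_step(unit)     # Phase 3: retry only the failed unit
            attempts += 1
        results[i] = result
    return results
```

Because the plan fixes the unit count upfront, the worst-case LLM call count is bounded by `1 + len(units) * (1 + max_unit_retries)` rather than being decided step-by-step by the model.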

Why This Changes the Cost Equation

| Factor | Reactive | Composed |
| --- | --- | --- |
| Execution depth | Unbounded (LLM decides) | Fixed by plan |
| Tool calls | Variable per task | Deterministic |
| Reflection loops | Yes - major cost driver | No |
| Parallelization | Hard (stateful) | Easy (stateless) |
| Cost predictability | Low | High |

Results We've Seen

| Approach | Time | Complex Task Cost | Tool Calls |
| --- | --- | --- | --- |
| Reactive (Claude) | ~6 min | ~$2.27 | 18 |
| Composed Multi-Provider | ~5 min | ~$0.50-0.75 | 6 |

The key wins:

  1. Bounded steps - Plan → Execute → Verify (fixed 3 phases)
  2. No reflection loops - Eliminates the "let me think about this" recursion
  3. Parallel execution - Route work to multiple providers simultaneously
  4. Deterministic routing - Complexity classifier picks strategy upfront

The Trade-off

Composed workflows work when you can pre-define the execution graph. The LLM plans, then the system executes - no iterative reasoning during execution.

For truly open-ended exploration (where the next step depends on the result of the previous in unpredictable ways), reactive agents are still needed. But you pay exactly the cost you described.

The industry is moving toward this hybrid: plan with strong models, execute with cheap models, parallelize where possible.

Would love to compare notes on cost forecasting - we're tracking execution depth as the primary metric rather than request volume.

This is done using the deepagents library from LangGraph, of course.

ModelIndex

Really good breakdown.

What you’re describing is essentially moving execution depth from emergent to bounded.

Reactive agents: depth is LLM-driven.
Composed workflows: depth is architecture-driven.

That alone changes cost predictability dramatically.

I like the retry scoping point as well — retrying units vs chains removes a major amplification vector.

Curious how you’re modeling plan complexity distribution in forecasting — fixed constant or variable?

Osama Alghanmi • Edited

Good catch — it's variable, not fixed.

We classify complexity upfront before planning: simple, medium, complex. Each tier has a different expected step count (roughly 2, 4, 6). The classifier runs on the raw user request before any LLM call — it's a lightweight model, so the overhead is negligible.

So the forecasting model is:

cost = classify(request) → tier → expected_steps × avg_tokens_per_step × model_rate

The constant assumption only holds within a tier. Across tiers, the distribution is roughly what you'd expect — most requests fall in the middle, with long tails on both ends.

The number that surprised us: the classifier is right about 80% of the time, but the 20% misclassifications are almost always underestimates (complex tasks classified as medium). So we apply a small upward correction factor specifically to the medium tier.

Still running evals to improve classification accuracy. I have to mention that my use case or workflow is pretty specific, which makes it easier for me to measure things. Basically, the decomposition step we have breaks things into the building blocks we designed (we call them orbitals; they are part of our programming language Almadar). So, this decomposition step can be measured by ensuring it breaks things down correctly into the right number of "Orbitals" or "Building blocks" that represent an application (our use case is to generate a JSON-like schema that converts to code).

ModelIndex

This is super interesting, especially the asymmetric misclassification toward underestimation. That’s exactly where cost forecasting usually breaks.

You’ve basically moved uncertainty from execution depth to classification accuracy.

Reactive: depth uncertainty.
Composed: tier uncertainty.

The orbital decomposition gives you measurable structural units — which makes forecasting far more deterministic than open-ended agent workflows.

Have you tried modeling expected cost as:

Σ P(tier) × cost(tier)

plus a bias correction for misclassification? Might make the risk more explicit.
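Sketched in Python (the tier probabilities, per-task costs, and bias term are invented placeholders, not numbers from this thread):

```python
# Hypothetical tier mix and per-tier costs; the bias term folds in
# the expected extra cost from underestimating misclassified tasks.
P_TIER = {"simple": 0.25, "medium": 0.55, "complex": 0.20}
COST_TIER = {"simple": 0.002, "medium": 0.007, "complex": 0.015}
BIAS_CORRECTION = 0.001

# Expected per-task cost: sum of P(tier) * cost(tier), plus bias correction.
expected_cost = sum(P_TIER[t] * COST_TIER[t] for t in P_TIER) + BIAS_CORRECTION
```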

Really appreciate you sharing this.

Osama Alghanmi

It's my pleasure, thank you for sharing your knowledge with us.
I will definitely try this cost modeling approach. It makes sense. It will be part of our evals.

ModelIndex

Glad it’s useful — and your classification-based approach is a strong way to bound depth risk.

It’s interesting how the uncertainty just moves layers rather than disappearing.

Would love to hear how the evals turn out, especially around misclassification bias.