Originally published on Medium:
https://medium.com/@ravi.myakala/ai-agents-dont-scale-like-chatbots-2434e4fbe321
Most LLM cost estimates use something like:
cost = requests * avg_tokens * price_per_token
That works for chat systems.
It breaks for AI agents.
In multi-step agent systems, cost isn’t driven primarily by request volume — it’s driven by execution depth.
Chat Workloads (Linear Scaling)
A typical chat interaction looks like:
User request
↓
LLM
↓
Response
cost ≈ requests * tokens_per_request
If traffic doubles, cost doubles.
Predictable. Linear.
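As a quick sketch (request counts, token counts, and the per-token price below are all assumed, illustrative numbers):

```python
# Linear chat cost: doubling traffic doubles cost.
def chat_cost(requests, avg_tokens, price_per_token):
    return requests * avg_tokens * price_per_token

base = chat_cost(10_000, 1_500, 3e-6)      # assumed $3 per 1M tokens
doubled = chat_cost(20_000, 1_500, 3e-6)   # same workload, 2x traffic
assert doubled == 2 * base
```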
Agent Workloads (Internal Multiplication)
Now compare that with a tool-using agent:
User task
↓
Reasoning step
↓
Tool call
↓
Reflection
↓
Another tool call
↓
More reasoning
↓
Final output
A single task can trigger multiple LLM invocations.
This internal expansion is the structural difference.
The Real Agent Cost Model
Instead of:
cost ≈ requests * tokens
Agent systems look more like:
cost ≈ (
tasks
* execution_depth
* tokens_per_step
* retry_multiplier
* burst_factor
* price_per_token
)
Where:
execution_depth = number of reasoning/tool steps per task
retry_multiplier = amplification from tool failures
burst_factor = volatility from uneven task complexity
The dominant driver becomes execution depth, not traffic.
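The multiplicative model above can be sketched directly; all the numbers here are made up purely to show how depth, not traffic, dominates:

```python
# Sketch of the agent cost model; every input value is illustrative.
def agent_cost(tasks, execution_depth, tokens_per_step,
               retry_multiplier, burst_factor, price_per_token):
    return (tasks * execution_depth * tokens_per_step
            * retry_multiplier * burst_factor * price_per_token)

# Same task volume, different depth: cost scales with steps, not requests.
shallow = agent_cost(10_000, 3, 2_000, 1.1, 1.2, 3e-6)
deep    = agent_cost(10_000, 8, 2_000, 1.1, 1.2, 3e-6)
# deep is ~2.7x shallow (8/3) at identical traffic
```

Note that `retry_multiplier` and `burst_factor` compound with depth, which is why a modest increase in steps can move the bill more than a traffic spike does.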
Why Teams Underestimate Agent Cost
Common failure points:
Execution Depth Creep: workflows evolve from 3 steps to 6–8 steps over time.
Retry Amplification: tool failures add extra reasoning cycles.
Context Accumulation: memory grows across steps.
Burst Volatility: some tasks expand far deeper than others.
By the time telemetry shows cost drift, the architecture is already deployed.
A Canonical Agent Scenario
I modeled a canonical multi-step AI agent workload with:
- Controlled execution depth
- Tool retries
- Context accumulation
- Burst volatility
Full structural breakdown here:
👉 https://www.modelindex.io/scenarios/ai-agent
The goal isn’t benchmarking models — it’s understanding structural cost behavior before deployment.
Key Takeaway
Chat systems scale with traffic.
Agent systems scale with internal execution depth.
If you’re modeling cost for multi-step workflows, execution depth is the variable you should track first.
Would love to hear how others are forecasting agent cost in production.

Top comments (6)
Great article - you've identified the core architectural problem. The "reasoning → tool call → reflection → another tool call" loop creates exactly the cost explosion you describe.
The solution is shifting from reactive agents to composed workflows (plan-then-execute). Here's how it changes the cost structure:
Reactive vs Composed Cost Model
Reactive Agents (what you described):
Execution depth is unbounded - the LLM decides when to stop, leading to the "depth creep" you mentioned.
Composed Workflows:
Execution depth is fixed by the plan - typically 3-6 deterministic steps.
The Pattern
Instead of one open-ended reasoning loop, you split each task into fixed phases:
Phase 1: Planning (strong model)
Phase 2: Execution (routed by complexity)
Phase 3: Verification
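A minimal skeleton of that three-phase shape might look like this; every function here is a hypothetical stub, not the commenter's actual implementation:

```python
# Hypothetical plan-then-execute skeleton: depth is fixed by the plan,
# not decided step-by-step by the LLM during execution.
def plan(task):
    # Phase 1: a strong model would produce a bounded step list (stubbed).
    return [("fetch", task), ("summarize", task)]

def execute(step):
    # Phase 2: route each step to a model by complexity (stubbed).
    kind, payload = step
    return f"{kind}:{payload}"

def verify(results):
    # Phase 3: check the combined output (stubbed).
    return all(results)

def run(task):
    steps = plan(task)                  # depth = len(steps), known up front
    results = [execute(s) for s in steps]
    assert verify(results)
    return results
```

The point of the structure is that `len(steps)` is known before any execution cost is incurred, so the forecast is a multiplication rather than a guess.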
Why This Changes the Cost Equation
Execution depth is fixed by the plan rather than discovered at runtime, so per-task cost is bounded, and retries are scoped to individual steps instead of whole reasoning chains.
The Trade-off
Composed workflows work when you can pre-define the execution graph. The LLM plans, then the system executes - no iterative reasoning during execution.
For truly open-ended exploration (where the next step depends on the result of the previous in unpredictable ways), reactive agents are still needed. But you pay exactly the cost you described.
The industry is moving toward this hybrid: plan with strong models, execute with cheap models, parallelize where possible.
Would love to compare notes on cost forecasting - we're tracking execution depth as the primary metric rather than request volume.
This is done using the deepagents library from LangGraph, of course.
Really good breakdown.
What you’re describing is essentially moving execution depth from emergent to bounded.
Reactive agents: depth is LLM-driven.
Composed workflows: depth is architecture-driven.
That alone changes cost predictability dramatically.
I like the retry scoping point as well — retrying units vs chains removes a major amplification vector.
Curious how you’re modeling plan complexity distribution in forecasting — fixed constant or variable?
Good catch — it's variable, not fixed.
We classify complexity upfront, before planning: simple, medium, complex. Each tier has a different expected step count (roughly 2, 4, 6). The classifier runs on the raw user request before any LLM call; it's a lightweight model, so the overhead is negligible.
So the forecasting model is:
cost = classify(request) → tier → expected_steps × avg_tokens_per_step × model_rate
The constant assumption only holds within a tier. Across tiers, the distribution is roughly what you'd expect: most requests fall in the middle, with long tails on both ends.
The number that surprised us: the classifier is right about 80% of the time, but the 20% of misclassifications are almost always underestimates (complex tasks classified as medium). So we apply a small upward correction factor specifically to the medium tier.
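The tier-based forecast described above could be sketched as follows; the step counts match the rough 2/4/6 figures mentioned, but the token counts, model rate, and the size of the medium-tier correction are assumptions:

```python
# Illustrative tier-based forecast with an upward correction on the medium
# tier, absorbing complex tasks misclassified as medium.
TIERS = {
    "simple":  {"steps": 2, "correction": 1.0},
    "medium":  {"steps": 4, "correction": 1.15},  # assumed +15% bias fix
    "complex": {"steps": 6, "correction": 1.0},
}

def forecast(tier, avg_tokens_per_step=2_000, model_rate=3e-6):
    cfg = TIERS[tier]
    return cfg["steps"] * avg_tokens_per_step * model_rate * cfg["correction"]
```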
We're still running evals to improve classification accuracy. I should mention that my use case is fairly specific, which makes it easier to measure things: our decomposition step breaks tasks into building blocks we designed (we call them orbitals; they are part of our programming language, Almadar). That decomposition step can be measured by checking that it breaks a task down into the right number of orbitals for an application (our use case is generating a JSON-like schema that converts to code).
This is super interesting, especially the asymmetric misclassification toward underestimation. That’s exactly where cost forecasting usually breaks.
You’ve basically moved uncertainty from execution depth to classification accuracy.
Reactive: depth uncertainty.
Composed: tier uncertainty.
The orbital decomposition gives you measurable structural units — which makes forecasting far more deterministic than open-ended agent workflows.
Have you tried modeling expected cost as:
Σ P(tier) × cost(tier)
plus a bias correction for misclassification? Might make the risk more explicit.
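That expected-cost formula is a one-liner in code; the tier probabilities, per-tier costs, and bias term below are illustrative placeholders:

```python
# Expected cost = sum over tiers of P(tier) * cost(tier), plus a bias term
# for the underestimating misclassifications. All numbers are illustrative.
P = {"simple": 0.25, "medium": 0.55, "complex": 0.20}        # assumed mix
COST = {"simple": 0.012, "medium": 0.024, "complex": 0.036}  # assumed $/task
MISCLASS_BIAS = 0.002  # assumed expected extra cost from underestimates

def expected_cost():
    return sum(P[t] * COST[t] for t in P) + MISCLASS_BIAS
```

Keeping the bias as an explicit separate term (rather than baking it into the medium-tier cost) makes the misclassification risk visible in the forecast.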
Really appreciate you sharing this.
It's my pleasure, thank you for sharing your knowledge with us.
I will definitely try this cost modeling approach. It makes sense. It will be part of our evals.
Glad it’s useful — and your classification-based approach is a strong way to bound depth risk.
It’s interesting how the uncertainty just moves layers rather than disappearing.
Would love to hear how the evals turn out, especially around misclassification bias.