Do AI Agent Costs Grow Exponentially? I Ran My Logs and the Answer Surprised Me
A Hacker News thread hit 208 points this week asking whether AI agent costs grow exponentially with complexity. The discussion is mostly theoretical — mathematical models, algorithmic complexity analysis, extrapolations. Interesting stuff. But I have something better: months of real logs from agents I've been running in production and in personal projects. And the empirical answer is completely different from what the thread concludes. It's not exponential. But it's not linear or predictable either. It's jumps. Discrete jumps that you're triggering without even realizing it.
AI Agent Costs in 2025: What Theory Says vs. What My Logs Say
The popular hypothesis — the one the HN thread basically takes for granted — is that if an agent handles tasks of complexity N, the token cost grows as O(N²) or worse. The logic is intuitive: more context, more tools, more iterations, all multiplying against each other.
My experience says something else entirely.
I pulled three months of agent logs: the curation system I described in the post about stale Awesome lists, the experiments with SPICE + Claude Code, the metrics I started tracking after building CodeBurn, and some internal automation agents I never published. Total: 847 agent runs, 23 distinct tasks, three different models.
When I graphed cost vs. task complexity, I expected a curve. What I saw were steps.
# Cost distribution analysis per run
# Real anonymized data from my logs
import pandas as pd
import numpy as np

# Load logs exported from CodeBurn
df = pd.read_csv('agent_runs_q1_q2_2025.csv')

# Classify by cost range
df['cost_range'] = pd.cut(
    df['total_tokens'],
    bins=[0, 5000, 20000, 80000, 200000, np.inf],
    labels=['micro', 'small', 'medium', 'large', 'monstrous']
)

# What I expected: gradual distribution
# What I found: strong clustering in specific ranges
print(df['cost_range'].value_counts())

# Actual output:
# small      312  (36.8%)
# micro      298  (35.2%)
# medium     187  (22.1%)
# large       41  ( 4.8%)
# monstrous    9  ( 1.1%)
72% of my runs live in the two cheapest ranges. The "monstrous" runs? Nine. Nine. Over three months.
But those 9 runs accounted for 31% of my total API spend.
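That skew is worth quantifying in your own logs. A minimal sketch in plain Python (the numbers below are a toy dataset, not my data; the real input is one dollar cost per run):

```python
# Sketch: how much of total spend the top-N most expensive runs represent.
def spend_concentration(costs, top_n):
    """Return the share of total spend held by the top_n priciest runs."""
    ranked = sorted(costs, reverse=True)
    return sum(ranked[:top_n]) / sum(ranked)

# Toy example: 20 cheap runs and 2 monstrous ones
costs = [0.05] * 20 + [3.0, 4.0]
share = spend_concentration(costs, top_n=2)
print(f"top 2 runs = {share:.0%} of spend")
```

If that number is high, averages will lie to you: optimize the tail, not the typical run.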
Where the Jumps Are: The Decisions That Don't Feel Like Decisions
Here's where it gets interesting. When I dug into what was actually causing the jumps — especially the large and monstrous runs — I found very specific patterns. It's not task complexity. It's how I designed the agent.
Jump 1: unbounded accumulative context
The most expensive mistake I made was in the curation agent. The initial design accumulated the full decision history in the agent's context. The logic was: "it needs to know what it decided before to stay consistent."
Correct in theory. Catastrophic in practice.
// Original version — the one that cost me
async function processBatch(items: CuratedItem[], history: Decision[]) {
  const context = {
    // ERROR: full history always included
    // With 200 items processed, this became enormous
    fullHistory: history, // ← here's the problem
    currentItem: items[0]
  }
  return await claude.complete(buildPrompt(context))
}

// Fixed version — sliding window
async function processBatchV2(items: CuratedItem[], history: Decision[]) {
  const context = {
    // Only the last N relevant decisions
    recentHistory: history.slice(-10), // ← fixed window
    // Compressed summary of earlier history
    historySummary: history.length > 10
      ? await generateSummary(history.slice(0, -10))
      : null,
    currentItem: items[0]
  }
  return await claude.complete(buildPrompt(context))
}
This single change reduced that agent's cost by 67%. I didn't change the model. I didn't change the task. I changed how I handle context.
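Stripped of the agent plumbing, the pattern is small enough to fit in a few lines of Python (the 10-item window mirrors the TypeScript above; `summarize` is a stand-in for whatever compression call you actually use):

```python
def build_context(history, window=10, summarize=None):
    """Fixed-size context: last `window` decisions verbatim, older ones compressed."""
    recent = history[-window:]
    older = history[:-window] if len(history) > window else []
    summary = summarize(older) if (older and summarize) else None
    return {"recent_history": recent, "history_summary": summary}

# After 200 processed items, the context stays bounded:
ctx = build_context(list(range(200)), summarize=lambda h: f"{len(h)} earlier decisions")
print(len(ctx["recent_history"]), ctx["history_summary"])  # prints: 10 190 earlier decisions
```

The invariant that matters: context size is a function of `window`, not of how long the agent has been running.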
Jump 2: the "just in case" model
After the conversation about scarcity in frontier models, I went back and audited which model I was using for each subtask in my agents. I found something embarrassing: I was using Opus for simple classification tasks because "if it fails, the whole agent is broken and that's a problem."
That's fear dressed up as architecture.
The reality: for classifying whether a URL is relevant or not, Claude Haiku with a well-written prompt hits 94% accuracy on my dataset. Opus hits 97%. I was paying 15x more for 3 percentage points on a task where failure has zero cost (I just reclassify the edge case).
// Model map by task type — what I actually implemented
const MODEL_BY_TASK = {
  // Simple binary classification → cheap model
  relevance_classification: 'claude-haiku-4-5',
  // Structured extraction → mid-tier model
  metadata_extraction: 'claude-sonnet-4-5',
  // Complex reasoning, consequential decisions → expensive model
  architecture_analysis: 'claude-opus-4-5',
  // Code generation with broad context → mid-tier model
  code_generation: 'claude-sonnet-4-5',
} as const

type TaskType = keyof typeof MODEL_BY_TASK

async function runWithCorrectModel(
  task: TaskType,
  prompt: string
) {
  const model = MODEL_BY_TASK[task]
  return await anthropic.messages.create({
    model: model,
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 1024
  })
}
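The arithmetic behind retiring Opus from that subtask is worth making explicit. A back-of-envelope sketch; the per-call prices below are illustrative assumptions, not real rate-card numbers:

```python
# Cost per 1,000 classifications when misses are cheap to redo:
# run the cheap model on everything, then reclassify its misses once
# with the expensive model.
def cost_per_1k(price_per_call, accuracy, reclassify_price):
    """Base cost plus one expensive retry for each miss."""
    misses = 1000 * (1 - accuracy)
    return 1000 * price_per_call + misses * reclassify_price

opus_only = 1000 * 0.015                           # expensive model on everything
haiku_plus_retry = cost_per_1k(0.001, 0.94, 0.015)  # cheap model + retry the ~6% misses
print(f"opus only: ${opus_only:.2f}  haiku + retry: ${haiku_plus_retry:.2f}")
```

Even with every miss retried on the expensive model, the routed version costs a fraction of the flat approach, because the retry term only applies to the error rate, not to the whole volume.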
Jump 3: the loop with no exit condition
This one cost me a cool $23 in a single night. An agent designed to iterate until it was "satisfied" with the result. Without defining what satisfied means. Without an iteration limit.
The agent ran 47 times on the same task.
// The classic mistake
async function iterativeAgent(task: string) {
  let result = ''
  let satisfied = false
  // NO LIMIT — this is a time bomb
  while (!satisfied) {
    result = await executeStep(task, result)
    satisfied = await evaluateQuality(result) // LLM evaluating LLM
  }
  return result
}
// Version with circuit breaker
async function safeIterativeAgent(
  task: string,
  maxIterations: number = 5 // always an explicit limit
) {
  let result = ''
  let iteration = 0
  while (iteration < maxIterations) {
    iteration++ // count before the exit check so the returned tally is accurate
    result = await executeStep(task, result)
    const evaluation = await evaluateQuality(result)
    if (evaluation.score >= 0.85) break // numeric threshold, not a vibe check
    // Log to catch expensive loops early
    if (iteration >= 3) {
      console.warn(`⚠️ Agent on iteration ${iteration} — review design`)
    }
  }
  return { result, iterations: iteration }
}
This combination — unbounded loops + LLM evaluating LLM — is the most expensive pattern I've seen in my logs. The edge inference agent with Cloudflare taught me that moving the evaluation point can radically change the cost profile of a loop.
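That observation about the evaluation point reduces to simple arithmetic: if every iteration pays a step cost plus an evaluation cost, judging only every k-th step divides the evaluation term by roughly k. A sketch:

```python
import math

def loop_cost(steps, step_cost, eval_cost, eval_every=1):
    """Total cost of an iterate-and-evaluate loop when you only judge every k-th step."""
    return steps * step_cost + math.ceil(steps / eval_every) * eval_cost

# Judging every step vs every 3rd step, with an evaluator as pricey as the step itself:
print(loop_cost(12, 1.0, 1.0, eval_every=1))  # 24.0
print(loop_cost(12, 1.0, 1.0, eval_every=3))  # 16.0
```

When the evaluator is an LLM as expensive as the worker, its cadence is half your loop cost.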
The Gotchas That Aren't in Any Tutorial
The cost of badly implemented "memory." Every agent framework has some kind of memory. Almost none of them explain that naive memory (saving everything and replaying it) grows quadratically in cost: every new message pays to re-read every previous message. You need active compression or selective retrieval, not append-only storage.
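The arithmetic is easy to check: when every call replays the full transcript, the i-th call pays for all i messages so far, so total billed tokens grow with the square of the conversation length. A sketch:

```python
def append_only_tokens(messages, tokens_per_message=500):
    """Total input tokens billed across `messages` calls when each call replays the full history."""
    return sum(i * tokens_per_message for i in range(1, messages + 1))

# 100 messages at ~500 tokens each: you pay for 2.5M tokens, not 50k
print(append_only_tokens(100))  # 2525000
```

The content is only 50,000 tokens; the other 2.47M is replay overhead, which is exactly what compression or retrieval is there to eliminate.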
Unnecessarily bloated JSON. My agents pass state around in JSON. For weeks I never thought about the size of that JSON. It had debug fields, redundant metadata, timestamps in full ISO format. Compressing the JSON schema circulating in the agent's context saved me an average of 800 tokens per call. Sounds small. At 300 calls a day, it's not small.
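The fix is mechanical: strip the fields the model never needs before the state enters the context. A sketch; the field names here are illustrative, not my actual schema:

```python
import json

# Fields the model never reads: drop them before serializing state into the prompt.
DROP = {"debug", "trace_id", "created_at", "updated_at"}

def slim_state(state: dict) -> dict:
    """Keep only the fields the model actually uses."""
    return {k: v for k, v in state.items() if k not in DROP}

state = {"item": "https://example.com", "score": 0.91,
         "debug": {"retries": 2}, "trace_id": "abc-123",
         "created_at": "2025-04-01T12:00:00.000Z"}
before = len(json.dumps(state))
after = len(json.dumps(slim_state(state)))
print(f"{before} -> {after} chars per state object")
```

Multiply the per-object saving by objects per call and calls per day, and "small" stops being small.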
The system prompt that duplicates itself. In some frameworks, if you're not careful, the system prompt gets included in every history message in addition to the system slot. I caught it by staring at tokenization logs. It was a 2,000-token system prompt appearing 8 times in a context window. 16,000 tokens of pure overhead.
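A crude detector for this is only a few lines of Python; the prompt and messages below are toy examples, not from my logs:

```python
# Flag history messages that contain the system prompt verbatim:
# it should live in the system slot exactly once, not in the transcript.
def count_prompt_copies(messages, system_prompt):
    """How many history messages embed the system prompt."""
    return sum(1 for m in messages if system_prompt in m.get("content", ""))

system_prompt = "You are a careful curation assistant."
messages = [
    {"role": "user", "content": system_prompt + "\nClassify this URL."},
    {"role": "assistant", "content": "Relevant."},
    {"role": "user", "content": system_prompt + "\nClassify the next one."},
]
copies = count_prompt_copies(messages, system_prompt)
if copies > 0:
    print(f"system prompt duplicated {copies}x in history")  # here: 2x
```

Run something like this against the exact payload your framework sends, not against what you think it sends.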
The tool that always calls the most expensive tool. If your agent has tools with very different costs (one calls GPT-4o, another does a local database lookup) and the agent learns to prefer the LLM tool because "it's more flexible," you'll have a cost problem that looks random but isn't.
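One way to make that pattern visible is a per-tool spend ledger. A sketch; the tool names and unit costs are illustrative assumptions:

```python
from collections import Counter

# Illustrative per-call costs: one tool hits an LLM, the other a local database.
UNIT_COST = {"llm_answer": 0.02, "db_lookup": 0.0001}

def spend_by_tool(tool_calls):
    """Map each tool to its total cost, given the list of tool-call names from a run."""
    counts = Counter(tool_calls)
    return {tool: n * UNIT_COST[tool] for tool, n in counts.items()}

calls = ["llm_answer"] * 40 + ["db_lookup"] * 5  # the smell: the flexible tool wins by default
print(spend_by_tool(calls))
```

If the expensive tool dominates the ledger on tasks the cheap tool could handle, the "random" cost spikes stop looking random.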
FAQ: AI Agent Costs in 2025
Do AI agent costs really grow exponentially?
Empirically, in my data: no. They grow in discrete jumps tied to specific design decisions. The exponential growth the theory shows assumes unlimited context with no compression — nobody should be designing an agent that way. The real problem isn't algorithmic complexity, it's sloppy architecture: unbounded accumulative context, loops with no exit condition, and "just in case" model selection.
How much does a well-designed agent spend per task on average?
It varies a lot by task, but in my logs 72% of runs come in under 20,000 tokens. At current Claude Sonnet pricing, that's roughly $0.003 to $0.06 per run. The "monstrous" runs (200k+ tokens) represent 1% of cases but 31% of spend — that's exactly where you should be looking.
Is it worth using cheaper models for subtasks?
Yes, unambiguously. The task-complexity routing pattern — Haiku for classification, Sonnet for generation, Opus only for complex reasoning — cut my monthly spend by 40% with no perceptible degradation in final output quality. The key is defining quality metrics per subtask so you know which model is actually good enough.
How do I know if my agent has a cost problem before it blows up?
Three early warning signals: (1) cost per run has very high variance — if some runs cost 10x the average, you have a badly designed pattern, (2) the number of tool-calls is growing faster than task complexity, (3) average context size is growing run-over-run instead of staying stable. Tools like CodeBurn help catch this systematically.
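Signal (1) is simple to automate. A sketch that flags any run costing more than 10x the median run; the costs below are toy numbers:

```python
from statistics import median

def cost_outliers(costs, factor=10):
    """Return the runs costing more than `factor` times the median run."""
    typical = median(costs)
    return [c for c in costs if c > factor * typical]

costs = [0.02, 0.03, 0.02, 0.04, 0.03, 1.10]  # one run ~30x the rest
flagged = cost_outliers(costs)
print(f"{len(flagged)} outlier run(s), worst at ${max(flagged):.2f}")
```

Comparing against the median rather than the mean matters here: the outliers you are hunting would drag the mean up and hide themselves.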
Are circuit breakers in agents over-engineering?
No. They're the minimum viable safety net. An agent with no iteration limit in production is an incident waiting to happen. The limit doesn't have to be rigid — it can be "5 iterations, or when the score exceeds 0.85, whichever comes first" — but it has to exist. The cost of an infinite loop is potentially unlimited, and LLMs are not going to tell you "hey, this isn't working, stop."
Is it better to have an agent with many specialized tools or few general-purpose tools?
On cost: many specialized tools wins, as long as routing is solid. The problem with general-purpose tools is the agent tends to use them for everything, including cases where a cheap, specific tool would have been enough. The routing overhead with more tools is real but smaller than the overhead of using an expensive tool when you didn't need to.
The Conclusion I Didn't See Coming
I started this analysis to respond to the HN thread. I ended up realizing something more uncomfortable: most of my expensive runs were ones I generated myself. Not because of task complexity. Because of design decisions I made in 30 seconds, without thinking about cost, because "it worked."
The exponential growth of agent costs is a partially true myth. It's true if you design without thinking. It's false if you treat context, loops, and model selection as architectural decisions with real economic consequences.
The good news: once you find the jumps in your own logs, they're easy to eliminate. You don't need to change the model, the task, or the framework. You need to change how you think about your agent's state and context.
The bad news: nobody's going to warn you that you're generating them until the invoice shows up.
If you're not measuring your runs, start today. Not tomorrow. Today.