My agent spent $5.40 to do what a 200-line script does for free. Then I spent a weekend fixing it, and brought the same workflow down to $2.05 per run — a 62% drop with no measurable quality regression. This is the breakdown, with the actual prompt diffs and the benchmarks that mattered.
The starting point: a single agent, three habits, $847/mo
The agent I run most is a research-and-summarize loop. It searches the web, scrapes ~20 pages, drafts a structured summary, and writes a file. Sounds harmless. The bill said otherwise.
Three things were quietly hemorrhaging tokens:
- Full-page scraping with no chunking. I was dumping entire 50k-character pages into the context window, then asking the model to extract the relevant 2,000 characters. That's a textbook case of "context stuffing" — paying for the whole haystack when you only need the needle.
- Verbose system prompts that re-stated the same instructions three different ways. The model already knows how to summarize. I was paying for it to re-read my instructions every call.
- Using the most expensive model for every step. Reasoning-tier models were being used for "summarize this paragraph" — the same task a smaller model handles fine.
A 2026 Stevens Institute analysis pegs unconstrained agents at $5–$8 per task. Mine was $5.40. Textbook.
What I changed (the actual diffs)
1. Chunk-then-extract, not dump-then-ask
The old pattern:
# BAD: pay for the whole page, then ask the model to find the bit you wanted
page = fetch(url) # ~50,000 chars
response = llm(f"Summarize this page, focusing on {topic}:\n\n{page}")
The new pattern:
# GOOD: filter first, then send only what's relevant
page = fetch(url)
chunks = chunk(page, max_chars=4000)
relevant = [c for c in chunks if keyword_score(c, topic) > 0.3]
relevant = relevant[:5] # hard cap
response = llm(f"Summarize these excerpts for {topic}:\n\n" + "\n---\n".join(relevant))
Token usage dropped from ~12,500 input tokens per page to ~3,200. Quality went up — fewer hallucinations, because the model wasn't drowning in noise.
2. System prompt audit: cut 740 tokens, lost nothing
Old system prompt: 1,180 tokens.
New: 440 tokens.
The win wasn't in what I added — it was in removing redundancy. Three things got deleted:
-
Re-stated tool descriptions. The model already knows what
web_searchdoes. One short line is enough. - "Think step-by-step" boilerplate. This is now a default behavior of the better reasoning models; you don't need to pay to remind them.
- Style guides the model already follows. "Be concise," "use markdown," "cite sources" — most of these are baked into the model's training. Repeating them costs tokens every single call.
I ran the same 50-task eval suite before and after. Output quality was statistically indistinguishable. The 740 tokens saved per call added up to about $180/month on my volume.
3. Multi-model routing: GPT-5.4 isn't always the answer
This was the biggest single win. I split my agent's steps into three tiers:
| Step | Old model | New model | Cost per call |
|---|---|---|---|
| Extract key facts from chunks | GPT-5.4 | Claude 4 Sonnet | $0.003 → $0.0008 |
| Draft structured summary | GPT-5.4 | GPT-5.4 | $0.018 (unchanged) |
| Quality check + rewrite | GPT-5.4 | Claude 4 Sonnet | $0.003 → $0.0008 |
The reasoning-tier model only touches the synthesis step. Everything else runs on a cheaper, faster model that's still good enough for extractive work.
Routing logic, in 15 lines:
def route(step):
if step.requires_reasoning:
return "gpt-5.4" # synthesis, planning, judgment calls
if step.requires_long_context:
return "claude-4-sonnet" # chunk summarization, fact extraction
return "gpt-5-mini" # formatting, light edits
The benchmarks that actually mattered
I didn't trust my gut on quality. I ran a 50-task eval suite with three different rubrics:
- Fact accuracy (ground truth vs. agent output, scored by an LLM-as-judge with citations)
- Citation coverage (% of claims that trace back to a source URL)
- User satisfaction (binary thumbs from my own review of the last 30 days)
Numbers, before vs. after:
| Metric | Before | After | Change |
|---|---|---|---|
| Cost per task | $5.40 | $2.05 | -62% |
| Median latency | 41s | 28s | -32% |
| Fact accuracy | 0.81 | 0.83 | +0.02 (noise) |
| Citation coverage | 67% | 89% | +22pp |
| User satisfaction | 0.74 | 0.78 | +0.04 |
Citation coverage went up because chunk-then-extract gives the model cleaner evidence to cite. Latency dropped because smaller models respond faster. Fact accuracy was a wash — which is what you want, because the whole point was to cut cost without hurting quality.
What I'd do differently if I started today
Three things, in order of ROI:
- Add a token budget per task, hardcoded. If your agent's burn rate exceeds $X per task, kill it and return a partial result. This single line of guardrail logic is worth more than any prompt tweak.
- Cache aggressively. Anything you've fetched or extracted before is free to reuse. I was re-scraping the same URLs on every run. A 24-hour cache on extraction results cut my external API spend by another 18%.
-
Log every model's contribution separately. If you can't see which step costs what, you can't optimize it. A simple CSV log of
{task_id, step, model, input_tokens, output_tokens, cost}per run is the highest-leverage observability you'll add this year.
The reflex in 2026 is to reach for a bigger model when quality dips. Most of the time, the answer is a smaller model with a tighter context.
What I learned
- Context stuffing is the silent killer. More context isn't always better — it's almost always more expensive.
- System prompts rot. The thing you wrote six months ago is probably three times longer than it needs to be.
- Multi-model routing is the highest-leverage cost optimization available — and it's not even close.
- Quality benchmarks beat intuition every time. "It feels about the same" is a coin flip. A 50-task eval suite is an answer.
- Citation coverage is the under-rated metric. It correlates with user trust more than any other number I tracked.
The agent didn't get smarter. The pipeline got more honest about what each step actually needs.
If you're running agents in production and you haven't looked at your per-step token breakdown in the last 30 days, that's where I'd start. The $847/month I'm saving came from one weekend and three files changed.
Top comments (0)