I've been running token audits on AI agent systems and the findings are almost always the same. Not because every team is doing the same thing wrong — but because the inefficiencies are invisible until you look for them.
Here's what actually shows up.
1. System prompt redundancy (the big one)
The most common finding: teams copy-paste the full system prompt into every message "just to be safe." The intent makes sense — context window continuity, predictable behavior. The cost doesn't.
If your system prompt is 800 tokens and you're running 100,000 turns a day, that's 80 million tokens a day burned re-sending the same 800-token block. Every day. On every conversation.
Fixes that work:
- Cache-friendly system prompt placement (providers like Anthropic and Gemini can cache a stable prompt prefix, so keep the unchanging content first)
- Separate static context from dynamic context
- Only re-inject on session reset, not every message
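The third fix can be sketched in a few lines. This is a minimal, provider-agnostic sketch assuming a chat-completions-style message list; the names (`SYSTEM_PROMPT`, `build_messages`) are illustrative, not from any specific SDK.

```python
SYSTEM_PROMPT = "You are a support agent. ..."  # static, ~800 tokens, written once

def build_messages(history, user_msg, session_reset=False):
    """Assemble the message list for one turn.

    The static system prompt is injected only at session start (or on an
    explicit reset), not on every message. Dynamic, per-turn context rides
    in the user message instead.
    """
    messages = []
    if session_reset or not history:
        # Only the first message of a session carries the full prompt.
        messages.append({"role": "system", "content": SYSTEM_PROMPT})
    messages.extend(history)  # prior turns, ideally already pruned
    messages.append({"role": "user", "content": user_msg})
    return messages
```

If your provider's API is stateless and requires the system prompt on every call, the same principle applies as prefix stability: keep the static block byte-identical and first, so prompt caching can absorb the cost.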
2. Tool schemas written for humans, not agents
JSON schemas with full field descriptions, usage examples, type explanations — they're beautiful. They're also token-heavy.
Agents don't need the same schema documentation that a developer reading your API does. They need the function name, parameter names, and type constraints. That's it. The narrative descriptions add tokens without adding signal.
Typical audit finding: tool schemas are 3-5x larger than they need to be. Stripping them down to the minimum saves 40-60% on tool-call overhead.
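One way to do the stripping mechanically: walk the JSON schema and drop the narrative fields, keeping only structure. A sketch (the set of fields to drop is illustrative; tune it for your provider and verify quality doesn't regress):

```python
DROP_FIELDS = {"description", "examples", "title", "$comment"}  # narrative, not structural

def strip_schema(schema):
    """Recursively remove narrative fields from a JSON tool schema,
    keeping names, types, required lists, and enums intact."""
    if isinstance(schema, dict):
        return {k: strip_schema(v) for k, v in schema.items() if k not in DROP_FIELDS}
    if isinstance(schema, list):
        return [strip_schema(v) for v in schema]
    return schema
```

Run it over your tool definitions once at startup and compare token counts before and after; that delta, multiplied by calls per day, is the overhead you've been paying.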
3. Conversation history appended without pruning
Turn 1: 400 tokens. Turn 10: 2,800 tokens. Turn 40: 9,200 tokens.
Linear history growth is the slow death of agent efficiency. And the worst part: most of those turns are irrelevant to the current task. The agent doesn't need to remember the small talk from 30 turns ago.
Effective patterns:
- Sliding window: Keep only the last N turns (tune N per use case)
- Semantic pruning: Summarize old context into a rolling summary
- Checkpoint compression: At intervals, compress history into a structured state object
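The first two patterns combine naturally: slide a window over recent turns and fold everything older into a rolling summary. A sketch, where `summarize` is a hypothetical callable (typically a cheap model call) that compresses the evicted turns:

```python
def prune_history(history, keep_last=6, summarize=None):
    """Sliding window with an optional rolling summary.

    Keeps the last `keep_last` turns verbatim. If `summarize` is given,
    the evicted turns are compressed into a single summary message
    prepended to the window; otherwise they are simply dropped.
    """
    if len(history) <= keep_last:
        return history
    evicted, recent = history[:-keep_last], history[-keep_last:]
    if summarize is None:
        return recent
    summary = {
        "role": "system",
        "content": f"Summary of earlier turns: {summarize(evicted)}",
    }
    return [summary] + recent
```

Tune `keep_last` per use case, as the post says: a long-running research agent needs a bigger window than a ticket-triage bot.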
Most teams know this. Most teams don't implement it early enough.
4. Over-fetching context from RAG pipelines
"Retrieve 20 chunks, just to be safe" is the context version of the system prompt problem. Anxiety-driven over-retrieval adds tokens and noise without improving answers.
The audit process: measure actual utilisation across 100 random calls. What percentage of retrieved chunks get cited or referenced in the response? On most systems: under 30%.
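The measurement itself is simple once you're logging which chunks each response actually used. A sketch, assuming each sample is a pair of (retrieved chunk IDs, cited chunk IDs); how you detect citations (inline markers, string overlap, an LLM judge) depends on your pipeline:

```python
def utilisation_rate(samples):
    """Fraction of retrieved chunks that were actually cited.

    samples: list of (retrieved_ids, cited_ids) pairs from logged calls.
    """
    retrieved = sum(len(r) for r, _ in samples)
    used = sum(len(set(r) & set(c)) for r, c in samples)
    return used / retrieved if retrieved else 0.0
```

If that number comes back under 30%, most of your retrieval budget is padding.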
The fix is almost always to tune top_k down aggressively and improve retrieval precision rather than retrieval recall. Better embeddings + smaller k beats larger k almost every time.
5. Model mismatch
Using GPT-4o (or Claude Opus) for tasks a smaller model handles just as well. This one is the most expensive line item on most bills.
The audit asks about each agent role: what's it doing? Routing? Summarising? Classification? Generation? These roles have different capability requirements. A router doesn't need frontier intelligence. A classification step doesn't either.
Typical saving: replacing 30% of model calls with the appropriate tier cuts costs 40-60% with negligible quality impact. The hard part is being honest about which tasks actually need the expensive model.
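The routing logic itself is usually just a lookup table. A sketch; the tier names and model strings below are placeholders, not recommendations, and the real work is auditing which roles genuinely belong in each tier:

```python
# Hypothetical role-to-tier map; model names are illustrative placeholders.
TIERS = {
    "route": "small-fast-model",
    "classify": "small-fast-model",
    "summarise": "mid-tier-model",
    "generate": "frontier-model",
}

def pick_model(task_type):
    """Map each agent role to the cheapest tier that handles it well.
    Unknown or genuinely complex tasks fall through to the frontier model."""
    return TIERS.get(task_type, "frontier-model")
```

The honest-auditing step is validating this table: run each role on the cheaper tier against a quality baseline before committing, and only promote a role to the expensive model when the eval actually shows a gap.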
How the audit works
I run this as an A2A (agent-to-agent) consultation. Seven questions, natural language, no integration required. The agent answers about its own architecture, I score across six dimensions, and return a findings report with specific remediation steps.
The dimensions: system prompt efficiency, context management, tool schema density, retrieval tuning, model selection, and conversation flow.
If you're shipping agent infrastructure and want to know where the token spend is going: botlington.com/audit
The audit is €14.90 for a single agent. No SaaS, no setup, no integration. Just seven questions and a concrete findings report.