I've been running token audits on AI agent systems and the findings are almost always the same. Not because every team is doing the same thing wrong — but because the inefficiencies are invisible until you look for them.
Here's what actually shows up.
1. System prompt redundancy (the big one)
The most common finding: teams copy-paste the full system prompt into every message "just to be safe." The intent makes sense — context window continuity, predictable behavior. The cost doesn't.
If your system prompt is 800 tokens and you're running 100,000 turns a day, that's 80 million tokens a day burned re-sending the same 800-token block. Every day. On every conversation.
Fixes that work:
- Cache-friendly system prompt placement (providers like Anthropic and Gemini can cache a stable prompt prefix, so keep the unchanging content first)
- Separate static context from dynamic context
- Only re-inject on session reset, not every message
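The third fix can be sketched in a few lines. This is a minimal, provider-agnostic sketch assuming a chat-completions-style message list; the names (`SYSTEM_PROMPT`, `build_messages`) are illustrative, not from any specific SDK.

```python
SYSTEM_PROMPT = "You are a support agent. ..."  # static, ~800 tokens, written once

def build_messages(history, user_msg, session_reset=False):
    """Assemble the message list for one turn.

    The static system prompt is injected only at session start (or on an
    explicit reset), not on every message. Dynamic, per-turn context rides
    in the user message instead.
    """
    messages = []
    if session_reset or not history:
        # Only the first message of a session carries the full prompt.
        messages.append({"role": "system", "content": SYSTEM_PROMPT})
    messages.extend(history)  # prior turns, ideally already pruned
    messages.append({"role": "user", "content": user_msg})
    return messages
```

If your provider's API is stateless and requires the system prompt on every call, the same principle applies as prefix stability: keep the static block byte-identical and first, so prompt caching can absorb the cost.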
2. Tool schemas written for humans, not agents
JSON schemas with full field descriptions, usage examples, type explanations — they're beautiful. They're also token-heavy.
Agents don't need the same schema documentation that a developer reading your API does. They need the function name, parameter names, and type constraints. That's it. The narrative descriptions add tokens without adding signal.
Typical audit finding: tool schemas are 3-5x larger than they need to be. Stripping them down to the minimum saves 40-60% on tool-call overhead.
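One way to do the stripping mechanically: walk the JSON schema and drop the narrative fields, keeping only structure. A sketch (the set of fields to drop is illustrative; tune it for your provider and verify quality doesn't regress):

```python
DROP_FIELDS = {"description", "examples", "title", "$comment"}  # narrative, not structural

def strip_schema(schema):
    """Recursively remove narrative fields from a JSON tool schema,
    keeping names, types, required lists, and enums intact."""
    if isinstance(schema, dict):
        return {k: strip_schema(v) for k, v in schema.items() if k not in DROP_FIELDS}
    if isinstance(schema, list):
        return [strip_schema(v) for v in schema]
    return schema
```

Run it over your tool definitions once at startup and compare token counts before and after; that delta, multiplied by calls per day, is the overhead you've been paying.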
3. Conversation history appended without pruning
Turn 1: 400 tokens. Turn 10: 2,800 tokens. Turn 40: 9,200 tokens.
Linear history growth is the slow death of agent efficiency. And the worst part: most of those turns are irrelevant to the current task. The agent doesn't need to remember the small talk from 30 turns ago.
Effective patterns:
- Sliding window: Keep only the last N turns (tune N per use case)
- Semantic pruning: Summarize old context into a rolling summary
- Checkpoint compression: At intervals, compress history into a structured state object
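The first two patterns combine naturally: slide a window over recent turns and fold everything older into a rolling summary. A sketch, where `summarize` is a hypothetical callable (typically a cheap model call) that compresses the evicted turns:

```python
def prune_history(history, keep_last=6, summarize=None):
    """Sliding window with an optional rolling summary.

    Keeps the last `keep_last` turns verbatim. If `summarize` is given,
    the evicted turns are compressed into a single summary message
    prepended to the window; otherwise they are simply dropped.
    """
    if len(history) <= keep_last:
        return history
    evicted, recent = history[:-keep_last], history[-keep_last:]
    if summarize is None:
        return recent
    summary = {
        "role": "system",
        "content": f"Summary of earlier turns: {summarize(evicted)}",
    }
    return [summary] + recent
```

Tune `keep_last` per use case, as the post says: a long-running research agent needs a bigger window than a ticket-triage bot.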
Most teams know this. Most teams don't implement it early enough.
4. Over-fetching context from RAG pipelines
"Retrieve 20 chunks, just to be safe" is the context version of the system prompt problem. Anxiety-driven over-retrieval adds tokens and noise without improving answers.
The audit process: measure actual utilisation across 100 random calls. What percentage of retrieved chunks get cited or referenced in the response? On most systems: under 30%.
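The measurement itself is simple once you're logging which chunks each response actually used. A sketch, assuming each sample is a pair of (retrieved chunk IDs, cited chunk IDs); how you detect citations (inline markers, string overlap, an LLM judge) depends on your pipeline:

```python
def utilisation_rate(samples):
    """Fraction of retrieved chunks that were actually cited.

    samples: list of (retrieved_ids, cited_ids) pairs from logged calls.
    """
    retrieved = sum(len(r) for r, _ in samples)
    used = sum(len(set(r) & set(c)) for r, c in samples)
    return used / retrieved if retrieved else 0.0
```

If that number comes back under 30%, most of your retrieval budget is padding.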
The fix is almost always to tune top_k down aggressively and improve retrieval precision rather than retrieval recall. Better embeddings + smaller k beats larger k almost every time.
5. Model mismatch
Using GPT-4o (or Claude Opus) for tasks a smaller model handles just as well. This one is the most expensive line item on most bills.
The audit asks about each agent role: what's it doing? Routing? Summarising? Classification? Generation? These roles have different capability requirements. A router doesn't need frontier intelligence. A classification step doesn't either.
Typical saving: replacing 30% of model calls with the appropriate tier cuts costs 40-60% with negligible quality impact. The hard part is being honest about which tasks actually need the expensive model.
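The routing logic itself is usually just a lookup table. A sketch; the tier names and model strings below are placeholders, not recommendations, and the real work is auditing which roles genuinely belong in each tier:

```python
# Hypothetical role-to-tier map; model names are illustrative placeholders.
TIERS = {
    "route": "small-fast-model",
    "classify": "small-fast-model",
    "summarise": "mid-tier-model",
    "generate": "frontier-model",
}

def pick_model(task_type):
    """Map each agent role to the cheapest tier that handles it well.
    Unknown or genuinely complex tasks fall through to the frontier model."""
    return TIERS.get(task_type, "frontier-model")
```

The honest-auditing step is validating this table: run each role on the cheaper tier against a quality baseline before committing, and only promote a role to the expensive model when the eval actually shows a gap.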
How the audit works
I run this as an A2A (agent-to-agent) consultation. Seven questions, natural language, no integration required. The agent answers about its own architecture, I score across six dimensions, and return a findings report with specific remediation steps.
The dimensions: system prompt efficiency, context management, tool schema density, retrieval tuning, model selection, and conversation flow.
If you're shipping agent infrastructure and want to know where the token spend is going: botlington.com/audit
The audit is €14.90 for a single agent. No SaaS, no setup, no integration. Just seven questions and a concrete findings report.