🧠 Key Idea
Stop thinking in terms of cost per request. Instead, measure cost per successful task, and break total spend into four buckets:
- Base generation
- Context bloat
- Retries & timeouts
- Tool/agent loops
By identifying which bucket dominates your spend, you know what to fix first.
🧰 What You Need Before Starting
To run this audit, gather whichever of these you have:
- Option A (best): per-request logs with model name, tokens, status, timestamp
- Option B: OpenAI usage export + partial app logs
- Option C: Total cost per model/day (estimate)
Even with limited data, you can still discover the biggest cost drivers.
⏱️ The 45-Minute Audit Plan
Minute 0–5: Define Your Unit of Success
Define what counts as a successful task, such as:
- Grounded answer with no fallback
- No retries/timeouts
- Tool workflow completes without loop
Then compute:
cost per successful task = total cost / successful tasks
This gives actionable grounding for the rest of the audit.
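The metric above can be sketched in a few lines, assuming per-request logs with a cost field and a success flag (all field names and values below are illustrative):

```python
# Minimal sketch: cost per successful task from per-request logs.
# `requests` is a hypothetical list of log records; field names are assumptions.
requests = [
    {"cost_usd": 0.012, "success": True},
    {"cost_usd": 0.034, "success": False},  # e.g. timed out after retries
    {"cost_usd": 0.008, "success": True},
]

total_cost = sum(r["cost_usd"] for r in requests)
successes = sum(1 for r in requests if r["success"])
# Failed requests still cost money, which is why this number is always
# at least as high as naive cost-per-request.
cost_per_success = total_cost / successes if successes else float("inf")
print(f"${cost_per_success:.4f} per successful task")
```

Note that the failed request's $0.034 still counts toward the numerator; that gap between cost-per-request and cost-per-success is exactly the waste the audit hunts for.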
Minute 5–15: Break Spend into Four Buckets
Break total spending into:
- Base generation tokens — prompt + normal output
- Context bloat tokens — system prompt, history, RAG context
- Retries & timeouts waste — tokens burned on failed attempts
- Tool/agent loop waste — unnecessary repeated calls
Rank these buckets to see which drives most spend.
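One way to rank the buckets, assuming your logs carry enough metadata to classify each request (every field name below is an assumption about what you log):

```python
# Sketch: assign each request's cost to one of the four buckets, then rank.
from collections import defaultdict

request_log = [
    {"cost_usd": 0.02, "input_tokens": 900, "needed_input_tokens": 300},  # bloated context
    {"cost_usd": 0.01, "input_tokens": 200, "is_retry": True},            # wasted retry
    {"cost_usd": 0.05, "input_tokens": 400, "needed_input_tokens": 400},  # normal request
]

def bucketize(req):
    if req.get("is_retry") or req.get("timed_out"):
        return "retries_timeouts"
    if req.get("tool_loop_repeat"):
        return "tool_agent_loops"
    # Context bloat: input tokens beyond what the task itself needed
    if req["input_tokens"] > req.get("needed_input_tokens", req["input_tokens"]):
        return "context_bloat"
    return "base_generation"

spend = defaultdict(float)
for req in request_log:
    spend[bucketize(req)] += req["cost_usd"]

ranked = sorted(spend.items(), key=lambda kv: kv[1], reverse=True)
```

The `needed_input_tokens` field is the hard part in practice; a pragmatic stand-in is the input size of the smallest prompt that still passes your evals for that intent.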
Minute 15–25: Token Spend Decomposition
Sample ~200–500 requests and compute:
- Input token breakdown: system + history + RAG + tool tokens
- Output token totals
- Retries/timeouts waste
Even rough estimates reveal which drivers are outsized.
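A rough decomposition over a sample might look like this, assuming per-segment token counts are available or can be reconstructed by re-tokenizing prompts offline (the numbers are made up):

```python
# Sketch: input-token decomposition over a small request sample.
# Per-segment counts (system/history/rag/tools) are assumed to be logged.
sample = [
    {"system": 800, "history": 1500, "rag": 2200, "tools": 300},
    {"system": 800, "history": 400,  "rag": 1800, "tools": 0},
]

segments = ("system", "history", "rag", "tools")
totals = {k: sum(r[k] for r in sample) for k in segments}
grand_input = sum(totals.values())
shares = {k: totals[k] / grand_input for k in segments}
# A dominant `rag` or `history` share points straight at context bloat.
```

Even on a few hundred sampled requests, the share breakdown usually makes one segment obviously outsized.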
Minute 25–35: Find the “Silent Spenders”
Sort requests by:
- Cost per request
- Highest input tokens
- Retry rates
- Tool loop counts
Typical patterns include:
- Context bloat
- Retry storms
- Agent/tool loops
- Model misrouting
- Over-generation
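Sorting the same sample a few different ways is enough to surface these patterns; a sketch with assumed field names:

```python
# Sketch: surface "silent spenders" by sorting the sample along each axis.
# Field names are assumptions about what your logs contain.
def top(requests, key, n=5):
    return sorted(requests, key=lambda r: r.get(key, 0), reverse=True)[:n]

requests = [
    {"id": "a", "cost_usd": 0.90, "input_tokens": 30000, "retries": 0, "tool_calls": 1},
    {"id": "b", "cost_usd": 0.05, "input_tokens": 1200,  "retries": 4, "tool_calls": 0},
    {"id": "c", "cost_usd": 0.30, "input_tokens": 8000,  "retries": 0, "tool_calls": 12},
]

by_cost  = top(requests, "cost_usd")    # candidates: context bloat, over-generation
by_retry = top(requests, "retries")     # candidates: retry storms
by_loops = top(requests, "tool_calls")  # candidates: agent/tool loops
```

Reading a handful of raw transcripts from each list is usually faster than any dashboard at naming the pattern.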
Minute 35–40: Segment Spend by Cohort
Break costs down by:
- Intent category
- Customer tier
- Product surface (chat vs agent)
- Language
This uncovers specific areas leaking spend.
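Cohort segmentation is a small group-by, assuming each request is tagged with metadata such as an intent label (the labels and costs below are illustrative):

```python
# Sketch: cost per successful task, broken down by cohort (here, intent).
from collections import defaultdict

requests = [
    {"intent": "billing", "cost_usd": 0.40, "success": True},
    {"intent": "billing", "cost_usd": 0.60, "success": False},
    {"intent": "faq",     "cost_usd": 0.02, "success": True},
]

cost = defaultdict(float)
wins = defaultdict(int)
for r in requests:
    cost[r["intent"]] += r["cost_usd"]
    wins[r["intent"]] += r["success"]

cps = {k: (cost[k] / wins[k] if wins[k] else float("inf")) for k in cost}
# "billing" pays for its failures too, so its cost per success is far
# higher than "faq" despite similar per-request prices.
```

The same group-by works for customer tier, product surface, or language; the cohorts with the worst cost per success are where the first fixes should land.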
Minute 40–45: Pick the First 3 Fixes
A typical prioritized fix order:
- Stop waste — cap retries, add circuit breakers
- Cap context — limit history + RAG context
- Route smart — cheaper model for low-risk intents
Even these simple changes can cut cost without reducing quality.
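As one example, the "stop waste" fix can be as small as a hard retry cap plus a basic circuit breaker; `call_model`, the thresholds, and the cooldown below are all illustrative assumptions, not a production pattern:

```python
# Sketch: retry cap + circuit breaker so failed calls stop burning tokens.
import time

MAX_RETRIES = 2          # hard cap: 1 initial attempt + 2 retries
FAILURE_THRESHOLD = 5    # consecutive failures before the breaker opens
COOLDOWN_SECONDS = 30

consecutive_failures = 0
breaker_open_until = 0.0

def guarded_call(call_model, prompt):
    global consecutive_failures, breaker_open_until
    if time.monotonic() < breaker_open_until:
        raise RuntimeError("circuit open: not spending tokens on a failing dependency")
    for attempt in range(1 + MAX_RETRIES):
        try:
            result = call_model(prompt)
            consecutive_failures = 0
            return result
        except Exception:
            consecutive_failures += 1
            if consecutive_failures >= FAILURE_THRESHOLD:
                breaker_open_until = time.monotonic() + COOLDOWN_SECONDS
                raise
    raise RuntimeError("retry budget exhausted")
```

The key property: a misbehaving dependency costs at most `1 + MAX_RETRIES` attempts per request, and after the threshold it costs nothing at all until the cooldown expires.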
📊 What the Audit Produces
After 45 minutes, you should have:
- A spend pie showing the four buckets
- Top cohorts by cost per success
- Top 5 “silent spender” patterns
- A ranked list of 3 practical fixes
- Validation checks & alerts for future regressions
🛑 What NOT To Do
- Don’t shorten system prompts blindly — evaluate first
- Don’t cap tokens globally — cap by risk or intent tier
- Don’t switch models without eval guards — cost cuts shouldn’t break accuracy
🔗 Related Reading
- AI Audit (full pipeline) — measure quality, latency, cost, and safety across your AI system
- LLM & RAG Audit Hub — framework, baselines, and troubleshooting for LLM production reliability
- OptyxStack — services for production AI reliability and optimization
Audit your spend before you optimize — waste often hides where you least expect it.