You wake up to a $200 API bill. Your agent ran all night. It looked healthy — heartbeat green, no errors, process running. But token usage went from 200/min to 40,000/min because it was stuck re-parsing a malformed response in a loop.
This is the most expensive failure mode in AI agent operations, and traditional monitoring won't catch it.
## Why cost tracking matters for AI agents
Traditional services have relatively predictable costs. A web server handles N requests per second, each costing roughly the same in compute.
AI agents are different. A single LLM call can cost anywhere from $0.001 to $2.00 depending on the model, context size, and output length. A logic loop that retries the same failing operation can burn through hundreds of dollars in minutes.
The key insight: for LLM-backed agents, cost is a health metric, not just a billing metric.
## The pattern: cost per heartbeat cycle
Instead of tracking total spend, track cost per work cycle:
```python
while True:
    start_tokens = get_token_count()
    result = do_llm_work()
    end_tokens = get_token_count()

    tokens_used = end_tokens - start_tokens
    cost = calculate_cost(tokens_used)
    heartbeat(tokens=tokens_used, cost_usd=cost)
    sleep(interval)
```
Now you have a time series of cost-per-cycle. Normal is ~200 tokens. If it jumps to 40,000, you know immediately.
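The `calculate_cost` helper in the loop above can be sketched as a lookup against per-model pricing. This is a minimal version; the price table and helper name are illustrative, so check your provider's current rate card before relying on the numbers:

```python
# Illustrative per-1M-token prices in USD (example values, not current rates)
PRICE_PER_MTOK = {
    "gpt-4o-mini": 0.15,
    "gpt-4o": 2.50,
}

def calculate_cost(tokens_used: int, model: str = "gpt-4o-mini") -> float:
    """Rough estimate: tokens times price per million tokens.

    Unknown models fall back to the most expensive rate, so a
    misconfigured model name fails loud (high cost) rather than silent.
    """
    rate = PRICE_PER_MTOK.get(model, max(PRICE_PER_MTOK.values()))
    return tokens_used / 1_000_000 * rate
```

A blended rate like this is deliberately crude; if you need precision, track prompt and completion tokens separately, since providers price them differently.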
## What to track
| Metric | Why | Alert threshold |
|---|---|---|
| Tokens per cycle | Catch loops | 10x above 24h average |
| Cost per hour | Budget protection | Fixed dollar amount |
| Tool calls per cycle | Catch recursive tool use | 5x above baseline |
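The "10x above 24h average" threshold from the table can be implemented with a rolling window of recent cycles. A minimal sketch, where the class name, window size, and alert factor are all my own illustrative choices:

```python
from collections import deque

class SpikeDetector:
    """Flag cycles whose token count exceeds a multiple of the rolling average."""

    def __init__(self, window: int = 1440, factor: float = 10.0):
        # window=1440 assumes one cycle per minute, i.e. a 24h rolling baseline
        self.history = deque(maxlen=window)
        self.factor = factor

    def observe(self, tokens: int) -> bool:
        """Record one cycle's token count; return True if it spikes."""
        spiked = bool(self.history) and tokens > self.factor * (
            sum(self.history) / len(self.history)
        )
        self.history.append(tokens)
        return spiked
```

Note the detector compares against the average *before* recording the new value, so a single runaway cycle can't immediately drag the baseline up and mask itself.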
## Auto-tracking with SDK monkey-patching
If you use OpenAI or Anthropic SDKs, you can patch the API client to automatically track every call without changing your application code:
```python
# Wrap the OpenAI client so every call is tracked automatically
original_create = openai_client.chat.completions.create

def tracked_create(*args, **kwargs):
    response = original_create(*args, **kwargs)
    if response.usage:
        track_tokens(response.usage.total_tokens, model=kwargs.get("model"))
    return response

openai_client.chat.completions.create = tracked_create
```
The wrapper intercepts the API call, extracts usage.total_tokens from the response, estimates cost based on the model, and logs it. You can pipe this into your existing monitoring stack or a simple SQLite database.
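Here is one possible shape for the SQLite option: a `track_tokens` that appends each call to a local table. The schema, table name, and `open_usage_db` helper are my own choices, not part of any SDK:

```python
import sqlite3
import time

def open_usage_db(path: str = "agent_costs.db") -> sqlite3.Connection:
    """Open (or create) a local SQLite database for per-call usage rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS token_usage ("
        "ts REAL, model TEXT, tokens INTEGER, cost_usd REAL)"
    )
    return conn

def track_tokens(conn: sqlite3.Connection, total_tokens: int,
                 model: str, cost_usd: float = 0.0) -> None:
    """Append one API call's usage; commit so rows survive a crash."""
    conn.execute(
        "INSERT INTO token_usage VALUES (?, ?, ?, ?)",
        (time.time(), model, total_tokens, cost_usd),
    )
    conn.commit()
```

From there, the per-cycle and per-hour aggregates in the table above are one `SELECT SUM(tokens) ... WHERE ts > ?` away.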
## Cost alerting strategies
### 1. Absolute threshold
Alert if hourly cost exceeds $X. Simple, catches catastrophic loops.
### 2. Relative spike
Alert if current cycle cost is 10x+ above the rolling 24-hour average. Catches loops that start gradually.
### 3. Budget gate
Hard-stop the agent if daily spend exceeds a configured limit. Last line of defense.
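A budget gate can be as small as a counter that raises once cumulative spend passes the limit, forcing the agent loop to exit. A minimal sketch, where the class name is my own and the daily reset is left to the caller:

```python
class BudgetGate:
    """Hard-stop guard: raise once cumulative spend exceeds the daily limit."""

    def __init__(self, daily_limit_usd: float):
        self.daily_limit_usd = daily_limit_usd
        self.spent_today = 0.0  # a real agent would reset this at midnight

    def record(self, cost_usd: float) -> None:
        """Add one cycle's cost; raise if the budget is blown."""
        self.spent_today += cost_usd
        if self.spent_today > self.daily_limit_usd:
            raise RuntimeError(
                f"Daily budget exceeded: ${self.spent_today:.2f} > "
                f"${self.daily_limit_usd:.2f}"
            )
```

Raising an exception (rather than logging and continuing) is the point: the gate only works as a last line of defense if it actually stops the loop.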
## The real-world numbers
From running three production agents with cost tracking:
- Normal operation: $0.01-0.05 per day per agent (gpt-4o-mini, ~50 tokens/cycle)
- Loop incident: $50 in 40 minutes (40,000 tokens/min)
- Detection time with cost tracking: < 60 seconds
- Detection time without: 6+ hours (discovered via billing alert next morning)
The difference between a $0.50 incident and a $200 incident is whether you detect the cost spike in real time.
## Summary
- Track tokens per work cycle, not just total spend
- Alert on 10x spikes above baseline
- Use SDK monkey-patching to auto-track without code changes
- Set a hard daily budget gate as last resort
Cost isn't just a billing concern for AI agents — it's the single best health signal for catching the failure modes that traditional monitoring misses.
I built ClevAgent after dealing with exactly these problems. But the pattern matters more than the tool — even a simple SQLite table tracking tokens-per-cycle would have saved me that $200 bill.