Your OpenAI invoice shows token counts. What it doesn't show is how many of those tokens produced nothing useful. Failed calls, retries, model over-provisioning, and calls that returned an answer nobody used - these are where agent costs actually go, and none of them appear in standard billing dashboards.
Cost Per Call vs Cost Per Successful Outcome
The metric that matters isn't cost per call. It's cost per successful outcome.
If your agent costs $0.008 per call but succeeds 60% of the time, your real cost per successful outcome is $0.013. If a different configuration costs $0.012 per call but succeeds 90% of the time, the real cost is $0.013. They're equivalent on a per-outcome basis, but only one of those configurations surfaces as "expensive" in a naive cost analysis.
This is the measurement problem: optimizing for cost per call without tracking success rate will push you toward cheaper models that fail more, which can increase cost per successful outcome while appearing to save money.
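The division is worth writing down, because it's the metric everything below is measured against. A minimal sketch:

```python
def cost_per_success(cost_per_call: float, success_rate: float) -> float:
    """Effective cost of one successful outcome, amortizing failed calls."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_call / success_rate

# The two configurations above converge once failures are amortized:
cheap_but_flaky = cost_per_success(0.008, 0.60)   # ≈ $0.0133
pricier_reliable = cost_per_success(0.012, 0.90)  # ≈ $0.0133
```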
The hidden cost drivers that inflate cost-per-outcome:
- Retries: A call that retries 3 times costs 4x the nominal price. If 15% of your calls retry three times, your effective cost is 1.45x what the dashboard shows.
- Failed calls that still consume tokens: Partial responses, malformed JSON that fails schema validation, outputs that pass syntactic checks but fail downstream use - these all bill full token counts.
- Model over-provisioning: Using gpt-4o for a task that gpt-4o-mini handles reliably carries a roughly 30x cost premium at current list prices. Over-provisioning is often a conservative choice made during development that never gets revisited.
- Context window bloat: System prompts that grew organically, conversation history that accumulates without pruning, tool definitions included in every call whether needed or not.
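The retry arithmetic in the first bullet generalizes: if a fraction of calls ends up retrying, each such call bills (1 + retries) times. A quick sketch, assuming each retrying call retries three times as in the example:

```python
def effective_cost_multiplier(retry_fraction: float, retries_per_retrying_call: int) -> float:
    # Non-retrying calls bill once; retrying calls bill (1 + retries) times.
    return (1 - retry_fraction) + retry_fraction * (1 + retries_per_retrying_call)

print(effective_cost_multiplier(0.15, 3))  # 0.85 + 0.15 * 4 = 1.45
```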
The Token Cost Math
Let's run actual numbers on a classification task that runs 10,000 times per day.
The task: classify a support ticket into one of 12 categories. Input is typically 200-400 tokens (ticket text + system prompt). Output is a JSON blob with the category and confidence score - roughly 40 tokens.
With gpt-4o:
- Input: 300 tokens avg * $5.00/1M = $0.0015 per call
- Output: 40 tokens * $15.00/1M = $0.0006 per call
- Per call: $0.0021
- Daily (10k calls): $21.00
- Monthly: ~$630
With gpt-4o-mini:
- Input: 300 tokens * $0.15/1M = $0.000045 per call
- Output: 40 tokens * $0.60/1M = $0.000024 per call
- Per call: $0.000069
- Daily (10k calls): $0.69
- Monthly: ~$21
That's a 30x cost difference. The question is whether gpt-4o-mini is reliable enough for your classification task. If it succeeds 95% of the time and gpt-4o succeeds 97%, the per-outcome cost still favors gpt-4o-mini by roughly 30x.
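These figures can be checked directly from the quoted prices and token counts. The pricing values below are the ones used in this article and will drift over time:

```python
PRICING = {  # USD per 1M tokens, as quoted in this article
    "gpt-4o": {"input": 5.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def per_call_cost(model: str, input_tokens: int = 300, output_tokens: int = 40) -> float:
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

def per_outcome_cost(model: str, success_rate: float) -> float:
    # Cost per successful outcome: per-call cost amortized over successes
    return per_call_cost(model) / success_rate

# $0.0021 / 0.97 vs $0.000069 / 0.95 -> roughly a 30x per-outcome advantage
advantage = per_outcome_cost("gpt-4o", 0.97) / per_outcome_cost("gpt-4o-mini", 0.95)
```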
Now layer in a synthesis task: summarizing escalated tickets for a human reviewer. Input is 800-1200 tokens, output is 200-300 tokens. This is where model quality meaningfully affects output usefulness.
With gpt-4o-mini for synthesis:
- The summaries are functional but miss nuance. Human reviewers request rewrites 18% of the time.
- Effective cost per good summary: higher than face value, because every rewrite burns another API call plus reviewer time.
With gpt-4o for synthesis:
- Higher per-call cost, but the rewrite rate drops to 4%.
- Once reviewer time on rewrites is priced in, cost per usable summary is lower despite the higher model price.
Classification and synthesis are different task types with different model sensitivity profiles. A static routing strategy that uses the same model for both is almost certainly wrong.
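On API cost alone, gpt-4o-mini still wins this comparison even with an 18% rewrite rate; the trade-off only flips once reviewer time is priced in. A sketch under assumed numbers (1,000 input / 250 output tokens per summary at the quoted prices, and a hypothetical $2.00 of reviewer time per requested rewrite):

```python
def cost_per_usable_summary(api_cost: float, rewrite_rate: float,
                            reviewer_cost_per_rewrite: float = 2.00) -> float:
    # Each attempt costs the API call; a fraction of attempts also costs
    # reviewer time on the rewrite request. Dividing by the acceptance
    # rate amortizes rejected attempts over the usable ones.
    per_attempt = api_cost + rewrite_rate * reviewer_cost_per_rewrite
    return per_attempt / (1 - rewrite_rate)

# Assumed per-summary API costs: mini ≈ $0.0003, gpt-4o ≈ $0.00875
mini = cost_per_usable_summary(api_cost=0.0003, rewrite_rate=0.18)    # ≈ $0.44
gpt4o = cost_per_usable_summary(api_cost=0.00875, rewrite_rate=0.04)  # ≈ $0.09
```

The crossover point depends entirely on the reviewer-time figure; with free reviews, the cheaper model still wins on raw API cost.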
Cost-Constrained Routing
The goal is to route by expected outcome quality first, with cost as a secondary constraint. This is different from routing by cost first - a cheaper model that fails more doesn't save money, it shifts costs from the API bill to operational overhead and user churn.
```python
import json

import openai
from kalibr import Router

VALID_CATEGORIES = {
    "billing", "technical", "account", "feature_request", "complaint",
    "shipping", "returns", "warranty", "installation", "data", "security", "other",
}


def classify_ticket(ticket_text: str, goal_id: str) -> dict:
    router = Router(
        goal_id=goal_id,
        task_type="ticket_classification",
    )
    # Cost constraint: classification is simple, budget accordingly
    policy = router.get_policy(
        constraints={"max_cost_usd": 0.002}  # Hard ceiling per call
    )

    client = openai.OpenAI()
    response = client.chat.completions.create(
        model=policy.model,  # Router selects from eligible models under the cost ceiling
        messages=[
            {
                "role": "system",
                "content": (
                    "Classify the support ticket into one of these categories: "
                    "billing, technical, account, feature_request, complaint, "
                    "shipping, returns, warranty, installation, data, security, other. "
                    'Respond with JSON: {"category": "...", "confidence": 0.0-1.0}'
                ),
            },
            {"role": "user", "content": ticket_text},
        ],
        response_format={"type": "json_object"},
    )

    result = response.choices[0].message.content
    cost = _calculate_cost(policy.model, response.usage)
    router.record_outcome(
        success=_is_valid_classification(result),
        quality_score=_parse_confidence(result),
        cost_usd=cost,
    )
    return {"result": result, "model": policy.model, "cost": cost}


def synthesize_summary(ticket_text: str, history: list[str], goal_id: str) -> str:
    router = Router(
        goal_id=goal_id,
        task_type="ticket_synthesis",
    )
    # Higher budget for synthesis - quality directly affects human reviewer time
    policy = router.get_policy(
        constraints={"max_cost_usd": 0.05}
    )

    client = openai.OpenAI()
    messages = [
        {
            "role": "system",
            "content": (
                "Summarize this support ticket and its history for a senior reviewer. "
                "Include: core issue, customer sentiment, previous resolution attempts, "
                "recommended next action."
            ),
        },
        {"role": "user", "content": f"Ticket: {ticket_text}\n\nHistory:\n" + "\n".join(history)},
    ]
    response = client.chat.completions.create(
        model=policy.model,
        messages=messages,
    )

    content = response.choices[0].message.content
    cost = _calculate_cost(policy.model, response.usage)
    router.record_outcome(
        success=_is_useful_summary(content),  # downstream signal, e.g. no rewrite requested
        cost_usd=cost,
    )
    return content


def _calculate_cost(model: str, usage) -> float:
    # Prices per 1M tokens (update as pricing changes)
    pricing = {
        "gpt-4o": {"input": 5.00, "output": 15.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
        "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    }
    p = pricing.get(model, {"input": 5.00, "output": 15.00})
    return (usage.prompt_tokens * p["input"] + usage.completion_tokens * p["output"]) / 1_000_000


def _is_valid_classification(result: str) -> bool:
    # A classification counts as a success only if it parses and names a known category
    try:
        return json.loads(result).get("category") in VALID_CATEGORIES
    except (json.JSONDecodeError, AttributeError):
        return False


def _parse_confidence(result: str) -> float:
    try:
        return float(json.loads(result).get("confidence", 0.0))
    except (json.JSONDecodeError, TypeError, ValueError):
        return 0.0
```
The max_cost_usd constraint tells the router which models are eligible. Within the eligible set, Thompson Sampling selects based on historical outcome rates for that task type. A model that costs $0.0018 per call but succeeds 91% of the time beats a model that costs $0.0015 per call and succeeds 70% of the time (roughly $0.0020 vs $0.0021 per successful outcome) - and both beat unconstrained gpt-4o at $0.0021 per call if your classification task doesn't need that capability.
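As an illustration of the mechanism - a sketch of Thompson Sampling as a Beta-Bernoulli bandit with a cost-eligibility filter, not Kalibr's actual implementation:

```python
import random


class ModelArm:
    """One candidate model with a Beta posterior over its success rate."""

    def __init__(self, name: str, cost_per_call: float):
        self.name = name
        self.cost = cost_per_call
        self.successes = 1  # Beta(1, 1) uniform prior
        self.failures = 1

    def sample_success_rate(self) -> float:
        # A plausible success rate drawn from the posterior
        return random.betavariate(self.successes, self.failures)

    def record(self, success: bool) -> None:
        if success:
            self.successes += 1
        else:
            self.failures += 1


def select_model(arms: list[ModelArm], max_cost_usd: float) -> ModelArm:
    # Outcome-first: rank by a posterior draw of success rate,
    # with the cost ceiling applied as an eligibility filter.
    eligible = [a for a in arms if a.cost <= max_cost_usd]
    if not eligible:
        raise ValueError("no model fits the cost ceiling")
    return max(eligible, key=lambda a: a.sample_success_rate())
```

Arms with strong track records get picked most of the time, while uncertain arms still get occasional exploratory traffic.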
The Trust Invariant
Cost constraints work correctly only if the routing system is already optimizing for outcomes. If the router is outcome-first - selecting models based on historical success rates, then filtering by cost - then a cost ceiling safely removes expensive models without degrading reliability below the constraint.
If cost is the primary optimization target, the system will find the cheapest model that passes some minimum bar. That minimum bar is usually defined by your development-time evals, which is a different distribution than production traffic. The result: cost goes down in testing, cost-per-outcome goes up in production.
Kalibr's router is outcome-first. The max_cost_usd constraint is a filter applied after candidate models are ranked by expected outcome quality. You're not asking "what's the cheapest model that might work?" You're asking "among the models that are likely to work, which ones fit my budget?"
This is the difference between routing and price shopping.
Measuring the Hidden Costs
To actually reduce cost per outcome, you need to measure the components that your billing dashboard hides:
```python
import kalibr

# After a week of tracked outcomes
cost_report = kalibr.get_insights(
    lookback_hours=168,
    include_cost_breakdown=True,
)

print(f"Total spend: ${cost_report.total_cost_usd:.2f}")
print(f"Successful outcomes: {cost_report.successful_outcomes}")
print(f"Failed calls (billed): {cost_report.failed_calls_count}")
print(f"Cost on failed calls: ${cost_report.failed_call_cost_usd:.2f}")
print(f"Retry overhead: {cost_report.retry_multiplier:.2f}x")
print(f"Cost per successful outcome: ${cost_report.cost_per_success:.4f}")

for task_type in cost_report.by_task:
    print(f"\n{task_type.name}:")
    print(f"  Success rate: {task_type.success_rate:.1%}")
    print(f"  Avg cost/call: ${task_type.avg_cost_per_call:.5f}")
    print(f"  Cost/success: ${task_type.cost_per_success:.5f}")
    print(f"  Over-provisioning flag: {task_type.uses_higher_tier_than_needed}")
```
The over-provisioning flag (uses_higher_tier_than_needed) is set when the router's historical data shows that cheaper models perform equivalently well on a task type but a more expensive model is being selected. This surfaces optimization opportunities that aren't visible in aggregate cost numbers.
What Changes When You Measure Correctly
The companies that reduce LLM costs meaningfully aren't the ones that found a cheaper API. They're the ones that measured cost per successful outcome, found where the hidden costs were concentrated, and optimized those. Typically:
- 20-30% of spend is on calls that failed but billed full tokens
- 15-25% is retry overhead from transient failures that weren't rate-limited correctly
- 30-50% is model over-provisioning on task types where a cheaper model performs equivalently
None of these appear as line items on your OpenAI invoice. All of them are measurable if you're tracking outcomes alongside costs.
See "When Your AI Agent Should Fail Fast Instead of Retry" for how retry decisions interact with cost.