This is Part 7 of the "Building Kepion" series — an AI platform that deploys companies from a text description using 31 specialized agents. Start from Part 1.
Last week I ran my first real cost benchmark across all 4 model tiers. The results looked great — too great. MiniMax M2.7 appeared to outperform Claude Opus at 7% of the cost. I almost published that claim.
Then I checked the raw numbers. And found a $12 bug that had been silently inverting every score-per-dollar comparison in my dashboard.
## The setup: why cost tracking matters in multi-agent systems
Kepion routes requests to 300+ models through OpenRouter, organized in 4 tiers:
| Tier | Models | Cost/1M tokens |
|---|---|---|
| Free | Llama 3.3 70B | $0 |
| Budget | DeepSeek V3, Gemini Flash | $0.14–0.60 |
| Performance | MiniMax M2.7 | $0.30–1.20 |
| Premium | Claude Sonnet/Opus 4.6 | $3–25 |
The cost intelligence dashboard tracks every API call: which agent, which model, how many tokens, how much it cost, and how long it took. The headline metric is score per dollar — quality score divided by cost. Higher means better value.
This metric drives real decisions. If an agent consistently scores 8.5/10 on a $0.03 call, that's 283 score/$. If Opus scores 9.2/10 on a $0.45 call, that's 20.4 score/$. The cheaper model wins on efficiency — and the system uses this data to auto-downgrade agents that don't need premium models.
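The metric itself is a one-line division. As a concrete sketch (the function name is illustrative, not Kepion's actual code):

```python
def score_per_dollar(quality_score: float, cost_usd: float) -> float:
    """Quality score (0-10) divided by call cost in USD. Higher = better value."""
    if cost_usd <= 0:
        return float("inf")  # free-tier models have unbounded score/$
    return quality_score / cost_usd

# The two calls from the example above:
score_per_dollar(8.5, 0.03)  # ~283 score/$ for the cheap model
score_per_dollar(9.2, 0.45)  # ~20.4 score/$ for Opus
```

Note the division by cost: any error in the denominator flows straight into the ranking, which is exactly what the bug below exploited.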
The entire 4-tier routing strategy depends on this number being correct.
## The bug: input vs output token costs
Here's what happened. OpenRouter charges differently for input tokens and output tokens. For Claude Opus 4.6, the pricing is:
Input: $5.00 / 1M tokens
Output: $25.00 / 1M tokens
That's a 5x difference. For DeepSeek V3:
Input: $0.14 / 1M tokens
Output: $0.28 / 1M tokens
Only a 2x difference.
My cost tracker was calculating cost like this:
```python
def calculate_cost(tokens_in: int, tokens_out: int, model: str) -> float:
    price = MODEL_PRICES[model]  # single price per 1M tokens
    total_tokens = tokens_in + tokens_out
    return (total_tokens / 1_000_000) * price
```
One price. For both input and output. The MODEL_PRICES dictionary stored the input price only.
For a typical agent call — say 2,400 tokens in, 5,100 tokens out — here's what happens:
Opus (actual cost):
Input: 2,400 × $5.00/1M = $0.012
Output: 5,100 × $25.00/1M = $0.1275
Total: $0.1395
Opus (what my tracker calculated):
Total: 7,500 × $5.00/1M = $0.0375
My tracker was reporting $0.04 for a call that actually cost $0.14. It was underreporting Opus costs by 73%.
But for DeepSeek V3 the gap is much smaller, because its output price is only 2x its input price rather than 5x. For the same 2,400-in / 5,100-out call, the tracker reported $0.00105 against an actual $0.00176, an undercount of roughly 40%.
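You can check both underreporting figures directly. A quick sketch, using the prices above (helper names are mine, not from the tracker):

```python
def true_cost(tokens_in: int, tokens_out: int,
              in_price: float, out_price: float) -> float:
    """Correct cost: input and output priced separately (per 1M tokens)."""
    return (tokens_in * in_price + tokens_out * out_price) / 1_000_000

def buggy_cost(tokens_in: int, tokens_out: int, in_price: float) -> float:
    """The bug: all tokens billed at the input price."""
    return ((tokens_in + tokens_out) * in_price) / 1_000_000

def undercount(tokens_in, tokens_out, in_price, out_price) -> float:
    """Fraction of real cost the buggy tracker failed to report."""
    real = true_cost(tokens_in, tokens_out, in_price, out_price)
    return 1 - buggy_cost(tokens_in, tokens_out, in_price) / real

undercount(2_400, 5_100, 5.00, 25.00)  # Opus: ~0.73
undercount(2_400, 5_100, 0.14, 0.28)   # DeepSeek V3: ~0.40
```

The undercount depends only on the input/output token ratio and the input/output price ratio, which is why models with steep output pricing were hit hardest.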
## How this inverts the comparison
The score/$ metric divides quality by cost. When you underreport the expensive model's cost more than the cheap model's cost, the expensive model looks relatively cheaper than it actually is.
Here's the real comparison for a typical architecture task:
| Model | Score | Real Cost | Real Score/$ | Buggy Cost | Buggy Score/$ |
|---|---|---|---|---|---|
| Opus 4.6 | 9.2 | $0.139 | 66.2 | $0.038 | 242.1 |
| MiniMax M2.7 | 8.5 | $0.018 | 472.2 | $0.012 | 708.3 |
With the bug, MiniMax looks 2.9x more cost-efficient than Opus. In reality, it's 7.1x more efficient. The ratio was directionally correct — MiniMax is more cost-efficient — but the magnitude was wrong by a factor of 2.4.
That means every auto-downgrade decision was less aggressive than it should have been. The system was keeping agents on expensive models longer than necessary, because the cost difference looked smaller than it really was.
## The $12 impact
Over a week of development and testing, the cumulative error was about $12. The tracker reported $47 in total spending. Actual spending was $59.
$12 doesn't sound like much. But consider:
- It's a 25% undercount. If you're budgeting $200/month for AI costs, you're actually spending $250.
- It compounds at scale. With 100 concurrent businesses, each running 5-10 agent chains per day, that $12/week becomes $120/week — over $6,000/year of invisible cost.
- It corrupts every downstream metric. Cost anomaly detection, circuit breaker thresholds, tier recommendations — all based on wrong numbers.
- Auto-downgrade fires less often. The system thinks Opus is cheap enough to keep using when it should be suggesting M2.7.
The cost circuit breaker has 4 levels:
```python
limits = {
    "per_request": 2.00,
    "per_agent_hourly": 10.00,
    "per_business_daily": 50.00,
    "platform_hourly": 100.00,
}
```
With 73% underreporting on premium models, the per-agent hourly limit ($10) wouldn't trigger until actual spending hit $37. A runaway Opus loop could drain $37/hour before anyone noticed.
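That $37 figure falls out of a one-line calculation. A hypothetical sketch (not Kepion's breaker code) of how far real spending can run before an underreporting tracker trips a limit:

```python
limits = {
    "per_request": 2.00,
    "per_agent_hourly": 10.00,
    "per_business_daily": 50.00,
    "platform_hourly": 100.00,
}

def actual_trip_point(limit: float, reported_fraction: float) -> float:
    """Actual dollars spent before the *reported* total reaches the limit.

    With 73% underreporting the tracker sees only 27% of real spend,
    so reported_fraction = 0.27.
    """
    return limit / reported_fraction

actual_trip_point(limits["per_agent_hourly"], 0.27)  # ~37.0
```

The same math applies to every level: the $50 daily business cap would not fire until roughly $185 of real Opus spend.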
## The fix
Two changes:
```python
# Before: single price per model
MODEL_PRICES = {
    "anthropic/claude-opus-4.6": 5.00,
    "deepseek/deepseek-chat-v3": 0.14,
    # ...
}

def calculate_cost(tokens_in, tokens_out, model):
    price = MODEL_PRICES[model]
    return ((tokens_in + tokens_out) / 1_000_000) * price
```

```python
# After: split input/output pricing (per 1M tokens)
MODEL_PRICES = {
    "anthropic/claude-opus-4.6": {"input": 5.00, "output": 25.00},
    "anthropic/claude-sonnet-4.6": {"input": 3.00, "output": 15.00},
    "deepseek/deepseek-chat-v3": {"input": 0.14, "output": 0.28},
    "minimax/minimax-m2.7": {"input": 0.30, "output": 1.20},
    "google/gemini-2.5-flash": {"input": 0.15, "output": 0.60},
    "meta-llama/llama-3.3-70b": {"input": 0.0, "output": 0.0},
}

def calculate_cost(tokens_in, tokens_out, model):
    prices = MODEL_PRICES[model]
    input_cost = (tokens_in / 1_000_000) * prices["input"]
    output_cost = (tokens_out / 1_000_000) * prices["output"]
    return input_cost + output_cost
```
And a retroactive recalculation that walks the audit trail and recalculates every historical cost entry. The JEP audit log stores raw token counts per call, so the data was never lost — just the derived cost was wrong.
```python
async def recalculate_all_costs():
    """Walk the audit trail and recalculate cost for every logged API call."""
    entries = await get_all_audit_entries()
    corrections = 0
    total_delta = 0.0
    for entry in entries:
        old_cost = entry["cost_usd"]
        new_cost = calculate_cost(
            entry["tokens_in"],
            entry["tokens_out"],
            entry["model"],
        )
        if abs(new_cost - old_cost) > 0.001:
            await update_cost(entry["id"], new_cost)
            total_delta += new_cost - old_cost
            corrections += 1
    return {
        "entries_checked": len(entries),
        "corrections": corrections,
        "total_delta_usd": round(total_delta, 2),
    }
```
Result: 847 entries checked, 312 corrections, +$12.34 total delta.
## The deeper problem: trusting your own dashboard
This bug was invisible. The dashboard showed numbers. The numbers looked reasonable. Nobody questioned them — because cost tracking is one of those things you build once and assume works.
But there's a pattern here. In the AI agent space, the metrics you optimize against are ones you built yourself. Unlike a web application, where you can verify behavior in a browser, or a database, where you can run SELECT COUNT(*) and check the result, cost tracking in a multi-model system has no real-time external ground truth.
I only caught this because I manually compared one day's OpenRouter invoice against my dashboard. They didn't match. Then I traced it backward.
## Three rules I now follow
1. Never store a single price per model. Every LLM provider charges differently for input and output. Some (like DeepSeek) also have different rates for cache hits. If your cost tracker uses one number, it's wrong.
2. Cross-check against the provider invoice. Once a week, pull the actual bill from OpenRouter (or Anthropic, or wherever). Compare total against your tracker total. If they differ by more than 5%, you have a bug.
3. Test cost calculations with known fixtures. I added a test that sends a known prompt (fixed token count) to each model tier, checks the returned usage.prompt_tokens and usage.completion_tokens, calculates expected cost, and asserts it matches the dashboard within 1%. This runs in CI. If OpenRouter changes pricing or adds a new model — the test fails and I know before it hits production.
```python
def test_cost_accuracy_opus():
    """Known fixture: 100 tokens in, 200 tokens out on Opus."""
    expected = (100 / 1e6) * 5.00 + (200 / 1e6) * 25.00
    actual = calculate_cost(100, 200, "anthropic/claude-opus-4.6")
    assert abs(actual - expected) < 0.0001

def test_cost_accuracy_deepseek():
    """Known fixture: 100 tokens in, 200 tokens out on DeepSeek V3."""
    expected = (100 / 1e6) * 0.14 + (200 / 1e6) * 0.28
    actual = calculate_cost(100, 200, "deepseek/deepseek-chat-v3")
    assert abs(actual - expected) < 0.0001
```
Simple tests. But they would have caught this bug on day one.
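Rule 2, the invoice cross-check, is just as easy to automate. A sketch of the weekly reconciliation, where the two totals would come from your provider's billing export and your tracker's weekly sum (both hypothetical inputs here):

```python
def reconcile(invoice_total: float, tracker_total: float,
              tolerance: float = 0.05) -> bool:
    """Return True if the tracker is within 5% of the real bill."""
    if invoice_total == 0:
        return tracker_total == 0
    drift = abs(invoice_total - tracker_total) / invoice_total
    return drift <= tolerance

# The week that exposed the bug: $59 billed, $47 tracked -> ~20% drift.
reconcile(59.0, 47.0)  # False: well outside the 5% tolerance
```

Wire this into a weekly cron or CI job and the silent drift becomes a loud failure instead.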
## The uncomfortable truth about AI cost claims
Every AI platform makes cost efficiency claims. "90% cheaper than GPT-4." "Run your agents for pennies." I almost published a comparison showing MiniMax at 2.9x the efficiency of Opus, when the real number is 7.1x.
If my cost tracker was wrong, how many other platforms have the same bug? How many "cost savings" claims are based on input-price-only calculations?
If you're building with multiple LLMs and tracking costs — check your math. Specifically:
- Are you using split input/output pricing?
- Does your tracker account for cache hit discounts?
- Have you compared your internal numbers against your provider's actual invoice?
- Do you have a CI test that validates cost calculation against known token counts?
If the answer to any of these is no, your cost dashboard might be telling you a comfortable lie.
## What's next
Next week: Fixture Validation — The Silent Killer of AI Benchmarks. A benchmark that passes all assertions but never actually called the API. How it happened, and the validation framework that prevents it.
Follow the build: GitHub | kepion.app
Have you run into cost tracking bugs in multi-model setups? I'm curious — do you track costs per-call, or just check the monthly invoice? And do you split input/output pricing, or use a blended rate?
Tags: #buildinpublic #ai #agents #costoptimization