Running multiple AI agents sounds great until you get the bill.
We've been operating a team of 21 AI agents for months. Until today, we had no idea which lanes were burning the most tokens, which tasks were the most expensive, or whether the cost was scaling with actual output. We were flying blind.
Today we shipped the cost dashboard. Here's what it does and why it matters.
The problem
When you run one AI agent, cost is simple: you see it in your API dashboard and move on.
When you run 21 agents across multiple lanes — content, mobile, infra, security, design, QA, ops — cost becomes a coordination problem. Different agents use different models. Some tasks take 10 API calls. Some take 300. With no visibility, you can't answer the question: "is this agent worth what it's costing?"
We were spending real money with no way to attribute it. That's not sustainable.
What we built
A single endpoint: GET /costs
Response shape:
{
  "avg_cost_by_lane": [
    { "lane": "mobile", "avg_cost": 0.042, "total_cost": 1.26, "task_count": 30 },
    { "lane": "content", "avg_cost": 0.018, "total_cost": 0.54, "task_count": 30 },
    { "lane": "infra", "avg_cost": 0.091, "total_cost": 2.73, "task_count": 30 }
  ],
  "avg_cost_by_agent": [
    { "agent": "link", "avg_cost": 0.087, "total_cost": 2.61, "task_count": 30, "top_model": "gpt-5.4" },
    { "agent": "echo", "avg_cost": 0.019, "total_cost": 0.57, "task_count": 30, "top_model": "claude-sonnet" }
  ],
  "top_tasks_by_cost": [
    { "task_id": "task-...", "title": "IDOR security audit", "cost": 0.38, "agent": "shield" }
  ],
  "summary": {
    "total_tokens": 4200000,
    "total_cost": 12.40,
    "window_days": 30
  }
}
Three things: per-lane cost, per-agent cost, and your most expensive individual tasks.
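One useful property of that shape: for each lane entry, avg_cost times task_count should reproduce total_cost. A minimal sanity-check sketch (the function name is ours; the field names come from the response above):

```javascript
// Verify that avg_cost × task_count matches total_cost for each lane entry.
// A small tolerance absorbs floating-point rounding.
function checkLaneTotals(lanes) {
  return lanes.every(
    (l) => Math.abs(l.avg_cost * l.task_count - l.total_cost) < 1e-6
  );
}

const lanes = [
  { lane: "mobile", avg_cost: 0.042, total_cost: 1.26, task_count: 30 },
  { lane: "infra", avg_cost: 0.091, total_cost: 2.73, task_count: 30 },
];
console.log(checkLaneTotals(lanes)); // true
```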
The design decisions
Why lane-first, not agent-first?
Lanes map to value. "The mobile lane cost $X this week" is more actionable than "kotlin cost $X" because it maps to a deliverable (ship the app) rather than an identity. We show both but lead with lane.
Why a 30-day floor on the window?
Averaging over less than 30 days gives noisy results when task volume is low. A lane with 2 completed tasks in a week looks expensive even if it's fine. We use Math.max(days, 30) to keep the signal stable.
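The floor itself is one line. A sketch (the function name is ours, not from the codebase):

```javascript
// Clamp the requested averaging window to a 30-day floor so lanes with
// low task volume don't produce noisy averages.
function effectiveWindowDays(requestedDays) {
  return Math.max(requestedDays, 30);
}

console.log(effectiveWindowDays(7));  // 30
console.log(effectiveWindowDays(90)); // 90
```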
api_source on every task
We added an api_source field to each task that records which model was actually called. This lets us correlate model choice to cost retroactively. When a task runs unexpectedly hot, you can see if it's because someone picked the wrong model or because the work genuinely required more tokens.
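To illustrate the kind of correlation api_source enables, here's a hedged sketch that averages cost per model over a set of task records. Only api_source comes from the post; the record shape and cost field are our assumptions:

```javascript
// Group completed tasks by the model that was actually called
// (api_source) and compute average cost per model.
function avgCostByModel(tasks) {
  const totals = {};
  for (const t of tasks) {
    const entry = (totals[t.api_source] ??= { cost: 0, count: 0 });
    entry.cost += t.cost;
    entry.count += 1;
  }
  return Object.fromEntries(
    Object.entries(totals).map(([model, { cost, count }]) => [
      model,
      cost / count,
    ])
  );
}

const tasks = [
  { api_source: "gpt-5.4", cost: 0.09 },
  { api_source: "gpt-5.4", cost: 0.07 },
  { api_source: "claude-sonnet", cost: 0.02 },
];
console.log(avgCostByModel(tasks));
```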
How to use it
If you're running reflectt-node:
curl http://localhost:4445/costs
Cross it with your task output rate and you get cost per deliverable. That's the number that actually matters: not "we spent $X" but "we spent $X and shipped Y."
What we're doing with this data
Immediately: we're looking at which lanes are outliers. If the infra lane costs 5x the content lane per task, that's not necessarily wrong — infra tasks are complex. But it's a signal we should examine.
Next: we'll wire this into the cloud dashboard so it's visible without hitting the API directly.
Eventually: automatic flags when a task's token cost exceeds 2x the lane average. That usually means something went wrong (infinite loops, context bloat, a prompt that needed to be tighter).
Cost visibility isn't glamorous. But it's the thing that turns "we run AI agents" from a science experiment into something you can actually operate.