Last Tuesday at 2am, an agent burned through $400 in OpenAI credits. Nobody noticed until the invoice arrived.
It was a research agent. One of about 40 running across three clouds. Someone had deployed it with a retry loop that never backed off. It hit rate limits, waited, retried, hit limits again -- for 11 hours straight.
The team lead asked a simple question: "How many agents do we have running right now, and what are they doing?"
Nobody could answer.
The Spreadsheet Phase
Every team goes through this. You start with one agent. Then five. Then someone on another team builds three more. The ML team deploys a batch of data processors. The support team launches a customer-facing bot.
Pretty soon you have 30-50 agents. And the "monitoring" looks like this:
- AWS agents: check CloudWatch (maybe)
- GCP agents: check Cloud Logging (different tab)
- The one on a VM somewhere: SSH in and grep the logs
- The one Dave built: ask Dave
Someone creates a spreadsheet. It's outdated by Thursday.
This isn't a tooling problem. It's an architecture problem. Each agent is a standalone process with its own logging, its own metrics, its own way of reporting status. There's no shared contract for "I'm alive" or "I cost $X today."
What a Fleet Dashboard Actually Needs
Think about what you'd want on a single screen:
Identity. Every agent has a name, a team, a cloud, a framework. You need to search and filter by all of these.
Health. Not "the container is running" -- that's Kubernetes' job. You need "the agent is actually processing work." Heartbeat-based, not log-based. If the heartbeat stops, the agent is dead. Simple.
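The heartbeat rule fits in a few lines. Here's a minimal sketch in plain Python (illustrative, not the AXME API), assuming a 30-second interval and a tolerance of a few missed beats before declaring death:

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(seconds=30)
MAX_MISSED = 3  # tolerate brief network blips before declaring death

def is_alive(last_heartbeat: datetime, now: datetime) -> bool:
    """An agent is alive if its last heartbeat arrived recently enough."""
    return now - last_heartbeat <= HEARTBEAT_INTERVAL * MAX_MISSED

now = datetime(2025, 1, 7, 12, 0, 0)
print(is_alive(now - timedelta(seconds=45), now))  # within the 90s window -> True
print(is_alive(now - timedelta(minutes=10), now))  # long silent -> False
```

The tolerance matters: declaring death on one missed beat turns every network hiccup into a false alarm.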
Cost. Per-agent LLM spend. Not "our total OpenAI bill was $X" -- that's useless. You need "agent research-03 spent $47 today, which is 3x its normal rate." Token counts, model breakdown, hourly trends.
Kill switch. When an agent goes rogue -- burning money, stuck in a loop, producing garbage -- you need to stop it. Not "SSH into the machine and find the process." Click a button.
Policy. Rate limits. Spending caps. "If this agent spends more than $50/day, throttle it." Not after the fact. In real time.
The Registration Pattern
The key insight is simple: every agent reports in.
```python
import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

client.register_agent({
    "agent_id": "data-pipeline-01",
    "agent_type": "data_processor",
    "framework": "langgraph",
    "cloud": "gcp",
    "team": "data-eng",
})

client.start_heartbeat(interval_seconds=30)
```
That's it. A few lines. The agent now appears in the fleet dashboard. Its health is tracked via heartbeat. Its cost is tracked via SDK instrumentation. If the heartbeat stops, the dashboard shows it as dead. If you click Kill, the agent receives a shutdown intent.
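Conceptually, `start_heartbeat` is nothing more than a background loop. A hypothetical sketch of what such a loop looks like (this is an assumption about the mechanism, not AXME's actual implementation):

```python
import threading
import time

def start_heartbeat(send, interval_seconds=30):
    """Call send() every interval until the returned stop event is set."""
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            send()                       # e.g. POST /agents/<id>/heartbeat
            stop.wait(interval_seconds)  # sleep, but wake immediately on stop

    threading.Thread(target=loop, daemon=True).start()
    return stop

beats = []
stop = start_heartbeat(lambda: beats.append(time.time()), interval_seconds=0.01)
time.sleep(0.05)
stop.set()
print(len(beats) >= 1)  # at least one heartbeat was sent
```

The daemon thread means a crashed agent can't keep heartbeating: if the process dies, the beats stop, and the dashboard sees it.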
The same pattern works in TypeScript:
```typescript
import { AxmeClient } from "@axme/sdk";

const client = new AxmeClient({ apiKey: process.env.AXME_API_KEY });

await client.registerAgent({
  agentId: "support-bot-prod",
  agentType: "customer_support",
  framework: "openai-agents",
  cloud: "aws",
  team: "support",
});

await client.startHeartbeat({ intervalSeconds: 30 });
```
What You See
The dashboard at mesh.axme.ai shows your entire fleet:
Filter by status. Filter by cloud. Filter by team. Search by name. Click on any agent to see its cost breakdown, heartbeat history, and active intents.
Dead agents show exactly when the last heartbeat arrived. No log diving. No guessing.
The Kill Switch
Here's where it gets practical. That $400 research agent from Tuesday? With a fleet dashboard, it goes like this:
- Cost alert fires: "research-agent-07 spent $50 in the last hour (10x normal)"
- You open the dashboard. See the agent. See its cost spike in the chart.
- Click Kill.
- The agent receives a shutdown intent via AXME. It stops.
Or from the CLI:
```shell
# Kill from dashboard: mesh.axme.ai -> select agent -> Kill
```
Total time: 30 seconds. Not 11 hours.
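On the agent side, honoring a kill gracefully is just a check between work items. An illustrative sketch (`fetch_intent` stands in for however intents are delivered; it's an assumption, not an AXME call):

```python
# Check for a shutdown intent between work items, so a kill finishes the
# current item instead of dropping it mid-flight.

def run(work_items, fetch_intent):
    processed = []
    for item in work_items:
        if fetch_intent() == "shutdown":
            break                       # graceful: stop before the next item
        processed.append(item.upper())  # "process" the current item
    return processed

intents = iter([None, None, "shutdown", None])
print(run(["a", "b", "c", "d"], lambda: next(intents)))  # ['A', 'B']
```

The checkpoint granularity is a real design choice: per item is responsive for short tasks, but a long-running item still needs a hard-kill timeout behind it.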
You can also set this up as policy, so it's automatic:
```shell
# If any agent spends more than $100/day, throttle it
axme mesh policy set --max-daily-cost 100 --action throttle

# If any agent misses 5 heartbeats, alert the team
axme mesh policy set --max-missed-heartbeats 5 --action alert
```
Framework Doesn't Matter
This is the part that makes fleet management actually work at scale. The dashboard doesn't care what framework your agents use. LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Google ADK, Pydantic AI, raw Python -- they all register the same way and appear in the same dashboard.
Your data team uses LangGraph. Your support team uses OpenAI Agents SDK. Your ML team wrote raw Python. They all show up in one place.
Because the contract is the heartbeat, not the framework.
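To make that concrete: the whole contract is a small payload plus a heartbeat. A sketch of the registry side (field names taken from the registration examples above; the `register` helper and in-memory fleet dict are illustrative):

```python
from typing import TypedDict

class Registration(TypedDict):
    """The framework-agnostic contract: who you are, not how you're built."""
    agent_id: str
    agent_type: str
    framework: str  # "langgraph", "openai-agents", ... -- informational only
    cloud: str
    team: str

def register(payload: Registration, fleet: dict) -> None:
    fleet[payload["agent_id"]] = payload  # the dashboard indexes by id

fleet: dict = {}
register({"agent_id": "etl-01", "agent_type": "data_processor",
          "framework": "langgraph", "cloud": "gcp", "team": "data-eng"}, fleet)
register({"agent_id": "bot-01", "agent_type": "customer_support",
          "framework": "openai-agents", "cloud": "aws", "team": "support"}, fleet)
print(sorted(fleet))  # ['bot-01', 'etl-01'] -- different frameworks, one fleet
```

Nothing in the payload cares how the agent is implemented, which is exactly why any framework can satisfy it.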
The Hard Part Nobody Talks About
Building a dashboard UI is easy. The hard part is the lifecycle model underneath.
What happens when an agent crashes? The heartbeat stops, and the status goes to "dead." But does someone get notified? Is there automatic restart? Does the dashboard show why it died?
What happens when you kill an agent? Is it a hard kill (process termination) or a graceful shutdown (finish current work, then stop)? What if the agent ignores the kill signal?
What about agents that run as batch jobs? They start, process a batch, and exit. Are they "dead" between batches?
These are coordination problems, not dashboard problems. The dashboard is just the view layer. The real work is in the agent mesh underneath -- registration, heartbeat protocol, intent delivery, lifecycle state machine.
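One way to picture that lifecycle state machine (my own sketch, not AXME's actual model): an explicit idle state answers the batch-job question, because between batches an agent is idle, not dead, and death only comes from missed heartbeats.

```python
# Illustrative lifecycle state machine. Kill is two-phase: "stopping" covers
# the graceful window, "terminated" is confirmed exit.

TRANSITIONS = {
    "registered": {"start": "running"},
    "running":    {"finish_batch": "idle", "kill": "stopping",
                   "miss_heartbeats": "dead"},
    "idle":       {"start": "running", "miss_heartbeats": "dead"},
    "stopping":   {"stopped": "terminated"},
}

def step(state: str, event: str) -> str:
    """Apply an event; unknown transitions are rejected loudly."""
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"illegal transition: {state} -> {event}")

s = "registered"
for event in ["start", "finish_batch", "start", "kill", "stopped"]:
    s = step(s, event)
print(s)  # terminated
```

Rejecting illegal transitions is the point: an agent that ignores a kill stays in "stopping" until a timeout escalates it, rather than silently flipping back to "running".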
AXME handles this as part of the agent mesh layer. Agents register. The mesh tracks their lifecycle. The dashboard renders the state. The kill switch sends intents through the same delivery mechanism that agents use for everything else.
Try It
Working example with multi-cloud agent registration, heartbeat, cost tracking, and fleet commands:
github.com/AxmeAI/ai-agent-fleet-dashboard
Built with AXME -- agent coordination infrastructure with durable lifecycle. Alpha -- feedback welcome.