George Belsky

You Deployed 30 AI Agents. Can You Answer These 5 Questions About Them?

Your company has 30 AI agents in production. The data analyst agent runs SQL queries. The report generator writes weekly summaries. The code reviewer comments on PRs. The customer support agent handles tickets.

They all work. Individually.

Now answer these five questions:

  1. Which agents are running right now?
  2. How much has each agent spent today?
  3. Has any agent used a tool it shouldn't have?
  4. Can you shut down a specific agent in under 10 seconds?
  5. What did each agent do in the last 24 hours?

If you can't answer all five, you don't have governance. You have 30 independent processes running in the dark.

Why This Matters at Agent #10

Teams with 1-3 agents don't feel this pain. You know where they run. You check the OpenAI dashboard manually. You grep the logs when something breaks.

At 10 agents, cracks appear. An agent starts burning tokens on a loop. You don't notice for 3 hours. The monthly bill spikes. Nobody knows which agent caused it.

At 30 agents, it's chaos. Different teams own different agents. Different frameworks (LangGraph, CrewAI, AutoGen). Different models (GPT-4o, Claude, Gemini). Different machines. The report-writing agent has access to the delete_table function because nobody set up tool permissions. The code reviewer agent hit a bug and has been retrying the same API call for 6 hours.

This is the governance gap. The agents work. Nobody governs them.

What Governance Actually Looks Like

Governance for AI agents is not a single feature. It's five capabilities working together:

1. Agent Registry

Every agent registers with metadata: what team owns it, what framework it uses, what model it runs, what environment it's deployed in.

import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

client.send_intent({
    "intent_type": "intent.governance.register_agent.v1",
    "to_agent": "agent://myorg/production/data-analyst",
    "payload": {
        "agent_address": "data-analyst",
        "display_name": "Data Analyst Agent",
        "metadata": {
            "team": "analytics",
            "framework": "langchain",
            "model": "gpt-4o",
            "environment": "production",
        },
        "policies": {
            "cost_cap_usd": 50.0,
            "allowed_tools": ["sql_query", "chart_generate", "export_csv"],
            "require_approval_above_usd": 25.0,
        },
    },
})

Now you have an inventory. You know what's deployed, who owns it, and what rules it follows.

2. Health Monitoring

Every agent sends heartbeats. If an agent misses 3 heartbeats, it's flagged as unhealthy. No more discovering failures from customer complaints.

client.send_intent({
    "intent_type": "intent.governance.heartbeat.v1",
    "to_agent": "agent://myorg/governance/monitor",
    "payload": {
        "agent_address": "data-analyst",
        "status": "healthy",
        "metrics": {
            "requests_total": 142,
            "avg_latency_ms": 1200,
            "cost_usd": 12.50,
            "memory_mb": 312,
        },
    },
})
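The monitor side of the "3 missed heartbeats" rule is simple arithmetic. A minimal sketch, assuming a 30-second heartbeat interval (the interval is not specified by the platform; both constants here are illustrative):

```python
MISSED_LIMIT = 3    # flag unhealthy after 3 missed heartbeats
INTERVAL_S = 30.0   # assumed heartbeat interval in seconds

def agent_status(last_heartbeat_ts: float, now: float) -> str:
    """Return 'healthy' or 'unhealthy' from the time since the last heartbeat."""
    missed = int((now - last_heartbeat_ts) // INTERVAL_S)
    return "unhealthy" if missed >= MISSED_LIMIT else "healthy"
```

The point is that the check lives in the governance platform, not in the agent: an agent that has crashed or hung can't report its own failure.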

3. Cost Caps and Tool Permissions

Each agent has a cost cap and a tool allowlist. The policy enforcer watches heartbeats and blocks violations in real time.

  • Data analyst: $50/day cap, can only use sql_query, chart_generate, export_csv
  • Report generator: $30/day cap, can only use read_file, write_report, send_email
  • Code reviewer: $100/day cap, can only use read_repo, post_comment, approve_pr

When the report generator tries to call delete_table: blocked, logged, alert sent. When the code reviewer hits $80 of its $100 cap: warning. When it hits $100: kill switch.
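The enforcement logic described above can be sketched as a single pure function. This is not the AXME implementation, just a minimal illustration of the block / warn / kill decision, with the 80% warning threshold assumed from the $80-of-$100 example:

```python
def check_action(policy: dict, tool: str, spent_usd: float, cost_usd: float) -> str:
    """Evaluate one proposed tool call against an agent's policy.

    Returns 'block' for a disallowed tool, 'kill' when the projected spend
    hits the cap, 'warn' above 80% of the cap, else 'allow'.
    """
    if tool not in policy["allowed_tools"]:
        return "block"
    projected = spent_usd + cost_usd
    if projected >= policy["cost_cap_usd"]:
        return "kill"
    if projected >= 0.8 * policy["cost_cap_usd"]:
        return "warn"
    return "allow"

reviewer_policy = {
    "cost_cap_usd": 100.0,
    "allowed_tools": ["read_repo", "post_comment", "approve_pr"],
}
```

Because the check runs before the tool call, a violation costs nothing: the call never reaches the model or the tool.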

4. Kill Switch

One command shuts down a single agent or the entire fleet.

# Kill one agent
python kill_switch.py --agent data-analyst --reason "cost cap exceeded"

# Kill everything
python kill_switch.py --all --reason "security incident"

The kill intent is durable: if the agent is temporarily unreachable, the intent waits in the platform and is delivered when the agent reconnects. You don't need SSH access. You don't need to find the PID. You don't need to know which machine the agent is on.
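On the agent side, honoring a durable kill intent can be as small as an intent dispatcher that flips a shutdown flag the work loop checks. A sketch, assuming a hypothetical `intent.governance.kill.v1` intent type (the receive-side API and intent name are illustrative, not documented AXME behavior):

```python
import threading

shutdown = threading.Event()

def handle_intent(intent: dict) -> None:
    """Dispatch one inbound governance intent (hypothetical intent shape)."""
    if intent.get("intent_type") == "intent.governance.kill.v1":
        reason = intent.get("payload", {}).get("reason", "unspecified")
        print(f"kill intent received: {reason}")
        shutdown.set()

def work_loop() -> None:
    while not shutdown.is_set():
        pass  # normal agent work goes here
```

Because the flag is checked between work items, the agent finishes or abandons its current step and exits cleanly instead of being `kill -9`'d mid-write.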

5. Audit Trail

Every governance event is logged: registrations, heartbeats, policy violations, tool blocks, kill switch activations. When the CEO asks "what happened yesterday?", you have the answer.

[2026-03-31T14:20:12Z] cost_warning
  Agent:  gov-report-generator
  Cost:   $24.50 / $30.00

[2026-03-31T14:21:45Z] tool_blocked
  Agent:  gov-data-analyst
  Tool:   delete_table
  Allowed: ['sql_query', 'chart_generate', 'export_csv']

[2026-03-31T14:22:08Z] kill_switch_activated
  Agents: [data-analyst, report-generator, code-reviewer]
  Reason: security incident
  Operator: admin

The Dashboard

All five capabilities feed into a real-time fleet dashboard at mesh.axme.ai:

Agent Mesh Dashboard

Health, cost, latency, policy compliance - all in one view. No spreadsheets. No log parsing. No monthly invoice surprises.

Policies - cost caps, tool permissions, rate limits - are managed from the same interface:

Policies

What This Replaces

Without a governance platform, teams build these pieces ad hoc:

  • Health monitoring: custom cron job pinging each agent
  • Cost tracking: parse OpenAI/Anthropic invoices at month end
  • Tool permissions: trust that developers configured it correctly
  • Kill switch: SSH into the server, find the PID, kill -9
  • Audit trail: grep CloudWatch logs across 12 services
  • Dashboard: spreadsheet updated weekly by hand

That's 6 systems, built separately, maintained by different teams, with no shared view. AXME replaces all of it with one governance layer.

Framework-Agnostic

This works with any agent framework. AXME governance wraps around your existing agents - it doesn't replace them.

Your LangGraph agent keeps its graph. Your CrewAI crew keeps its tasks. Your AutoGen agents keep their conversations. AXME adds the governance layer on top: register, heartbeat, obey policies, accept kill switch.

The agents don't need to know about each other. The governance platform knows about all of them.
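One way to picture "wraps around your existing agents": a decorator that gates every tool call through a policy check before handing off to the unmodified agent function. This is an illustrative pattern, not the AXME SDK; the names are hypothetical:

```python
import functools

def governed(agent_address: str, allowed_tools: list):
    """Wrap an existing agent entry point with a tool-allowlist check.

    In a real deployment the allowlist would come from the governance
    platform; here it's passed in directly for illustration.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(tool: str, *args, **kwargs):
            if tool not in allowed_tools:
                raise PermissionError(f"{agent_address}: tool '{tool}' blocked")
            return fn(tool, *args, **kwargs)
        return wrapper
    return decorator

@governed("data-analyst", ["sql_query", "chart_generate", "export_csv"])
def run_tool(tool: str, payload: str) -> str:
    # The existing agent logic is untouched; only the entry point is wrapped.
    return f"ran {tool}"
```

The inner function never changes, which is why this works the same whether the agent is built on LangGraph, CrewAI, or AutoGen.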

Try It

Full working example with fleet registration, heartbeat monitoring, policy enforcement, kill switch, audit trail, and dashboard:

github.com/AxmeAI/ai-agent-governance-platform

Built with AXME - governance and coordination infrastructure for production AI agents. Alpha - feedback welcome.
