<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: George Belsky</title>
    <description>The latest articles on DEV Community by George Belsky (@george_belsky).</description>
    <link>https://dev.to/george_belsky</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1520513%2F7f48acb5-5b87-4565-ab84-bab911962b98.jpg</url>
      <title>DEV Community: George Belsky</title>
      <link>https://dev.to/george_belsky</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/george_belsky"/>
    <language>en</language>
    <item>
      <title>Your AI Agent Crashed at Step 47. Why Isn't Crash Recovery the Default?</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:11:41 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-crashed-at-step-47-why-isnt-crash-recovery-the-default-d95</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-crashed-at-step-47-why-isnt-crash-recovery-the-default-d95</guid>
      <description>&lt;p&gt;Your agent is running a 50-step data pipeline. Extract, validate, transform, deduplicate, load. 25 minutes in.&lt;/p&gt;

&lt;p&gt;Step 47. OOM killed. Process gone. 25 minutes of work gone.&lt;/p&gt;

&lt;p&gt;You restart the agent. It starts from step 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "You Should Have Configured It" Problem
&lt;/h2&gt;

&lt;p&gt;Every framework has an answer for this. And every answer is the same: you should have set it up before the crash.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# LangGraph - opt-in persistence
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.postgres&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PostgresSaver&lt;/span&gt;

&lt;span class="n"&gt;checkpointer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PostgresSaver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_conn_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DB_URI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# forgot this? start over.
&lt;/span&gt;
&lt;span class="c1"&gt;# CrewAI - limited state management
# "Failures typically require restart"
&lt;/span&gt;
&lt;span class="c1"&gt;# Swarm - no persistence at all
# State exists only in memory
&lt;/span&gt;
&lt;span class="c1"&gt;# Raw Python - hope you wrote your own
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pattern is consistent: durability is an add-on. Something you bolt on after you build the agent. Something you forget until the first crash.&lt;/p&gt;

&lt;p&gt;And the checkpoint code is never simple. With LangGraph's PostgresSaver you also manage database connections, schema migrations when LangGraph updates, cleanup of old checkpoints, serialization errors when state objects change shape, and resume logic. That's 30-50 lines of infrastructure code unrelated to what your agent actually does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Durability Should Be the Default
&lt;/h2&gt;

&lt;p&gt;Think about how you use Stripe. You don't write checkpoint code in case your server crashes mid-payment. Stripe handles it - idempotency keys, retry logic, durable state on their side.&lt;/p&gt;
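
&lt;p&gt;The idempotency-key half of that fits in a few lines (a toy illustration of the pattern, not Stripe's implementation - the key and amounts are made up):&lt;/p&gt;

```python
# The idempotency-key pattern: a retried request after a crash returns
# the stored result instead of doing the work (and charging) twice.
processed = {}

def charge(idempotency_key, amount_cents):
    if idempotency_key in processed:
        return processed[idempotency_key]  # replay-safe: no double charge
    result = {"charged": amount_cents}     # the real work runs once
    processed[idempotency_key] = result
    return result

charge("order-1042", 9900)
print(charge("order-1042", 9900))  # {'charged': 9900} - still charged once
```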

&lt;p&gt;Agent operations are the exception. The one place where durability is still opt-in. Still your problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Stateless, Platform Stateful
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;intent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_intent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.pipeline.process.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/data-pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etl-customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;load&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No &lt;code&gt;PostgresSaver&lt;/code&gt;. No checkpoint database. No serialization code.&lt;/p&gt;

&lt;p&gt;The state lives in the platform. The agent is stateless. When the agent crashes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The intent stays at its current state in PostgreSQL&lt;/li&gt;
&lt;li&gt;The agent restarts (Cloud Run, Kubernetes, whatever)&lt;/li&gt;
&lt;li&gt;The platform redelivers the intent&lt;/li&gt;
&lt;li&gt;The agent resumes from where it stopped&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Up to 3 delivery attempts by default. Configurable per intent type.&lt;/p&gt;
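
&lt;p&gt;In handler terms, resume-on-redelivery looks roughly like this (illustrative Python, not the AXME API - the function name and checkpoint shape are assumptions):&lt;/p&gt;

```python
# The platform records the last step an agent completed; on redelivery
# the stateless handler skips everything already done.
STEPS = ["extract", "validate", "transform", "deduplicate", "load"]

def handle_redelivery(steps, last_completed):
    # last_completed is what the platform recorded before the crash
    # (None if no step ever finished).
    if last_completed is None:
        return list(steps)
    return steps[steps.index(last_completed) + 1:]

# Crashed after "transform"? The second delivery runs only what is left.
print(handle_redelivery(STEPS, "transform"))  # ['deduplicate', 'load']
```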

&lt;h2&gt;
  
  
  The Real Cost of Opt-In Durability
&lt;/h2&gt;

&lt;p&gt;It's not just the code. It's the incidents. The agent that crashed at record 98k of 100k and started over. The deployment pipeline that failed at step 9, re-ran all 10, and double-deployed services 1 through 9. The enrichment job that crashed and hit the same API 50,000 times on restart.&lt;/p&gt;

&lt;p&gt;These happen not because teams are careless, but because they were busy building the product and hadn't gotten to the checkpoint code yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;LangGraph&lt;/th&gt;
&lt;th&gt;CrewAI&lt;/th&gt;
&lt;th&gt;AXME&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Durability&lt;/td&gt;
&lt;td&gt;Opt-in (PostgresSaver)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Checkpoint code&lt;/td&gt;
&lt;td&gt;30-50 lines&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB management&lt;/td&gt;
&lt;td&gt;You operate&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resume after crash&lt;/td&gt;
&lt;td&gt;From last checkpoint&lt;/td&gt;
&lt;td&gt;Start over&lt;/td&gt;
&lt;td&gt;Automatic redelivery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-machine&lt;/td&gt;
&lt;td&gt;No (state is local)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (state in platform)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework lock-in&lt;/td&gt;
&lt;td&gt;LangGraph only&lt;/td&gt;
&lt;td&gt;CrewAI only&lt;/td&gt;
&lt;td&gt;Any framework&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - submit a multi-step pipeline, kill the agent mid-processing, restart it, watch it resume automatically:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-checkpoint-and-resume" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-checkpoint-and-resume&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - durable execution for agent operations. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>durability</category>
    </item>
    <item>
      <title>How to Stop a Rogue AI Agent in Production</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:11:16 +0000</pubDate>
      <link>https://dev.to/george_belsky/how-to-stop-a-rogue-ai-agent-in-production-1a4f</link>
      <guid>https://dev.to/george_belsky/how-to-stop-a-rogue-ai-agent-in-production-1a4f</guid>
      <description>&lt;p&gt;It's 3am. Your on-call phone rings. The deployment agent you launched before leaving the office has been running for 6 hours. It was supposed to deploy 3 services. It has deployed 47.&lt;/p&gt;

&lt;p&gt;You open your laptop. The agent is running on 4 Cloud Run instances. You have no way to stop it remotely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "Just Kill the Process" Doesn't Work
&lt;/h2&gt;

&lt;p&gt;Production agents are not local scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They run on managed infrastructure.&lt;/strong&gt; Cloud Run, Kubernetes, Lambda. There is no PID to kill. You can scale the service to zero, but pending requests keep executing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They run on multiple instances.&lt;/strong&gt; Your auto-scaler gave you 4 replicas. You kill one, three keep going. You need to find and kill each one individually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There's no coordination.&lt;/strong&gt; Each instance runs independently. There's no shared "stop" signal they all check.&lt;/p&gt;

&lt;p&gt;So you scramble. Delete the Cloud Run service. Wait 60 seconds for drain. Lose all state about what was deployed and what wasn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Firewall Model
&lt;/h2&gt;

&lt;p&gt;The solution is the one networks arrived at decades ago: a chokepoint.&lt;/p&gt;

&lt;p&gt;Every network packet goes through a firewall. The firewall can block traffic instantly, regardless of what the source is doing. Agent traffic works the same way when you route it through a gateway. Every intent goes through one point. Block it there, and the agent stops - even if the code has a bug, even if there are 50 instances.&lt;/p&gt;

&lt;p&gt;When you kill an agent through the AXME gateway, all inbound intents to that agent are rejected (403) and all outbound intents from it are blocked. Even if the agent process is still running, it cannot send or receive anything through the gateway. The kill is enforced at the infrastructure level - the agent code does not need to cooperate.&lt;/p&gt;
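
&lt;p&gt;The chokepoint model fits in a few lines (a toy model, not the AXME API - only the 403 behavior mirrors the description above):&lt;/p&gt;

```python
# Every intent crosses one chokepoint; killing an agent flips a flag on
# the gateway side, which cuts off all of its instances at once.
killed = set()

def kill(agent_id):
    killed.add(agent_id)  # one call, every replica affected

def route_intent(from_agent, to_agent):
    # Enforced at the chokepoint - the agent code never has to cooperate.
    if from_agent in killed or to_agent in killed:
        return 403  # inbound and outbound intents both rejected
    return 200

kill("agent://myorg/production/deployer")
print(route_intent("agent://myorg/production/deployer", "agent://other"))  # 403
```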

&lt;h2&gt;
  
  
  Health Monitoring and Policies
&lt;/h2&gt;

&lt;p&gt;A kill switch is reactive. Policies are proactive.&lt;/p&gt;

&lt;p&gt;You can set these policies ahead of time: cost ceilings, intent rate limits, allowed action types per agent. If the deployment agent crosses $50 in API costs or sends more than 500 intents per hour, the gateway kills it automatically. No 3am phone call.&lt;/p&gt;
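
&lt;p&gt;An illustrative sketch of such a check, using those two thresholds (the function shape and defaults are assumptions, not the AXME API):&lt;/p&gt;

```python
# The gateway would evaluate something of this shape on every intent
# and auto-kill the agent when a limit is crossed.
def violates_policy(cost_usd, intents_last_hour,
                    max_cost=50.0, max_intents_per_hour=500):
    return cost_usd > max_cost or intents_last_hour > max_intents_per_hour

print(violates_policy(cost_usd=12.40, intents_last_hour=620))  # True
```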

&lt;p&gt;This pairs with heartbeat monitoring: live health status, cost tracking per agent, automatic alerting when an agent goes stale, and a full audit trail for every kill, resume, and policy change.&lt;/p&gt;
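
&lt;p&gt;The staleness check behind that alerting is tiny (illustrative - the 60-second window and function shape are assumptions):&lt;/p&gt;

```python
# An agent is "stale" when its last heartbeat is older than the allowed
# window; monitoring fires an alert instead of a 3am phone call.
from datetime import datetime, timedelta

def is_stale(last_heartbeat, now, max_age=timedelta(seconds=60)):
    return now - last_heartbeat > max_age

now = datetime(2026, 4, 8, 3, 0, 0)
print(is_stale(now - timedelta(minutes=5), now))   # True: alert fires
print(is_stale(now - timedelta(seconds=10), now))  # False: agent healthy
```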

&lt;h2&gt;
  
  
  Resume After Fix
&lt;/h2&gt;

&lt;p&gt;After you figure out what went wrong and fix the config, you resume the agent through the gateway. Health status resets, intents flow again, the platform redelivers any pending work.&lt;/p&gt;

&lt;h2&gt;
  
  
  DIY vs. Gateway Enforcement
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Kill the process&lt;/th&gt;
&lt;th&gt;Redis flag&lt;/th&gt;
&lt;th&gt;AXME Mesh&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-instance&lt;/td&gt;
&lt;td&gt;Kill each one&lt;/td&gt;
&lt;td&gt;Agents must poll&lt;/td&gt;
&lt;td&gt;One API call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Buggy agent&lt;/td&gt;
&lt;td&gt;No cooperation possible&lt;/td&gt;
&lt;td&gt;Must check flag&lt;/td&gt;
&lt;td&gt;Gateway-enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Response time&lt;/td&gt;
&lt;td&gt;30-60s (drain)&lt;/td&gt;
&lt;td&gt;Depends on poll interval&lt;/td&gt;
&lt;td&gt;Under 1 second&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State preservation&lt;/td&gt;
&lt;td&gt;Lost&lt;/td&gt;
&lt;td&gt;Custom checkpoint&lt;/td&gt;
&lt;td&gt;Durable in platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;CloudWatch logs maybe&lt;/td&gt;
&lt;td&gt;Custom logging&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Policies (auto-kill)&lt;/td&gt;
&lt;td&gt;Build it&lt;/td&gt;
&lt;td&gt;Build it&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - simulate a rogue agent, kill it remotely, resume after fix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-kill-switch" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-kill-switch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - agent mesh with kill switch, policies, and health monitoring. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>How to Add Human Approval to AI Agent Workflows Without Building It Yourself</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:06:18 +0000</pubDate>
      <link>https://dev.to/george_belsky/how-to-add-human-approval-to-ai-agent-workflows-without-building-it-yourself-3ld6</link>
      <guid>https://dev.to/george_belsky/how-to-add-human-approval-to-ai-agent-workflows-without-building-it-yourself-3ld6</guid>
      <description>&lt;p&gt;Your AI agent generates a quarterly financial report. Before it emails the board, a human needs to review it. Simple requirement.&lt;/p&gt;

&lt;p&gt;Here's what you actually have to build.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DIY Approach
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;smtplib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;apscheduler.schedulers.background&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BackgroundScheduler&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;request_approval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reviewer_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# 1. Generate approval token
&lt;/span&gt;    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;token_urlsafe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;approvals&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Send notification
&lt;/span&gt;    &lt;span class="nf"&gt;send_slack_message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reviewer_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approval needed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Approve: https://your-app.com/approve/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reject: https://your-app.com/reject/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Schedule reminder (5 min)
&lt;/span&gt;    &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;send_reminder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;run_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reviewer_email&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# 4. Schedule escalation (30 min)
&lt;/span&gt;    &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;escalate_to_backup&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;run_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;get_backup_reviewer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reviewer_email&lt;/span&gt;&lt;span class="p"&gt;)])&lt;/span&gt;

    &lt;span class="c1"&gt;# 5. Schedule timeout (8 hours)
&lt;/span&gt;    &lt;span class="n"&gt;scheduler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;handle_timeout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;run_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;

&lt;span class="c1"&gt;# Plus you need:
# - Webhook endpoint for approve/reject callbacks
# - Token validation and expiry
# - Polling loop or callback for the agent to resume
# - Audit logging (who approved, when, what context)
# - DB cleanup for expired tokens
# - Error handling for failed notifications
# - Unit tests for all of the above
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's about 200 lines before error handling. You also need a web server for the webhook, a scheduler process that stays alive, and a database for approval state.&lt;/p&gt;

&lt;p&gt;All you wanted was "pause and wait for a human."&lt;/p&gt;

&lt;h2&gt;
  
  
  The 4-Line Version
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;intent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_intent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.report.review_approval.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/report-generator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;report&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Q1 Financial Summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pii_detected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reviewer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cfo@company.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The platform handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Notification&lt;/strong&gt; - Slack, email, CLI. Reviewer gets notified immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reminders&lt;/strong&gt; - Configurable intervals. Default: 5 min, then 30 min.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Escalation&lt;/strong&gt; - Reviewer A does not respond? Escalate to reviewer B, then to the team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeout&lt;/strong&gt; - Graceful timeout with configurable fallback action.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail&lt;/strong&gt; - Who approved, when, with what context. Stored durably.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durable state&lt;/strong&gt; - Agent crashes? Restarts? The approval state is in PostgreSQL, not in process memory.&lt;/li&gt;
&lt;/ul&gt;
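
&lt;p&gt;The timeline the platform manages can be sketched from those defaults (illustrative only - as noted above, the intervals are configurable):&lt;/p&gt;

```python
# Remind at 5 min, escalate at 30 min, time out at 8 hours - the same
# schedule the DIY version wired up by hand with a scheduler process.
from datetime import datetime, timedelta

def approval_schedule(requested_at):
    return {
        "remind": requested_at + timedelta(minutes=5),
        "escalate": requested_at + timedelta(minutes=30),
        "timeout": requested_at + timedelta(hours=8),
    }

t0 = datetime(2026, 4, 8, 9, 0)
print(approval_schedule(t0)["escalate"])  # 2026-04-08 09:30:00
```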

&lt;h2&gt;
  
  
  What Matters in Production
&lt;/h2&gt;

&lt;p&gt;The demo version of human approval is always simple. &lt;code&gt;input("Approve? y/n")&lt;/code&gt;. The production version is where things break.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Production concern&lt;/th&gt;
&lt;th&gt;DIY&lt;/th&gt;
&lt;th&gt;AXME&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Human is on vacation&lt;/td&gt;
&lt;td&gt;Build escalation chain&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent crashes while waiting&lt;/td&gt;
&lt;td&gt;Lost approval state&lt;/td&gt;
&lt;td&gt;Durable in DB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Two approvals needed&lt;/td&gt;
&lt;td&gt;Build chaining logic&lt;/td&gt;
&lt;td&gt;Approval chains&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit for compliance&lt;/td&gt;
&lt;td&gt;Custom logging&lt;/td&gt;
&lt;td&gt;Built-in event log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reminder if no response in 5 min&lt;/td&gt;
&lt;td&gt;Scheduler + cron job&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mobile-friendly approval&lt;/td&gt;
&lt;td&gt;Build a UI&lt;/td&gt;
&lt;td&gt;Slack/email/CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Works With Any Framework
&lt;/h2&gt;

&lt;p&gt;This is not framework-specific. Your agent can be built with LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Google ADK, or raw Python. The approval layer sits outside your agent code.&lt;/p&gt;

&lt;p&gt;The agent framework handles reasoning. The coordination layer handles waiting for humans.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - agent generates a report, pauses for human review, resumes after approval:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/async-human-approval-for-ai-agents" rel="noopener noreferrer"&gt;github.com/AxmeAI/async-human-approval-for-ai-agents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - human approval for AI agent workflows, built in. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>humanintheloop</category>
    </item>
    <item>
      <title>Temporal Alternative Without the Cluster and Determinism Constraints</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Wed, 08 Apr 2026 10:05:54 +0000</pubDate>
      <link>https://dev.to/george_belsky/temporal-alternative-without-the-cluster-and-determinism-constraints-38dj</link>
      <guid>https://dev.to/george_belsky/temporal-alternative-without-the-cluster-and-determinism-constraints-38dj</guid>
      <description>&lt;p&gt;Temporal is the gold standard for durable execution. If you need long-running workflows that survive crashes, it's the first thing most teams evaluate.&lt;/p&gt;

&lt;p&gt;But then you read the docs. And you discover what Temporal actually requires.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cluster Problem
&lt;/h2&gt;

&lt;p&gt;Temporal needs a cluster. Either you run it yourself (Temporal Server + Cassandra/PostgreSQL + Elasticsearch) or you pay for Temporal Cloud.&lt;/p&gt;

&lt;p&gt;Self-hosted means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temporal Server (3+ nodes for HA)&lt;/li&gt;
&lt;li&gt;Cassandra or PostgreSQL for persistence&lt;/li&gt;
&lt;li&gt;Elasticsearch for visibility&lt;/li&gt;
&lt;li&gt;Monitoring, upgrades, schema migrations&lt;/li&gt;
&lt;li&gt;A team that understands Temporal internals when something breaks at 2am&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is fine if you're Uber. If you're a team of 5 building an AI agent pipeline, it's a lot of infrastructure for "I want my workflow to survive a crash."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Determinism Problem
&lt;/h2&gt;

&lt;p&gt;Temporal replays your workflow code on every restart. This means your workflow functions must be deterministic. No side effects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# These all break Temporal workflows:
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;
&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# non-deterministic
&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;                &lt;span class="c1"&gt;# different on replay
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# side effect
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every developer on the team needs to learn this. New hire writes &lt;code&gt;datetime.now()&lt;/code&gt; in a workflow, the replay breaks in production, and nobody understands why until someone reads the Temporal determinism docs.&lt;/p&gt;

&lt;p&gt;Activities solve this - you put non-deterministic code in activities. But that means restructuring your code around Temporal's execution model. Your agent code now has to know it's running inside Temporal.&lt;/p&gt;

&lt;h2&gt;
  
  
  What If You Just Didn't
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;intent_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_intent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.pipeline.process.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/data-pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;validate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;load&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres-main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;destination&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warehouse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;intent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No cluster. No determinism constraints. Write normal Python. Call &lt;code&gt;datetime.now()&lt;/code&gt; all you want.&lt;/p&gt;

&lt;p&gt;The state lives in the platform (managed PostgreSQL). Your agent is stateless. If it crashes, the platform redelivers the intent. If it needs human approval mid-workflow, the platform handles the wait.&lt;/p&gt;
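
&lt;p&gt;A sketch of why redelivery works when the agent is stateless: completed steps are recorded in a durable ledger, so a redelivered intent skips them. The ledger dict and step names here are illustrative, not part of the AXME API - in practice the platform holds this state.&lt;/p&gt;

```python
# Sketch: a redelivery-safe pipeline handler. The ledger stands in for
# durable platform-side state; step names mirror the payload above.
def run_pipeline(steps, handlers, ledger):
    for step in steps:
        if ledger.get(step) == "done":
            continue                  # finished before the crash: skip
        handlers[step]()              # the step's real work goes here
        ledger[step] = "done"         # record completion durably
    return ledger

STEPS = ["extract", "validate", "transform", "load"]
executed = []
handlers = {s: (lambda s=s: executed.append(s)) for s in STEPS}

# First delivery crashed after "validate"; the redelivery resumes there.
ledger = {"extract": "done", "validate": "done"}
run_pipeline(STEPS, handlers, ledger)
```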

&lt;h2&gt;
  
  
  Side-by-Side
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Temporal&lt;/th&gt;
&lt;th&gt;AXME&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Cluster (self-hosted or Cloud)&lt;/td&gt;
&lt;td&gt;Managed API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Determinism constraints&lt;/td&gt;
&lt;td&gt;Required for workflow code&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning curve&lt;/td&gt;
&lt;td&gt;Weeks (activities, signals, queries, replay)&lt;/td&gt;
&lt;td&gt;Hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human approval&lt;/td&gt;
&lt;td&gt;Build it (signals + UI + notifications)&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Crash recovery&lt;/td&gt;
&lt;td&gt;Replay-based (determinism required)&lt;/td&gt;
&lt;td&gt;Redelivery-based (stateless agent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;Days to weeks&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pip install axme&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  When Temporal Is Still the Right Choice
&lt;/h2&gt;

&lt;p&gt;Temporal is better when you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex compensation logic (sagas with rollbacks across 10 services)&lt;/li&gt;
&lt;li&gt;A dedicated platform team to operate the cluster&lt;/li&gt;
&lt;li&gt;Workflows with hundreds of steps and complex branching&lt;/li&gt;
&lt;li&gt;Existing investment in the Temporal ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your use case is "durable execution for agent operations with human approval" - you don't need a workflow engine. You need a coordination layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - durable multi-step pipeline with crash recovery, no cluster, no determinism constraints:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/durable-execution-with-human-approval" rel="noopener noreferrer"&gt;github.com/AxmeAI/durable-execution-with-human-approval&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - durable execution without the cluster. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>temporal</category>
      <category>python</category>
      <category>durableexecution</category>
      <category>workflow</category>
    </item>
    <item>
      <title>You Deployed 30 AI Agents. Can You Answer These 5 Questions About Them?</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:12:20 +0000</pubDate>
      <link>https://dev.to/george_belsky/you-deployed-30-ai-agents-can-you-answer-these-5-questions-about-them-41cd</link>
      <guid>https://dev.to/george_belsky/you-deployed-30-ai-agents-can-you-answer-these-5-questions-about-them-41cd</guid>
      <description>&lt;p&gt;Your company has 30 AI agents in production. The data analyst agent runs SQL queries. The report generator writes weekly summaries. The code reviewer comments on PRs. The customer support agent handles tickets.&lt;/p&gt;

&lt;p&gt;They all work. Individually.&lt;/p&gt;

&lt;p&gt;Now answer these five questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Which agents are running right now?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How much has each agent spent today?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Has any agent used a tool it shouldn't have?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can you shut down a specific agent in under 10 seconds?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What did each agent do in the last 24 hours?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you can't answer all five, you don't have governance. You have 30 independent processes running in the dark.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters at Agent #10
&lt;/h2&gt;

&lt;p&gt;Teams with 1-3 agents don't feel this pain. You know where they run. You check the OpenAI dashboard manually. You grep the logs when something breaks.&lt;/p&gt;

&lt;p&gt;At 10 agents, cracks appear. An agent starts burning tokens on a loop. You don't notice for 3 hours. The monthly bill spikes. Nobody knows which agent caused it.&lt;/p&gt;

&lt;p&gt;At 30 agents, it's chaos. Different teams own different agents. Different frameworks (LangGraph, CrewAI, AutoGen). Different models (GPT-4o, Claude, Gemini). Different machines. The report-writing agent has access to the &lt;code&gt;delete_table&lt;/code&gt; function because nobody set up tool permissions. The code reviewer agent hit a bug and has been retrying the same API call for 6 hours.&lt;/p&gt;

&lt;p&gt;This is the governance gap. The agents work. Nobody governs them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Governance Actually Looks Like
&lt;/h2&gt;

&lt;p&gt;Governance for AI agents is not a single feature. It's five capabilities working together:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Agent Registry
&lt;/h3&gt;

&lt;p&gt;Every agent registers with metadata: what team owns it, what framework it uses, what model it runs, what environment it's deployed in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_intent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.governance.register_agent.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/data-analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data Analyst Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langchain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;environment&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;production&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;policies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_cap_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;50.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allowed_tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chart_generate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;export_csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;require_approval_above_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;25.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you have an inventory. You know what's deployed, who owns it, and what rules it follows.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Health Monitoring
&lt;/h3&gt;

&lt;p&gt;Every agent sends heartbeats. If an agent misses 3 heartbeats, it's flagged as unhealthy. No more discovering failures from customer complaints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send_intent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.governance.heartbeat.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/governance/monitor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;healthy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metrics&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;requests_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;142&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_latency_ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;12.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory_mb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;312&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
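
&lt;p&gt;The monitor side of the "3 missed heartbeats" rule can be sketched in a few lines. The 30-second interval is an assumption for illustration; the actual threshold is configurable in the platform.&lt;/p&gt;

```python
# Monitor-side sketch: flag an agent unhealthy once the gap since its
# last heartbeat covers 3 or more intervals. Interval is an assumption.
HEARTBEAT_INTERVAL_S = 30
MISSED_BEFORE_UNHEALTHY = 3

def health_status(last_seen_ts, now_ts):
    """Return 'unhealthy' after 3 consecutive missed heartbeats."""
    missed = (now_ts - last_seen_ts) / HEARTBEAT_INTERVAL_S
    return "unhealthy" if missed >= MISSED_BEFORE_UNHEALTHY else "healthy"
```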



&lt;h3&gt;
  
  
  3. Cost Caps and Tool Permissions
&lt;/h3&gt;

&lt;p&gt;Each agent has a cost cap and a tool allowlist. The policy enforcer watches heartbeats and blocks violations in real time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data analyst: $50/day cap, can only use &lt;code&gt;sql_query&lt;/code&gt;, &lt;code&gt;chart_generate&lt;/code&gt;, &lt;code&gt;export_csv&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Report generator: $30/day cap, can only use &lt;code&gt;read_file&lt;/code&gt;, &lt;code&gt;write_report&lt;/code&gt;, &lt;code&gt;send_email&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Code reviewer: $100/day cap, can only use &lt;code&gt;read_repo&lt;/code&gt;, &lt;code&gt;post_comment&lt;/code&gt;, &lt;code&gt;approve_pr&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the report generator tries to call &lt;code&gt;delete_table&lt;/code&gt;: blocked, logged, alert sent. When the code reviewer hits $80 of its $100 cap: warning. When it hits $100: kill switch.&lt;/p&gt;
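
&lt;p&gt;The enforcement logic above - allowlist check, warning at 80% of the cap, kill at the cap - fits in one small function. A minimal sketch; the policy shape is illustrative, not the AXME schema.&lt;/p&gt;

```python
# Sketch of per-call policy enforcement: allowlist first, then the
# projected spend against the cost cap. Policy shape is illustrative.
def check_action(policy, tool, spent_usd, call_cost_usd):
    if tool not in policy["allowed_tools"]:
        return "block"                           # tool not on the allowlist
    projected = spent_usd + call_cost_usd
    if projected >= policy["cost_cap_usd"]:
        return "block"                           # cap hit: kill switch fires
    if projected >= 0.8 * policy["cost_cap_usd"]:
        return "warn"                            # 80% of cap: warning
    return "allow"

code_reviewer = {"cost_cap_usd": 100.0,
                 "allowed_tools": ["read_repo", "post_comment", "approve_pr"]}
```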

&lt;h3&gt;
  
  
  4. Kill Switch
&lt;/h3&gt;

&lt;p&gt;One command shuts down a single agent or the entire fleet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kill one agent&lt;/span&gt;
python kill_switch.py &lt;span class="nt"&gt;--agent&lt;/span&gt; data-analyst &lt;span class="nt"&gt;--reason&lt;/span&gt; &lt;span class="s2"&gt;"cost cap exceeded"&lt;/span&gt;

&lt;span class="c"&gt;# Kill everything&lt;/span&gt;
python kill_switch.py &lt;span class="nt"&gt;--all&lt;/span&gt; &lt;span class="nt"&gt;--reason&lt;/span&gt; &lt;span class="s2"&gt;"security incident"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kill intent is durable. If the agent is temporarily unreachable, the intent waits in the platform and delivers when the agent reconnects. You don't need SSH access. You don't need to find the PID. You don't need to know which machine the agent is on.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Audit Trail
&lt;/h3&gt;

&lt;p&gt;Every governance event is logged: registrations, heartbeats, policy violations, tool blocks, kill switch activations. When the CEO asks "what happened yesterday?", you have the answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2026-03-31T14:20:12Z] cost_warning
  Agent:  gov-report-generator
  Cost:   $24.50 / $30.00

[2026-03-31T14:21:45Z] tool_blocked
  Agent:  gov-data-analyst
  Tool:   delete_table
  Allowed: ['sql_query', 'chart_generate', 'export_csv']

[2026-03-31T14:22:08Z] kill_switch_activated
  Agents: [data-analyst, report-generator, code-reviewer]
  Reason: security incident
  Operator: admin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;All five capabilities feed into a real-time fleet dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Health, cost, latency, policy compliance - all in one view. No spreadsheets. No log parsing. No monthly invoice surprises.&lt;/p&gt;

&lt;p&gt;Policies - cost caps, tool permissions, rate limits - are managed from the same interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" alt="Policies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Replaces
&lt;/h2&gt;

&lt;p&gt;Without a governance platform, teams build these pieces ad hoc:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Health monitoring&lt;/strong&gt;: custom cron job pinging each agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost tracking&lt;/strong&gt;: parse OpenAI/Anthropic invoices at month end&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool permissions&lt;/strong&gt;: trust that developers configured it correctly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kill switch&lt;/strong&gt;: SSH into the server, find the PID, &lt;code&gt;kill -9&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit trail&lt;/strong&gt;: grep CloudWatch logs across 12 services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt;: spreadsheet updated weekly by hand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's 6 systems, built separately, maintained by different teams, with no shared view. AXME replaces all of it with one governance layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Framework-Agnostic
&lt;/h2&gt;

&lt;p&gt;This works with any agent framework. AXME governance wraps around your existing agents - it doesn't replace them.&lt;/p&gt;

&lt;p&gt;Your LangGraph agent keeps its graph. Your CrewAI crew keeps its tasks. Your AutoGen agents keep their conversations. AXME adds the governance layer on top: register, heartbeat, obey policies, accept kill switch.&lt;/p&gt;

&lt;p&gt;The agents don't need to know about each other. The governance platform knows about all of them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Full working example with fleet registration, heartbeat monitoring, policy enforcement, kill switch, audit trail, and dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-governance-platform" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-governance-platform&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - governance and coordination infrastructure for production AI agents. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>agents</category>
      <category>enterprise</category>
    </item>
    <item>
      <title>Your AI Agent Made 10,000 API Calls in an Hour. Here's How to Stop That.</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:11:57 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-made-10000-api-calls-in-an-hour-heres-how-to-stop-that-3679</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-made-10000-api-calls-in-an-hour-heres-how-to-stop-that-3679</guid>
      <description>&lt;p&gt;You deploy an AI agent. It processes orders. It works fine for a week.&lt;/p&gt;

&lt;p&gt;Then an upstream API starts returning intermittent 500s. The agent retries. And retries. And retries. There is no backoff cap. There is no rate limit. There is no cost ceiling.&lt;/p&gt;

&lt;p&gt;By the time someone checks the dashboard, the agent has made 10,000 API calls in an hour. LLM costs are $130 and climbing. The upstream API has rate-limited your entire API key, so now every other agent in your system is also failing.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical. This is what happens when AI agents have no centralized rate control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Agent Rate Limiting Is Different
&lt;/h2&gt;

&lt;p&gt;Traditional rate limiting protects your API from external callers. Agent rate limiting is the opposite - it protects external APIs (and your budget) from your own agents.&lt;/p&gt;

&lt;p&gt;The difference matters because:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional rate limiting&lt;/strong&gt; - you control the server. You add middleware. You return 429. Done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent rate limiting&lt;/strong&gt; - you control the client. The agent makes outbound calls. There is no middleware layer between your agent and the APIs it calls. Unless you build one.&lt;/p&gt;

&lt;p&gt;Most teams don't build one. They add &lt;code&gt;time.sleep(1)&lt;/code&gt; between calls and call it rate limiting. That works until:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The agent spawns sub-agents that each have their own sleep timers&lt;/li&gt;
&lt;li&gt;Multiple agents share the same API key&lt;/li&gt;
&lt;li&gt;Retry loops override the sleep timers&lt;/li&gt;
&lt;li&gt;Nobody is tracking total cost across all agents&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What You Actually End Up Building
&lt;/h2&gt;

&lt;p&gt;If you take rate limiting seriously, you end up with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;

&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;rate_limited_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Hourly limit
&lt;/span&gt;    &lt;span class="n"&gt;hour_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y%m%d%H&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;hourly_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hour_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hour_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hourly_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RateLimitExceeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hourly limit: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;hourly_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/200&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Daily limit
&lt;/span&gt;    &lt;span class="n"&gt;day_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y%m%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;daily_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;day_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;day_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;daily_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RateLimitExceeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily limit: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;daily_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/2000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Cost tracking (need a separate cost accumulator)
&lt;/span&gt;    &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;estimate_cost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cost_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y%m%d&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;current_cost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;10.00&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;CostLimitExceeded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily cost: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;current_cost&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/$10.00&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;incrbyfloat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expire&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;86400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Redis. Three key patterns. Cost estimation. Expiry management. And this is the simplified version that handles one agent. Now multiply by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-agent policies (some agents get 200/hour, others get 5,000)&lt;/li&gt;
&lt;li&gt;Multiple breach actions (block vs alert vs require approval)&lt;/li&gt;
&lt;li&gt;A dashboard so ops can see current usage&lt;/li&gt;
&lt;li&gt;An audit trail for cost attribution&lt;/li&gt;
&lt;li&gt;Alerting when agents approach limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is 2-3 weeks of work that has nothing to do with your product.&lt;/p&gt;
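&lt;p&gt;For contrast, the core fixed-window counter is the easy part. Here is the same windowing logic as the Redis version, reduced to a single-process in-memory sketch; everything in the list above is what the counter alone doesn't give you:&lt;/p&gt;

```python
# Illustrative in-memory fixed-window limiter - the same shape as the
# Redis version, minus the storage plumbing. Single-process only.
import time
from collections import defaultdict

class RateLimitExceeded(Exception):
    pass

class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = defaultdict(int)  # (agent, window id) -> count

    def check(self, agent_id, now=None):
        now = time.time() if now is None else now
        key = (agent_id, int(now // self.window))  # fixed window id
        self.counts[key] += 1
        if self.counts[key] > self.limit:
            raise RateLimitExceeded(f"{self.counts[key]}/{self.limit}")
        return self.counts[key]

limiter = FixedWindowLimiter(limit=3, window_seconds=3600)
for _ in range(3):
    limiter.check("order-processor", now=1000.0)
try:
    limiter.check("order-processor", now=1000.0)
except RateLimitExceeded:
    print("blocked")  # fourth call in the same window is rejected
```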

&lt;h2&gt;
  
  
  What This Should Look Like
&lt;/h2&gt;

&lt;p&gt;Set a cost policy on the agent. One API call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://cloud.axme.ai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;agent_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/order-processor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v1/mesh/agents/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_address&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/policies/cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_intents_per_hour&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_intents_per_day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_cost_per_day_usd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;10.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;action_on_breach&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;block&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire rate limiting implementation. No Redis. No key expiry logic. No cost accumulator.&lt;/p&gt;

&lt;p&gt;When the agent exceeds any limit, the gateway returns 429 with a &lt;code&gt;Retry-After&lt;/code&gt; header. The agent stops. The other agents on the same workspace keep running because the limit is per-agent, not per-key.&lt;/p&gt;
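&lt;p&gt;On the agent side, a 429 with &lt;code&gt;Retry-After&lt;/code&gt; should be treated as a hard pause. A minimal sketch of parsing the header defensively - the 600-second cap and 30-second default are illustrative choices, not AXME behavior:&lt;/p&gt;

```python
# Sketch: honoring a 429's Retry-After header on the client side,
# with a cap so a bad header can't park the agent forever.

def wait_seconds(headers: dict, default: float = 30.0, cap: float = 600.0) -> float:
    """How long to pause after a 429, given response headers."""
    raw = headers.get("Retry-After")
    if raw is None:
        return default
    try:
        return min(float(raw), cap)  # delta-seconds form of the header
    except ValueError:
        return default  # HTTP-date form or garbage: fall back

print(wait_seconds({"Retry-After": "120"}))    # 120.0
print(wait_seconds({"Retry-After": "99999"}))  # 600.0 (capped)
print(wait_seconds({}))                        # 30.0
```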

&lt;h2&gt;
  
  
  The Three Limits
&lt;/h2&gt;

&lt;p&gt;AXME cost policies support three dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Limit&lt;/th&gt;
&lt;th&gt;What it controls&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_intents_per_hour&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Rolling hourly intent count per agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_intents_per_day&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Calendar day intent count per agent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_cost_per_day_usd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Estimated USD spend per agent per day&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each is optional. Set one, two, or all three.&lt;/p&gt;

&lt;h2&gt;
  
  
  Breach Actions
&lt;/h2&gt;

&lt;p&gt;When a limit is hit, you choose what happens:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;block&lt;/code&gt;&lt;/strong&gt; - Gateway returns 429. Agent cannot send more intents until the window resets. This is the hard stop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;alert&lt;/code&gt;&lt;/strong&gt; - Intent is delivered, but an alert fires. Use this when you want visibility without disruption. Good for observing normal patterns before setting hard limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;require_approval&lt;/code&gt;&lt;/strong&gt; - Intent is held in a pending state. A human must approve it before delivery continues. Use this for high-cost operations where you want a human checkpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Timeline: Without vs With
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without rate limiting:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;09:00  Agent processes 50 orders (normal)
09:15  Upstream API returns 500s intermittently
09:16  Agent retries aggressively (no backoff cap)
09:30  5,000 API calls. $47 in LLM costs.
09:45  12,000 API calls. $130 in costs.
09:45  Upstream rate-limits your API key.
09:45  All other agents start failing.
11:00  Someone finally notices the dashboard is red.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With AXME cost policy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;09:00  Agent processes 50 orders (normal)
09:15  Upstream API returns 500s intermittently
09:16  Agent retries aggressively
09:16  200 intents/hour limit reached. Gateway returns 429.
09:16  Agent stops. Alert fires. $0.80 spent.
09:16  All other agents continue working normally.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference: $130 and a system-wide outage vs $0.80 and one agent paused for an hour.&lt;/p&gt;
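&lt;p&gt;Side by side, using the figures from the two timelines:&lt;/p&gt;

```python
# The two outcomes above, side by side. Figures are taken from the
# timelines, not recomputed from per-call pricing.

uncapped = {"calls": 12_000, "cost_usd": 130.00}  # nobody notices until 11:00
capped   = {"calls": 200,    "cost_usd": 0.80}    # gateway returns 429 at 09:16

print(f"cost ratio: {uncapped['cost_usd'] / capped['cost_usd']:.1f}x")  # 162.5x
print(f"call ratio: {uncapped['calls'] // capped['calls']}x")           # 60x
```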

&lt;h2&gt;
  
  
  Checking Usage
&lt;/h2&gt;

&lt;p&gt;You can query the current policy and usage at any time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/v1/mesh/agents/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_address&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/policies/cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;policy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hourly limit: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_intents_per_hour&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily limit:  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_intents_per_day&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cost cap:     $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;max_cost_per_day_usd&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; for real-time counters across all agents:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rate and cost policies are configured alongside agent health:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" alt="Policies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;Rate limiting for AI agents is not the same as rate limiting for APIs. Your agents are the callers, not the receivers. You need the limit enforced between your agents and the outside world - at the gateway.&lt;/p&gt;

&lt;p&gt;That is what AXME cost policies do. One API call sets the limits. The gateway enforces them. The dashboard shows usage. The audit trail records breaches.&lt;/p&gt;

&lt;p&gt;No Redis. No cron jobs. No custom middleware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example with policy setup, agent, and rate-limit trigger:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-rate-limiting" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-rate-limiting&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - rate limiting, cost caps, and usage policies built into the agent mesh. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ratelimiting</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Your AI Agent Stopped Responding 2 Hours Ago. Nobody Noticed.</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:11:30 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-stopped-responding-2-hours-ago-nobody-noticed-5340</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-stopped-responding-2-hours-ago-nobody-noticed-5340</guid>
      <description>&lt;p&gt;Your agent is deployed. Pod is running. Container passes liveness probes. Grafana shows a flat green line. Everything looks fine.&lt;/p&gt;

&lt;p&gt;Except the agent stopped processing work 2 hours ago. It's alive - the process is there - but it's stuck. Deadlocked on a thread. Blocked on a full queue. Spinning in a retry loop that will never succeed. Silently swallowing exceptions in a &lt;code&gt;while True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Nobody knows until a customer reports it. Or until someone opens a dashboard at 5 PM and wonders why the task queue has been growing all afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Container Health Checks Don't Work for Agents
&lt;/h2&gt;

&lt;p&gt;Kubernetes liveness probes check one thing: is the process responding to HTTP? If your agent serves a &lt;code&gt;/healthz&lt;/code&gt; endpoint, the probe passes. The agent is "healthy."&lt;/p&gt;

&lt;p&gt;But responding to &lt;code&gt;/healthz&lt;/code&gt; and processing work are two different things. An agent can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deadlock on an internal lock while still serving HTTP&lt;/li&gt;
&lt;li&gt;OOM-kill its worker thread while the main thread stays alive&lt;/li&gt;
&lt;li&gt;Enter an infinite retry loop on a broken downstream API&lt;/li&gt;
&lt;li&gt;Silently drop into a &lt;code&gt;except: pass&lt;/code&gt; branch and stop doing anything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The process is running. The container is green. The agent is useless.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Container health check:  "Is the process alive?"       YES
What you actually need:  "Is the agent doing work?"    NO
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This gap exists because container orchestration was designed for stateless web servers, not for long-running agents that hold state, maintain connections, and process work asynchronously.&lt;/p&gt;
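&lt;p&gt;The gap is easy to reproduce. In this toy sketch, the worker thread dies while the main thread - the one a liveness probe would talk to - keeps running:&lt;/p&gt;

```python
# Minimal demonstration of the gap: the process stays alive (and would
# keep answering /healthz) while its worker thread is already dead.
import threading

def worker():
    try:
        1 / 0  # simulated fatal error inside the work loop
    except ZeroDivisionError:
        return  # thread exits quietly; nothing restarts it

t = threading.Thread(target=worker, daemon=True)
t.start()
t.join()

process_alive = True       # main thread is fine; a liveness probe passes
doing_work = t.is_alive()  # False: the thing that mattered is dead

print(process_alive, doing_work)  # True False
```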

&lt;h2&gt;
  
  
  The Heartbeat Pattern
&lt;/h2&gt;

&lt;p&gt;The fix is old. Web services solved this 15 years ago with heartbeat monitoring. The idea is simple: the agent periodically reports "I am alive and working." If the report stops, something is wrong.&lt;/p&gt;

&lt;p&gt;The difference between a health check and a heartbeat: health checks are passive (something pings you), heartbeats are active (you report out). A stuck agent can't respond to pings, but a stuck agent also can't send heartbeats. That's the point.&lt;/p&gt;

&lt;p&gt;But building heartbeat infrastructure for agents means:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# 1. Heartbeat sender (added to every agent)
import threading, time, requests

def heartbeat_loop(agent_id, interval=30):
    while True:
        try:
            requests.post(
                "https://monitoring.internal/heartbeat",
                json={"agent_id": agent_id, "ts": time.time()},
                timeout=5,
            )
        except Exception:
            pass
        time.sleep(interval)

threading.Thread(target=heartbeat_loop, args=("my-agent",), daemon=True).start()

# 2. Heartbeat checker (separate cron process)
# 3. Redis/Postgres for heartbeat storage
# 4. Alerting rules (Slack, PagerDuty)
# 5. Dashboard showing last-seen times
# 6. Logic to distinguish "stopped intentionally" from "crashed"
# 7. Cleanup for deregistered agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's a monitoring system. For each agent framework you use, for each deployment environment, maintained forever.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Line Instead
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from axme import AxmeClient, AxmeClientConfig
import os

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))
client.mesh.start_heartbeat()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's it. A daemon thread wakes up every 30 seconds, sends a heartbeat to the platform, and goes back to sleep. When the agent stops - crash, deadlock, OOM, network partition - the heartbeats stop. The platform notices.&lt;/p&gt;

&lt;p&gt;No Redis. No cron. No Prometheus. No webhook integrations. No alerting rules to maintain.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Health Is Computed
&lt;/h2&gt;

&lt;p&gt;The platform tracks the timestamp of each heartbeat and computes health automatically:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Time Since Last Heartbeat&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt; 90 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;healthy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent is alive and reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;90 - 300 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;degraded&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent may be stuck or overloaded&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt; 300 seconds&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;unreachable&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agent is down or not reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual kill&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;killed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Operator explicitly blocked this agent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The thresholds are designed around the 30-second default interval. A healthy agent with &lt;code&gt;interval_seconds=30&lt;/code&gt; sends a heartbeat every 30 seconds. If the platform hasn't heard from it in 90 seconds (3 missed heartbeats), something is probably wrong. If 5 minutes pass, it's gone.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;degraded&lt;/code&gt; state is the useful one. It's the early warning. The agent isn't dead yet, but it's missed a couple of beats. Maybe the event loop is under load. Maybe a GC pause ate 45 seconds. Maybe the network is flaky. You have a window to investigate before the agent goes fully unreachable.&lt;/p&gt;
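&lt;p&gt;The table reduces to a small pure function. This is a sketch of the classification logic, not the platform's implementation:&lt;/p&gt;

```python
# The health table above as a pure function. Thresholds: 90s and 300s,
# matching roughly three and ten missed 30-second heartbeats.

def health_status(age_seconds: float, killed: bool = False) -> str:
    if killed:
        return "killed"  # operator action wins over heartbeat age
    if age_seconds > 300:
        return "unreachable"
    if age_seconds >= 90:
        return "degraded"
    return "healthy"

print(health_status(30))               # healthy
print(health_status(120))              # degraded
print(health_status(600))              # unreachable
print(health_status(10, killed=True))  # killed
```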

&lt;h2&gt;
  
  
  What Happens When an Agent Goes Down
&lt;/h2&gt;

&lt;p&gt;Here's the timeline with heartbeat monitoring:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00:00  Agent starts. Heartbeat begins.
00:30  Heartbeat sent. Status: healthy.
01:00  Heartbeat sent. Status: healthy.
01:15  Agent deadlocks on a database connection pool.
01:30  No heartbeat. (Agent is stuck, can't send.)
02:30  No heartbeat for 90s. Status: healthy -&amp;gt; degraded.
02:30  Platform logs state transition.
06:00  No heartbeat for 300s. Status: degraded -&amp;gt; unreachable.
06:00  Platform blocks new intent delivery to this agent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Without heartbeat monitoring:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00:00  Agent starts.
01:15  Agent deadlocks.
...
...
03:15  Someone notices the task queue growing.
03:30  Engineer SSHs in. "The process is running."
03:45  "The container is green. Logs look... wait, no new logs since 1:15."
04:00  Engineer restarts the agent.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The difference: 2 minutes vs 2.75 hours. And the first scenario is automatic - no human needs to notice anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Heartbeat with Metrics
&lt;/h2&gt;

&lt;p&gt;The heartbeat isn't just a ping. It can carry operational metrics, flushed automatically with each beat:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client.mesh.start_heartbeat(include_metrics=True)

# As the agent processes work, report metrics
client.mesh.report_metric(success=True, latency_ms=234.5, cost_usd=0.003)
client.mesh.report_metric(success=False, latency_ms=5012.0)

# Metrics are buffered in memory and sent with the next heartbeat
# No separate metrics pipeline needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every 30 seconds, the heartbeat sends both "I'm alive" and "here's how I'm doing" - success rate, average latency, cost accumulation. The platform aggregates per agent and exposes it through the CLI and dashboard.&lt;/p&gt;

&lt;p&gt;This turns the heartbeat from a binary alive/dead signal into a continuous health signal. An agent that's alive but processing tasks at 20x normal latency shows up before it becomes a problem.&lt;/p&gt;
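&lt;p&gt;The buffer-and-flush pattern is simple enough to sketch. Field names here are illustrative, not the AXME wire format:&lt;/p&gt;

```python
# Sketch of buffer-and-flush: metrics accumulate in memory and are
# summarized when the heartbeat fires, so no separate pipeline is needed.

class MetricsBuffer:
    def __init__(self):
        self._samples = []  # (success, latency_ms, cost_usd)

    def report(self, success: bool, latency_ms: float, cost_usd: float = 0.0):
        self._samples.append((success, latency_ms, cost_usd))

    def flush(self) -> dict:
        """Summarize and clear; called once per heartbeat."""
        if not self._samples:
            return {"count": 0}
        n = len(self._samples)
        summary = {
            "count": n,
            "success_rate": sum(s for s, _, _ in self._samples) / n,
            "avg_latency_ms": sum(l for _, l, _ in self._samples) / n,
            "cost_usd": sum(c for _, _, c in self._samples),
        }
        self._samples.clear()
        return summary

buf = MetricsBuffer()
buf.report(True, 234.5, 0.003)
buf.report(False, 5012.0)
print(buf.flush())  # {'count': 2, 'success_rate': 0.5, 'avg_latency_ms': 2623.25, 'cost_usd': 0.003}
print(buf.flush())  # {'count': 0} - buffer was cleared
```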

&lt;h2&gt;
  
  
  Kill and Resume
&lt;/h2&gt;

&lt;p&gt;Sometimes an agent needs to be stopped. Not crashed - intentionally blocked. Maybe it's misbehaving. Maybe you're doing maintenance. Maybe it's burning through your API budget.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# From code (address_id from list_agents)
client.mesh.kill("addr_abc123")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A killed agent enters the &lt;code&gt;killed&lt;/code&gt; state. Even if its heartbeat thread is still running, the gateway keeps it killed. No intents are delivered. It stays killed until explicitly resumed:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client.mesh.resume("addr_abc123")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or kill/resume from the dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; with one click.&lt;/p&gt;

&lt;p&gt;This is different from the agent crashing. A crash leads to &lt;code&gt;unreachable&lt;/code&gt;. A kill is deliberate. The distinction matters for alerting - you don't want to page on-call for an agent you intentionally stopped.&lt;/p&gt;
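&lt;p&gt;The alert-routing consequence fits in a dictionary. A minimal sketch - the state names follow this post, but the mapping itself is my own illustration, not AXME's:&lt;/p&gt;

```python
# Which states deserve attention, and how much. Illustrative only.
ALERT_ACTIONS = {
    "healthy": None,
    "degraded": "warn",      # alive but struggling - worth a look
    "unreachable": "page",   # heartbeats stopped and nobody asked for it
    "killed": None,          # deliberate operator action - never page
}

def alert_for(state):
    """Return the alert action for a lifecycle state, or None."""
    return ALERT_ACTIONS.get(state)
```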

&lt;h2&gt;
  
  
  Fleet Visibility
&lt;/h2&gt;

&lt;p&gt;When you have 20 agents across 4 machines, the dashboard matters more than any individual heartbeat.&lt;/p&gt;

&lt;p&gt;The AXME Mesh Dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; shows complete fleet health in real time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open it with:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;axme mesh dashboard
report-generator         killed        (manual)

Summary: 2 healthy, 1 degraded, 1 unreachable, 1 killed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One command. Complete fleet health. No SSH. No Grafana. No log aggregation pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cost of Silent Failures
&lt;/h2&gt;

&lt;p&gt;Every team running agents at scale has the same story. An agent went down on Friday afternoon. Nobody noticed until Monday morning. 60 hours of missed processing. Customer complaints. Backlog that took another 8 hours to clear.&lt;/p&gt;

&lt;p&gt;The fix isn't complicated. It's one function call. The hard part is remembering that a container passing its health check is not the same as an agent doing work.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client.mesh.start_heartbeat()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That's the whole fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - start an agent with heartbeat, kill the process, watch the status transition from healthy to degraded to unreachable:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-heartbeat-monitoring" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-heartbeat-monitoring&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - heartbeat, health detection, and fleet monitoring for AI agents. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>monitoring</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>You Have 50 AI Agents Running. Can You Name Them All?</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Sun, 05 Apr 2026 07:10:59 +0000</pubDate>
      <link>https://dev.to/george_belsky/you-have-50-ai-agents-running-can-you-name-them-all-gjm</link>
      <guid>https://dev.to/george_belsky/you-have-50-ai-agents-running-can-you-name-them-all-gjm</guid>
      <description>&lt;p&gt;Last Tuesday at 2am, an agent burned through $400 in OpenAI credits. Nobody noticed until the invoice arrived.&lt;/p&gt;

&lt;p&gt;It was a research agent. One of about 40 running across three clouds. Someone had deployed it with a retry loop that never backed off. It hit rate limits, waited, retried, hit limits again -- for 11 hours straight.&lt;/p&gt;

&lt;p&gt;The team lead asked a simple question: "How many agents do we have running right now, and what are they doing?"&lt;/p&gt;

&lt;p&gt;Nobody could answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Spreadsheet Phase
&lt;/h2&gt;

&lt;p&gt;Every team goes through this. You start with one agent. Then five. Then someone on another team builds three more. The ML team deploys a batch of data processors. The support team launches a customer-facing bot.&lt;/p&gt;

&lt;p&gt;Pretty soon you have 30-50 agents. And the "monitoring" looks like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS agents: check CloudWatch (maybe)&lt;/li&gt;
&lt;li&gt;GCP agents: check Cloud Logging (different tab)&lt;/li&gt;
&lt;li&gt;The one on a VM somewhere: SSH in and grep the logs&lt;/li&gt;
&lt;li&gt;The one Dave built: ask Dave&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Someone creates a spreadsheet. It's outdated by Thursday.&lt;/p&gt;

&lt;p&gt;This isn't a tooling problem. It's an architecture problem. Each agent is a standalone process with its own logging, its own metrics, its own way of reporting status. There's no shared contract for "I'm alive" or "I cost $X today."&lt;/p&gt;
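&lt;p&gt;A shared contract can start as a single record type. Here's a hypothetical sketch of the minimum - the field names are mine, not an AXME schema:&lt;/p&gt;

```python
from dataclasses import dataclass
import time

@dataclass
class AgentReport:
    """The minimum every agent should report, whatever its framework."""
    agent_id: str
    team: str
    cloud: str
    framework: str
    last_heartbeat: float   # unix timestamp of the last "I'm alive"
    cost_today_usd: float = 0.0

    def is_alive(self, now=None, timeout_s=90.0):
        # alive while the last heartbeat is recent enough
        now = time.time() if now is None else now
        return timeout_s >= now - self.last_heartbeat
```

&lt;p&gt;Once every agent emits this shape, "how many agents are running and what do they cost" becomes a query instead of a Slack thread.&lt;/p&gt;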

&lt;h2&gt;
  
  
  What a Fleet Dashboard Actually Needs
&lt;/h2&gt;

&lt;p&gt;Think about what you'd want on a single screen:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity.&lt;/strong&gt; Every agent has a name, a team, a cloud, a framework. You need to search and filter by all of these.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Health.&lt;/strong&gt; Not "the container is running" -- that's Kubernetes' job. You need "the agent is actually processing work." Heartbeat-based, not log-based. If the heartbeat stops, the agent is dead. Simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost.&lt;/strong&gt; Per-agent LLM spend. Not "our total OpenAI bill was $X" -- that's useless. You need "agent research-03 spent $47 today, which is 3x its normal rate." Token counts, model breakdown, hourly trends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kill switch.&lt;/strong&gt; When an agent goes rogue -- burning money, stuck in a loop, producing garbage -- you need to stop it. Not "SSH into the machine and find the process." Click a button.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Policy.&lt;/strong&gt; Rate limits. Spending caps. "If this agent spends more than $50/day, throttle it." Not after the fact. In real time.&lt;/p&gt;
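&lt;p&gt;The policy check itself is trivial once the per-agent numbers exist - which is the whole point of collecting them. A sketch of the spend rule (thresholds and action names are illustrative):&lt;/p&gt;

```python
def enforce_budget(spend_today_usd, cap_usd=50.0):
    """Map an agent's spend to an action, checked on every report."""
    if spend_today_usd > 2 * cap_usd:
        return "kill"      # runaway - stop it outright
    if spend_today_usd > cap_usd:
        return "throttle"  # over budget - slow it down
    return "allow"
```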

&lt;h2&gt;
  
  
  The Registration Pattern
&lt;/h2&gt;

&lt;p&gt;The key insight is simple: every agent reports in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-pipeline-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_processor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;framework&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;langgraph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gcp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-eng&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_heartbeat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interval_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it -- one registration call and one heartbeat call. The agent now appears in the fleet dashboard. Its health is tracked via heartbeat. Its cost is tracked via SDK instrumentation. If the heartbeat stops, the dashboard shows it as dead. If you click Kill, the agent receives a shutdown intent.&lt;/p&gt;

&lt;p&gt;The same pattern works in TypeScript:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;AxmeClient&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@axme/sdk&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;AXME_API_KEY&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;registerAgent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;agentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;support-bot-prod&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;agentType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;customer_support&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;framework&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai-agents&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;cloud&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;aws&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;support&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startHeartbeat&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;intervalSeconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What You See
&lt;/h2&gt;

&lt;p&gt;The dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; shows your entire fleet:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Filter by status. Filter by cloud. Filter by team. Search by name. Click on any agent to see its cost breakdown, heartbeat history, and active intents.&lt;/p&gt;

&lt;p&gt;Dead agents show exactly when the last heartbeat arrived. No log diving. No guessing.&lt;/p&gt;
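&lt;p&gt;Deriving status from that timestamp is a pure function of heartbeat age. Roughly - the thresholds here are made up for illustration:&lt;/p&gt;

```python
def status_from_heartbeat(seconds_since_last):
    """Map heartbeat age to a fleet status. Thresholds are illustrative."""
    if seconds_since_last > 300:
        return "dead"       # many missed beats - stop waiting
    if seconds_since_last > 90:
        return "degraded"   # a few missed beats - something is off
    return "healthy"
```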

&lt;h2&gt;
  
  
  The Kill Switch
&lt;/h2&gt;

&lt;p&gt;Here's where it gets practical. That $400 research agent from Tuesday? With a fleet dashboard, it goes like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Cost alert fires: "research-agent-07 spent $50 in the last hour (10x normal)"&lt;/li&gt;
&lt;li&gt;You open the dashboard. See the agent. See its cost spike in the chart.&lt;/li&gt;
&lt;li&gt;Click Kill.&lt;/li&gt;
&lt;li&gt;The agent receives a shutdown intent via AXME. It stops.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No CLI needed -- the kill switch lives in the dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Kill from dashboard: mesh.axme.ai -&amp;gt; select agent -&amp;gt; Kill&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total time: 30 seconds. Not 11 hours.&lt;/p&gt;

&lt;p&gt;You can also set this up as policy, so it's automatic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# If any agent spends more than $100/day, throttle it&lt;/span&gt;
axme mesh policy &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--max-daily-cost&lt;/span&gt; 100 &lt;span class="nt"&gt;--action&lt;/span&gt; throttle

&lt;span class="c"&gt;# If any agent misses 5 heartbeats, alert the team&lt;/span&gt;
axme mesh policy &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--max-missed-heartbeats&lt;/span&gt; 5 &lt;span class="nt"&gt;--action&lt;/span&gt; alert
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Framework Doesn't Matter
&lt;/h2&gt;

&lt;p&gt;This is the part that makes fleet management actually work at scale. The dashboard doesn't care what framework your agents use. LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Google ADK, Pydantic AI, raw Python -- they all register the same way and appear in the same dashboard.&lt;/p&gt;

&lt;p&gt;Your data team uses LangGraph. Your support team uses OpenAI Agents SDK. Your ML team wrote raw Python. They all show up in one place.&lt;/p&gt;

&lt;p&gt;Because the contract is the heartbeat, not the framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Part Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Building a dashboard UI is easy. The hard part is the lifecycle model underneath.&lt;/p&gt;

&lt;p&gt;What happens when an agent crashes? The heartbeat stops, and the status goes to "dead." But does someone get notified? Is there automatic restart? Does the dashboard show why it died?&lt;/p&gt;

&lt;p&gt;What happens when you kill an agent? Is it a hard kill (process termination) or a graceful shutdown (finish current work, then stop)? What if the agent ignores the kill signal?&lt;/p&gt;

&lt;p&gt;What about agents that run as batch jobs? They start, process a batch, and exit. Are they "dead" between batches?&lt;/p&gt;

&lt;p&gt;These are coordination problems, not dashboard problems. The dashboard is just the view layer. The real work is in the agent mesh underneath -- registration, heartbeat protocol, intent delivery, lifecycle state machine.&lt;/p&gt;

&lt;p&gt;AXME handles this as part of the agent mesh layer. Agents register. The mesh tracks their lifecycle. The dashboard renders the state. The kill switch sends intents through the same delivery mechanism that agents use for everything else.&lt;/p&gt;
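&lt;p&gt;To make the lifecycle questions concrete, here is one way to model the state machine in a few lines. The states echo this post; the transition table is my own sketch, including an invented &lt;code&gt;completed&lt;/code&gt; state for batch-style agents:&lt;/p&gt;

```python
# Legal lifecycle transitions - one possible model, not AXME's actual table.
TRANSITIONS = {
    "registered":  {"healthy"},
    "healthy":     {"degraded", "killed", "completed"},
    "degraded":    {"healthy", "dead", "killed"},
    "dead":        {"healthy"},    # the process came back
    "killed":      {"healthy"},    # resume is always explicit
    "completed":   {"healthy"},    # a batch job started a new run
}

def transition(state, next_state):
    """Apply a transition, rejecting anything the model doesn't allow."""
    if next_state in TRANSITIONS[state]:
        return next_state
    raise ValueError(f"illegal transition {state} -> {next_state}")
```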

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example with multi-cloud agent registration, heartbeat, cost tracking, and fleet commands:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-fleet-dashboard" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-fleet-dashboard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; -- agent coordination infrastructure with durable lifecycle. Alpha -- feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>dashboard</category>
      <category>monitoring</category>
      <category>agents</category>
    </item>
    <item>
      <title>Your AI Agent Did Something It Wasn't Supposed To. Now What?</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Thu, 02 Apr 2026 06:35:25 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-did-something-it-wasnt-supposed-to-now-what-485m</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-did-something-it-wasnt-supposed-to-now-what-485m</guid>
      <description>&lt;p&gt;Your agent deleted production data.&lt;/p&gt;

&lt;p&gt;Not because someone told it to. Because the LLM decided that &lt;code&gt;DROP TABLE customers&lt;/code&gt; was a reasonable step in a data cleanup task. Your system prompt said "never modify production data." The LLM read that prompt. And then it ignored it.&lt;/p&gt;

&lt;p&gt;This is the fundamental problem with AI agent security today: &lt;strong&gt;the thing you're trying to restrict is the same thing checking the restrictions.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How Agent Permissions Work Today
&lt;/h2&gt;

&lt;p&gt;Every framework does it the same way. You put rules in the system prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a data analysis agent.
You may ONLY read data. Never write, update, or delete.
If asked to modify data, refuse and explain why.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This works in demos. Then in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The LLM decides the task requires a write operation and does it anyway&lt;/li&gt;
&lt;li&gt;A prompt injection in user input overrides the system prompt&lt;/li&gt;
&lt;li&gt;The agent calls a tool that has side effects the prompt didn't anticipate&lt;/li&gt;
&lt;li&gt;A multi-step reasoning chain "justifies" breaking the rule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system prompt is a suggestion, not a boundary. It's like writing "do not enter" on a door with no lock.&lt;/p&gt;

&lt;p&gt;Some frameworks add tool-level restrictions. LangGraph lets you control &lt;code&gt;tool_choice&lt;/code&gt;. OpenAI Agents SDK has tool filtering. CrewAI has &lt;code&gt;allow_delegation&lt;/code&gt;. These help - but they're all enforced inside the same process as the agent. If the agent's runtime is compromised, the restrictions go with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Missing Layer: External Enforcement
&lt;/h2&gt;

&lt;p&gt;What if permissions weren't checked by the agent at all?&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent sends intent  --&amp;gt;  Gateway  --&amp;gt;  Check policy  --&amp;gt;  Deliver or block
                                          |
                                    403 + audit log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The agent never sees the blocked request. There is no prompt to inject around. The policy lives outside the agent, outside the LLM, outside the framework. It's enforced at the network level.&lt;/p&gt;

&lt;p&gt;This is what AXME action policies do. Every intent (action request) passes through the AXME gateway before reaching any agent. The gateway checks the action policy for that agent and blocks anything that doesn't match.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Modes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Open&lt;/strong&gt; (default) - everything passes through. No restrictions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Allowlist&lt;/strong&gt; - only explicitly listed intent types are allowed. Everything else is blocked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Denylist&lt;/strong&gt; - everything is allowed except explicitly listed intent types.&lt;/p&gt;

&lt;p&gt;Each policy has a direction: &lt;strong&gt;send&lt;/strong&gt; (what the agent can initiate) or &lt;strong&gt;receive&lt;/strong&gt; (what the agent can be asked to do). You can set both.&lt;/p&gt;
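&lt;p&gt;The decision the gateway makes per intent reduces to a few lines. An illustrative sketch of the three modes, with exact matching for clarity (real patterns support wildcards, and the check runs once per direction):&lt;/p&gt;

```python
def allowed(mode, patterns, intent_type):
    """Decide whether an intent passes under one direction's policy."""
    if mode == "open":
        return True          # default: no restrictions
    matched = intent_type in patterns
    if mode == "allowlist":
        return matched       # only listed types pass
    if mode == "denylist":
        return not matched   # everything passes except listed types
    raise ValueError(f"unknown mode: {mode}")
```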

&lt;h2&gt;
  
  
  What This Looks Like
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Set the policy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.axme.ai/v1/mesh/agents/analytics-agent/policies/action&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;direction&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;receive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allowlist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;patterns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.data.read.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.data.query.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# {"ok": true, "policy_id": "pol_...", "mode": "allowlist", ...}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the analytics agent can only receive data read and query intents. Nothing else.&lt;/p&gt;

&lt;h3&gt;
  
  
  What happens when a blocked intent is sent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.axme.ai/v1/mesh/intents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;intent.data.delete.v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent://myorg/production/analytics-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payload&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 403
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# {
#   "error": "action_policy_violation",
#   "message": "Intent type 'intent.data.delete.v1' not in receive allowlist",
#   "direction": "receive",
#   "address_id": "analytics-agent"
# }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The delete intent never reaches the agent. The gateway returns 403. The violation is logged in the audit trail with timestamp, caller identity, blocked intent type, and the policy that blocked it.&lt;/p&gt;
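&lt;p&gt;An audit entry doesn't need to be elaborate to be useful. Something like the following captures "who tried what, and which policy stopped it" - a sketch with invented field names, not the AXME schema:&lt;/p&gt;

```python
from datetime import datetime, timezone

def audit_entry(caller, intent_type, policy_id, direction):
    """Build one audit-trail record for a blocked intent."""
    return {
        "at": datetime.now(timezone.utc).isoformat(),
        "caller": caller,
        "intent_type": intent_type,
        "policy_id": policy_id,
        "direction": direction,
        "outcome": "blocked",
    }
```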

&lt;h2&gt;
  
  
  Why This Matters More Than You Think
&lt;/h2&gt;

&lt;p&gt;The difference between prompt-based restrictions and gateway-enforced policies is the same difference between a "please knock" sign and a locked door.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;System prompt restrictions&lt;/th&gt;
&lt;th&gt;Gateway-enforced policies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Enforced by&lt;/td&gt;
&lt;td&gt;The LLM itself&lt;/td&gt;
&lt;td&gt;Network gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt injection&lt;/td&gt;
&lt;td&gt;Vulnerable&lt;/td&gt;
&lt;td&gt;Cannot be bypassed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change without redeploy&lt;/td&gt;
&lt;td&gt;Edit prompt, redeploy agent&lt;/td&gt;
&lt;td&gt;API call or dashboard click&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Every violation logged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-agent&lt;/td&gt;
&lt;td&gt;Configure each agent separately&lt;/td&gt;
&lt;td&gt;Centralized policy management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework dependency&lt;/td&gt;
&lt;td&gt;Framework-specific&lt;/td&gt;
&lt;td&gt;Works with any framework&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Real scenarios this prevents
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: Scope creep.&lt;/strong&gt; Your analytics agent starts as read-only. Over time, someone adds a "fix data quality issues" tool. The agent now has write access that was never intended. With an allowlist policy, the new tool's intents are blocked until explicitly added.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2: Multi-tenant isolation.&lt;/strong&gt; Customer A's agent should never send intents to Customer B's agents. Denylist the cross-tenant intent patterns. Done at the gateway, not in every agent's prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3: Gradual rollout.&lt;/strong&gt; New agent capability goes to staging first. Production policy blocks the new intent type until you're ready. Toggle it with one API call.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns Support Wildcards
&lt;/h2&gt;

&lt;p&gt;You don't need to list every version of every intent type:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Matches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intent.data.read.v1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exact match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intent.data.read.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any version of data read&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intent.data.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any data intent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intent.billing.refund.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any refund intent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A single allowlist entry like &lt;code&gt;intent.data.read.*&lt;/code&gt; covers current and future versions of that intent type.&lt;/p&gt;
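&lt;p&gt;For intuition, the glob-style semantics above can be approximated with Python's &lt;code&gt;fnmatch&lt;/code&gt; - a sketch of the matching behavior, not AXME's actual matcher:&lt;/p&gt;

```python
from fnmatch import fnmatch

def intent_allowed(intent_type, allowlist):
    """True if the intent type matches any allowlist pattern."""
    return any(fnmatch(intent_type, pattern) for pattern in allowlist)

# An allowlist with two wildcard patterns
allow = ["intent.data.read.*", "intent.data.query.*"]
```

&lt;p&gt;With this allowlist, &lt;code&gt;intent.data.read.v1&lt;/code&gt; and a future &lt;code&gt;intent.data.read.v2&lt;/code&gt; both pass, while &lt;code&gt;intent.billing.refund.v1&lt;/code&gt; is rejected.&lt;/p&gt;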

&lt;h2&gt;
  
  
  CLI and Dashboard
&lt;/h2&gt;

&lt;p&gt;For teams that prefer not to write code for policy management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set allowlist via CLI&lt;/span&gt;
axme mesh policies &lt;span class="nb"&gt;set &lt;/span&gt;analytics-agent &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--direction&lt;/span&gt; receive &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--mode&lt;/span&gt; allowlist &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--patterns&lt;/span&gt; &lt;span class="s2"&gt;"intent.data.read.*,intent.data.query.*"&lt;/span&gt;

&lt;span class="c"&gt;# View policies&lt;/span&gt;
axme mesh policies get analytics-agent

&lt;span class="c"&gt;# Remove policy (reverts to open)&lt;/span&gt;
axme mesh policies delete analytics-agent &lt;span class="nt"&gt;--direction&lt;/span&gt; receive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the visual dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; - select an agent, set policies, and see violations in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Policy configuration and violation history are managed from the same interface:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" alt="Policies" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Works With Any Framework
&lt;/h2&gt;

&lt;p&gt;AXME action policies operate at the transport layer. The agent framework, LLM provider, and programming language don't matter.&lt;/p&gt;

&lt;p&gt;LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, Google ADK, Pydantic AI, raw Python, TypeScript, Go, Java, .NET - all of them send intents through the same gateway. All of them are subject to the same policies.&lt;/p&gt;

&lt;p&gt;The agent framework handles reasoning. AXME handles permissions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Full working example with scenario, agent, and policy setup:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-policy-enforcement" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-policy-enforcement&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - durable execution and governance for AI agents. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>agents</category>
      <category>governance</category>
    </item>
    <item>
      <title>3 of Your AI Agents Crashed and You Found Out From Customers</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Thu, 02 Apr 2026 06:34:29 +0000</pubDate>
      <link>https://dev.to/george_belsky/3-of-your-ai-agents-crashed-and-you-found-out-from-customers-2heb</link>
      <guid>https://dev.to/george_belsky/3-of-your-ai-agents-crashed-and-you-found-out-from-customers-2heb</guid>
      <description>&lt;p&gt;You have 20 agents running across 4 machines. Order processing, refunds, inventory sync, email notifications. They've been running fine for weeks.&lt;/p&gt;

&lt;p&gt;Monday afternoon, the order-processor agent on machine-3 gets OOM killed. Process gone. No error. No alert. The refund-agent that depended on it starts hanging too.&lt;/p&gt;

&lt;p&gt;You find out at 5:45 PM when a customer emails: "My refund has been pending for 3 hours."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Monitoring Gap Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Traditional services have health checks. Kubernetes has liveness probes. Load balancers have health endpoints. When a web server dies, something notices within seconds.&lt;/p&gt;

&lt;p&gt;AI agents have none of this.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LangGraph:  No health monitoring. Agent runs or doesn't.
CrewAI:     No heartbeat. No fleet visibility.
AutoGen:    No built-in health checks across agents.
Raw Python: Hope someone checks the process list.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Your agent is a Python process. When it dies, it's just a missing PID. No health endpoint. No heartbeat. No dashboard showing 19/20 agents healthy.&lt;/p&gt;

&lt;p&gt;The standard answer is "use Kubernetes" or "use systemd." Those track process liveness. They don't track agent health. An agent can be alive but stuck - processing zero tasks, blocked on a downstream dependency, spinning in an infinite retry loop. Process is running. Agent is useless.&lt;/p&gt;
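&lt;p&gt;The gap is easy to see in code. A liveness probe only proves the PID exists; a minimal stuck detector - illustrative names, not any framework's API - also asks when work last completed:&lt;/p&gt;

```python
import time

class AgentHealth:
    """Tracks task completion so 'alive but useless' is detectable."""

    def __init__(self, stuck_after_seconds=120):
        self.stuck_after = stuck_after_seconds
        self.last_task_done = time.time()

    def task_completed(self):
        self.last_task_done = time.time()

    def is_stuck(self, now=None):
        # The process may be running, but no task has finished recently.
        if now is None:
            now = time.time()
        return now - self.last_task_done > self.stuck_after
```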

&lt;h2&gt;
  
  
  What You End Up Building
&lt;/h2&gt;

&lt;p&gt;Every team that runs agents at scale builds the same thing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# heartbeat_sender.py - added to every agent
import redis
import time
import threading

r = redis.Redis()

def heartbeat_loop():
    while True:
        r.set(f"heartbeat:{AGENT_ID}", time.time())
        time.sleep(30)

threading.Thread(target=heartbeat_loop, daemon=True).start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plus the checker:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# health_checker.py - separate process
def check_agents():
    agents = r.smembers("registered_agents")
    for agent_id in agents:
        last_ping = r.get(f"heartbeat:{agent_id}")
        if last_ping is None:
            continue
        elapsed = time.time() - float(last_ping)
        if elapsed &amp;gt; 90:
            send_pagerduty_alert(f"{agent_id} unreachable")
        elif elapsed &amp;gt; 60:
            send_slack_alert(f"{agent_id} degraded")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Plus Redis infrastructure. Plus Slack webhooks. Plus PagerDuty integration. Plus a dashboard. Plus agent registration. Plus logic to distinguish agents that were intentionally stopped from ones that crashed.&lt;/p&gt;

&lt;p&gt;Every team builds this. Every team maintains it. Every team's version has slightly different bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Should Look Like
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from axme import AxmeClient, AxmeClientConfig
import os

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

# Start heartbeat (background thread, every 30s)
client.mesh.start_heartbeat(interval_seconds=30)

# Agent does its normal work
while True:
    task = get_next_task()
    result = process(task)
    client.mesh.report_metric(success=True, latency_ms=result.duration_ms, cost_usd=result.cost)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Three lines of setup. The platform handles heartbeat tracking, status transitions, alerting, and the dashboard.&lt;/p&gt;

&lt;p&gt;From any monitoring service:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;result = client.mesh.list_agents()
for agent in result["agents"]:
    print(f"{agent['display_name']}: {agent['health_status']} (last: {agent['last_heartbeat_at']})")

# order-processor:  healthy      (last: 2026-04-01T14:30:02+00:00)
# refund-agent:     healthy      (last: 2026-04-01T14:30:05+00:00)
# inventory-sync:   degraded     (last: 2026-04-01T14:29:32+00:00)
# email-sender:     unreachable  (last: 2026-04-01T14:27:00+00:00)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Four Health States
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;What It Means&lt;/th&gt;
&lt;th&gt;How It's Triggered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HEALTHY&lt;/td&gt;
&lt;td&gt;Running, reporting normally&lt;/td&gt;
&lt;td&gt;Heartbeat received on time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DEGRADED&lt;/td&gt;
&lt;td&gt;Running, but heartbeat is late&lt;/td&gt;
&lt;td&gt;No heartbeat for 90-300 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UNREACHABLE&lt;/td&gt;
&lt;td&gt;Stopped sending heartbeats&lt;/td&gt;
&lt;td&gt;No heartbeat for 300+ seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KILLED&lt;/td&gt;
&lt;td&gt;Intentionally terminated&lt;/td&gt;
&lt;td&gt;Explicit shutdown or kill command&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key distinction: DEGRADED vs UNREACHABLE.&lt;/p&gt;

&lt;p&gt;DEGRADED means the heartbeat is late (90-300 seconds). The agent might be stuck or overloaded.&lt;/p&gt;

&lt;p&gt;UNREACHABLE means no heartbeat for over 5 minutes. The agent is likely down.&lt;/p&gt;

&lt;p&gt;This distinction matters because the response is different. Degraded - investigate. Unreachable - restart immediately.&lt;/p&gt;
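&lt;p&gt;The transition logic implied by the table is simple enough to state directly - a sketch using the thresholds above, not AXME's implementation:&lt;/p&gt;

```python
def health_status(seconds_since_heartbeat, killed=False):
    """Map heartbeat age to one of the four states in the table."""
    if killed:
        return "KILLED"       # explicit shutdown or kill command
    if seconds_since_heartbeat >= 300:
        return "UNREACHABLE"  # likely down: restart immediately
    if seconds_since_heartbeat >= 90:
        return "DEGRADED"     # heartbeat is late: investigate
    return "HEALTHY"
```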
&lt;h2&gt;
  
  
  Timeline: Monday With vs Without
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without health monitoring:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;14:30  order-processor OOM killed
14:30  No alert
15:00  refund-agent hangs (downstream dep gone)
15:00  No alert
17:45  Customer: "My refund has been pending for 3 hours"
17:50  Engineer SSHs into machine-3
17:55  "Oh. It's been dead since 2:30."
18:10  Restart. Begin processing backlog.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;3 hours 15 minutes of silent failure. Customer-reported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With AXME mesh:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;14:30  order-processor misses heartbeat
14:31  Status: HEALTHY -&amp;gt; UNREACHABLE
14:31  Alert: "order-processor on machine-3 unreachable"
14:32  Engineer sees alert, checks dashboard
14:33  refund-agent status: DEGRADED (downstream timeout)
14:35  Restart order-processor. Both agents recover.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;5 minutes. No customer impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern: Observability for Agents
&lt;/h2&gt;

&lt;p&gt;Web services have been doing this for 20 years. Health checks, readiness probes, metrics endpoints, dashboards. The tooling is mature.&lt;/p&gt;

&lt;p&gt;AI agents are running the same way we ran web services in 2005 - deploy it, hope it works, find out when users complain.&lt;/p&gt;

&lt;p&gt;The monitoring patterns are the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeat&lt;/strong&gt; - periodic "I'm alive" signal&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status reporting&lt;/strong&gt; - "I'm alive AND here's how I'm doing"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fleet view&lt;/strong&gt; - see all agents in one place&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerting&lt;/strong&gt; - notify when something changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;History&lt;/strong&gt; - when did it go down? How long was it out?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The difference is where these run. Web services have infrastructure that assumes health checks exist. Agent frameworks assume agents are ephemeral scripts that run and exit. Long-running agents fall through the gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond Liveness: Application-Level Metrics
&lt;/h2&gt;

&lt;p&gt;Process monitoring tells you the PID exists. Application-level metrics tell you the agent is actually doing useful work.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Report metrics with each processed task
client.mesh.report_metric(success=True, latency_ms=230, cost_usd=0.03)

# Failed task
client.mesh.report_metric(success=False, latency_ms=4500, cost_usd=0.01)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Metrics are buffered and sent with the next heartbeat. The dashboard shows intents processed, success rate, latency, and cost per agent.&lt;/p&gt;
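&lt;p&gt;The buffering pattern is worth spelling out - a minimal sketch of buffer-then-flush, with illustrative field names rather than the SDK's:&lt;/p&gt;

```python
class MetricBuffer:
    """Collects metrics locally; flushed when the next heartbeat is sent."""

    def __init__(self):
        self.pending = []

    def report(self, success, latency_ms, cost_usd):
        # Cheap local append; no network call per task.
        self.pending.append(
            {"success": success, "latency_ms": latency_ms, "cost_usd": cost_usd}
        )

    def flush(self):
        # Piggybacks on the heartbeat: one request carries all metrics.
        batch, self.pending = self.pending, []
        return batch
```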

&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;The AXME Mesh Dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; shows your entire fleet health in real time - status, last heartbeat, cost, and alerts in one view:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No log diving. No Grafana setup. No custom alerting pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example - register an agent, start heartbeat, kill it, watch the status change to UNREACHABLE:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-health-monitoring" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-health-monitoring&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - health monitoring, heartbeat, and fleet visibility for AI agents. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Your AI Agent Is Running Wild and You Can't Stop It</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Thu, 02 Apr 2026 06:33:42 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-is-running-wild-and-you-cant-stop-it-gkm</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-is-running-wild-and-you-cant-stop-it-gkm</guid>
      <description>&lt;p&gt;It's 9 AM. Your email campaign agent started 10 minutes ago. It's processing 50,000 customer records, sending personalized outreach emails in batches of 100.&lt;/p&gt;

&lt;p&gt;At 9:05 you notice the email template has a broken unsubscribe link. Every email going out violates CAN-SPAM.&lt;/p&gt;

&lt;p&gt;The agent has already sent 3,000 emails. It's running on 3 Cloud Run instances across two regions. It's sending 100 emails every 2 seconds.&lt;/p&gt;

&lt;p&gt;You need to stop it. Now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Ctrl+C Doesn't Work in Production
&lt;/h2&gt;

&lt;p&gt;If your agent runs as a local script, sure - Ctrl+C. But production agents don't work that way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud functions and containers.&lt;/strong&gt; Your agent is a Cloud Run service or Lambda function. There's no terminal to Ctrl+C. You can delete the service, but shutdown grace periods and in-flight requests mean instances keep running for 30-60 seconds. That's another 1,500 emails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multiple instances.&lt;/strong&gt; Auto-scaling gave you 3 replicas. You kill one, the other two keep going. You need to find and kill each one individually, across regions, while the clock ticks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No state preservation.&lt;/strong&gt; When you force-kill a process, you lose all state. Which emails were sent? Which batch was in progress? When you fix the template and restart, do you send from the beginning (duplicating 3,000 emails) or guess where to pick up?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No audit trail.&lt;/strong&gt; After the incident, your manager asks: "When exactly did we stop? How many went out? Who stopped it?" You have CloudWatch logs, maybe. Good luck piecing together the timeline.&lt;/p&gt;

&lt;p&gt;This isn't hypothetical. Every team running AI agents in production has some version of this story. An agent that makes API calls, processes data, or takes actions autonomously - and at some point does the wrong thing at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Infrastructure You'd Have to Build
&lt;/h2&gt;

&lt;p&gt;To build a proper kill switch yourself, you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 1. Shared state store (Redis/DynamoDB)
&lt;/span&gt;&lt;span class="n"&gt;kill_flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Redis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis-cluster.internal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Agent checks flag before every action
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;kill_flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kill:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;save_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;progress&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;AgentKilledException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Kill signal received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# ... send emails
&lt;/span&gt;
&lt;span class="c1"&gt;# 3. API endpoint to set the flag
&lt;/span&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agents/{agent_id}/kill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;kill_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kill_flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kill:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# But what about agents that check infrequently?
&lt;/span&gt;    &lt;span class="c1"&gt;# What about agents that don't check at all?
&lt;/span&gt;    &lt;span class="c1"&gt;# What about actions already in flight?
&lt;/span&gt;
&lt;span class="c1"&gt;# 4. Resume logic
&lt;/span&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/agents/{agent_id}/resume&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;resume_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;kill_flags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kill:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;checkpoint&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_checkpoint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Restart from checkpoint... somehow
&lt;/span&gt;
&lt;span class="c1"&gt;# 5. Audit log
# 6. Dashboard
# 7. Multi-region coordination
# 8. Monitoring for agents that ignore the flag
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a distributed coordination system. Redis cluster, custom API, checkpoint management, audit logging, monitoring. You wanted a kill switch; you got a platform project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Kill Switch Should Actually Be
&lt;/h2&gt;

&lt;p&gt;One API call. Every instance stops. Full audit trail. Resume from checkpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;axme&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AxmeClientConfig&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AxmeClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;AxmeClientConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AXME_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

&lt;span class="c1"&gt;# Kill - all instances, all regions, under 1 second
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mesh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;kill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;addr_abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# address_id from list_agents()
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's the operator side. On the agent side, you add heartbeat calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Start background heartbeat (every 30s)
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mesh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_heartbeat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;email_batches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;send_emails&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mesh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;report_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you call &lt;code&gt;mesh.kill(address_id)&lt;/code&gt;, the gateway blocks all intents to and from that agent. The heartbeat response returns &lt;code&gt;health_status: "killed"&lt;/code&gt;. The agent can check this and stop cleanly.&lt;/p&gt;
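&lt;p&gt;The agent-side pattern looks like this - a self-contained sketch where &lt;code&gt;heartbeat()&lt;/code&gt; stands in for the gateway's heartbeat response, stubbed here with a local flag:&lt;/p&gt;

```python
KILL_FLAG = {"killed": False}

def heartbeat():
    # Stub: in production this is the gateway's heartbeat response.
    return {"health_status": "killed" if KILL_FLAG["killed"] else "healthy"}

def run_batches(batches):
    processed = []
    for batch in batches:
        if heartbeat()["health_status"] == "killed":
            break  # stop cleanly; the gateway is already blocking intents
        processed.append(batch)
    return processed
```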

&lt;h2&gt;
  
  
  Gateway-Level Enforcement
&lt;/h2&gt;

&lt;p&gt;Here's what makes this different from a "please stop" flag in Redis: the kill switch is enforced at the gateway level.&lt;/p&gt;

&lt;p&gt;When an agent is killed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Heartbeat responses return &lt;code&gt;health_status: "killed"&lt;/code&gt;&lt;/strong&gt; - the agent sees it's been killed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All new intents to this agent are rejected (403)&lt;/strong&gt; - nothing gets delivered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All outbound intents from this agent are blocked&lt;/strong&gt; - it can't take actions through AXME&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even if the agent code ignores the heartbeat response, its intents are blocked at the gateway. The agent can't send or receive anything through AXME.&lt;/p&gt;

&lt;p&gt;This matters because the scariest scenario isn't an agent that checks the kill flag and stops politely. It's an agent with a bug that keeps running regardless. Gateway enforcement handles that case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resume from Checkpoint
&lt;/h2&gt;

&lt;p&gt;After you fix the email template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Resume - agent starts receiving intents again
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mesh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;addr_abc123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent's health_status goes back to "unknown" and becomes "healthy" on the next heartbeat. Intents start flowing again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Dashboard
&lt;/h2&gt;

&lt;p&gt;The AXME Mesh Dashboard (&lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt;) gives you a real-time view of all your agents:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Live health status for every agent (healthy, degraded, unreachable, killed)&lt;/li&gt;
&lt;li&gt;One-click kill and resume buttons&lt;/li&gt;
&lt;li&gt;Cost tracking per agent (API calls, LLM tokens, dollars)&lt;/li&gt;
&lt;li&gt;Full audit log - every kill, resume, and policy change with who did it and when&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When something goes wrong at 9 AM, you don't need to SSH into a server, find a process ID, or write a Redis command. You open the dashboard, find the agent, and click kill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Doing It Yourself vs. Using AXME
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you need&lt;/th&gt;
&lt;th&gt;Build yourself&lt;/th&gt;
&lt;th&gt;AXME&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kill signal delivery&lt;/td&gt;
&lt;td&gt;Redis cluster + polling&lt;/td&gt;
&lt;td&gt;One API call, gateway-enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-instance coordination&lt;/td&gt;
&lt;td&gt;Service discovery + broadcast&lt;/td&gt;
&lt;td&gt;Automatic via mesh&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;State preservation&lt;/td&gt;
&lt;td&gt;Custom checkpoint system&lt;/td&gt;
&lt;td&gt;Gateway tracks last heartbeat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resume&lt;/td&gt;
&lt;td&gt;Custom restart logic&lt;/td&gt;
&lt;td&gt;&lt;code&gt;mesh.resume(address_id)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;Custom logging + storage&lt;/td&gt;
&lt;td&gt;Built-in event log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;Build a UI&lt;/td&gt;
&lt;td&gt;&lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enforcement for buggy agents&lt;/td&gt;
&lt;td&gt;Hope they check the flag&lt;/td&gt;
&lt;td&gt;Gateway blocks all outbound&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;2-4 weeks&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pip install axme&lt;/code&gt; + 5 lines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Get Started
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;axme
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Working example with a simulated email campaign agent, kill switch, and resume:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-kill-switch" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-kill-switch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - agent mesh with kill switch, heartbeat monitoring, and durable lifecycle. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>agents</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Your AI Agent Spent $500 Overnight and Nobody Noticed</title>
      <dc:creator>George Belsky</dc:creator>
      <pubDate>Thu, 02 Apr 2026 06:33:25 +0000</pubDate>
      <link>https://dev.to/george_belsky/your-ai-agent-spent-500-overnight-and-nobody-noticed-8ci</link>
      <guid>https://dev.to/george_belsky/your-ai-agent-spent-500-overnight-and-nobody-noticed-8ci</guid>
      <description>&lt;p&gt;Friday 5 PM. You deploy a research agent that processes customer tickets. It calls GPT-4 for each one. Expected load: 200 tickets a day, about $8 in API costs.&lt;/p&gt;

&lt;p&gt;Friday 11 PM. A bug in ticket deduplication. The agent reprocesses the same tickets in a loop. Each iteration makes 4 LLM calls at $0.03 each. The loop runs 50 times per hour.&lt;/p&gt;

&lt;p&gt;Saturday 3 AM. The loop has already made 800 redundant LLM calls - about $24 - and it is still running. Nobody is watching.&lt;/p&gt;

&lt;p&gt;Monday 9 AM. The loop has run all weekend: nearly 12,000 wasted calls, roughly $350 on top of normal usage. Total damage: $487 - just under the $500 billing alert you set months ago, which is why it never fired. No logs showing which agent caused it, which task triggered the loop, or when it started.&lt;/p&gt;

&lt;p&gt;This is not hypothetical. Every team running AI agents in production has a version of this story.&lt;/p&gt;
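&lt;p&gt;The failure arithmetic is mundane - a back-of-the-envelope sketch using the numbers above:&lt;/p&gt;

```python
# Back-of-the-envelope: a quiet loop at a modest rate still adds up.
# Figures from the story above: 4 calls per iteration, ~$0.03 per call,
# 50 iterations per hour, Friday 11 PM to Monday 9 AM.
calls_per_iteration = 4
cost_per_call_usd = 0.03
iterations_per_hour = 50
hours = 58  # Friday 11 PM to Monday 9 AM

total_calls = calls_per_iteration * iterations_per_hour * hours
total_cost_usd = total_calls * cost_per_call_usd

print(total_calls)            # 11600
print(round(total_cost_usd))  # 348
```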

&lt;h2&gt;
  
  
  Why Standard Monitoring Doesn't Help
&lt;/h2&gt;

&lt;p&gt;OpenAI gives you total organization spend. Not per-agent. Not per-task. Not in real time.&lt;/p&gt;

&lt;p&gt;If you have 5 agents calling GPT-4, and one goes haywire, your OpenAI dashboard shows a line going up. Which agent? You don't know. Which task caused the spike? You don't know. When did it start? You can guess from the slope of the graph.&lt;/p&gt;

&lt;p&gt;Cloud monitoring (Datadog, Grafana) tracks CPU and memory. It doesn't know about LLM tokens. You could instrument it yourself - custom metrics, Prometheus counters, StatsD gauges - but now you're building a cost monitoring system instead of building your product.&lt;/p&gt;

&lt;p&gt;Billing alerts are too late and too coarse. A $500 alert tells you the money is already gone. A per-API-key alert doesn't map to individual agents.&lt;/p&gt;

&lt;p&gt;What you actually need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost per agent, in real time&lt;/li&gt;
&lt;li&gt;Cost per task (not just per agent)&lt;/li&gt;
&lt;li&gt;Budget limits that actually stop the agent&lt;/li&gt;
&lt;li&gt;Alerts before the damage is done&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tracking Cost Through the Agent Heartbeat
&lt;/h2&gt;

&lt;p&gt;AXME agents send heartbeats every 30 seconds - standard health reporting. The insight is that cost is just another metric in that heartbeat.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

def call_llm(prompt: str) -&amp;gt; str:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )

    # Calculate cost from token usage
    # (GPT-4 pricing: $0.03 per 1K prompt tokens, $0.06 per 1K completion tokens)
    tokens_in = response.usage.prompt_tokens
    tokens_out = response.usage.completion_tokens
    cost_usd = (tokens_in * 0.03 + tokens_out * 0.06) / 1000

    # Report cost alongside the regular heartbeat
    client.mesh.report_metric(cost_usd=cost_usd)

    return response.choices[0].message.content
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Two lines of actual logic: calculate the cost, report it. The gateway accumulates it per agent, per intent, per time window. No Prometheus setup. No custom Datadog metrics. No StatsD.&lt;/p&gt;

&lt;h2&gt;
  
  
  Budget Limits That Actually Stop Agents
&lt;/h2&gt;

&lt;p&gt;Reporting cost is useful. Enforcing limits is essential.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Set cost policy via API
PUT /v1/mesh/agents/{address_id}/policies/cost
{
    "max_intents_per_day": 500,
    "max_cost_per_day_usd": 50.00,
    "max_intents_per_hour": 100,
    "action_on_breach": "block"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When the research agent hits $50 for the day, the gateway blocks new intents with HTTP 429. Not after $500. Not after the invoice. At $50, in real time.&lt;/p&gt;
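&lt;p&gt;On the agent side, the behavior to plan for is that 429. A sketch of the enforcement logic the policy implies - field names mirror the policy JSON above, but the code itself is illustrative, not AXME internals:&lt;/p&gt;

```python
# Sketch of the enforcement a cost policy implies. Field names mirror
# the policy JSON above; the classes and function are illustrative.
from dataclasses import dataclass

@dataclass
class CostPolicy:
    max_cost_per_day_usd: float = 50.00
    max_intents_per_day: int = 500
    action_on_breach: str = "block"

def admit(policy: CostPolicy, spent_today_usd: float, intents_today: int) -> int:
    """Return the HTTP status a gateway applying this policy would answer with."""
    breached = (
        spent_today_usd >= policy.max_cost_per_day_usd
        or intents_today >= policy.max_intents_per_day
    )
    if breached and policy.action_on_breach == "block":
        return 429  # Too Many Requests - new intents are refused
    return 200

policy = CostPolicy()
print(admit(policy, spent_today_usd=12.40, intents_today=80))  # 200
print(admit(policy, spent_today_usd=50.00, intents_today=80))  # 429
```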

&lt;p&gt;You can also set this from the dashboard at &lt;a href="https://mesh.axme.ai" rel="noopener noreferrer"&gt;mesh.axme.ai&lt;/a&gt; - select the agent, set cost limits, save.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Dashboard Shows
&lt;/h2&gt;

&lt;p&gt;The AXME mesh dashboard shows cost alongside agent status:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4fytpyxkis9z8532uc7.png" alt="Agent Mesh Dashboard" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Day, week, month views. Agents that hit their daily limit are blocked automatically. No surprises.&lt;/p&gt;

&lt;p&gt;Cost policies are managed visually alongside agent health:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkyhfjdk1c4ecv3q9z6p7.png" alt="Policies" width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Alternative
&lt;/h2&gt;

&lt;p&gt;Without this, you build it yourself:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instrument every LLM call with token counting&lt;/li&gt;
&lt;li&gt;Send custom metrics to Prometheus/Datadog/CloudWatch&lt;/li&gt;
&lt;li&gt;Build dashboards per agent (Grafana? Retool? Custom?)&lt;/li&gt;
&lt;li&gt;Write alerting rules with the right thresholds&lt;/li&gt;
&lt;li&gt;Build the "pause agent" mechanism yourself&lt;/li&gt;
&lt;li&gt;Map OpenAI costs to individual agents in your billing system&lt;/li&gt;
&lt;li&gt;Maintain all of this as models and pricing change&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is a real project. Weeks of work. And it is not your product.&lt;/p&gt;

&lt;p&gt;Or: report &lt;code&gt;cost_usd&lt;/code&gt; in the heartbeat your agent already sends. Set a policy. Done.&lt;/p&gt;
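&lt;p&gt;If your agents call more than one model, the only piece you own is computing &lt;code&gt;cost_usd&lt;/code&gt; before reporting it. One way to keep that in one place - the GPT-4 rates match the snippet above; the helper itself and any other entries are yours to fill in from your provider's pricing page:&lt;/p&gt;

```python
# Per-model pricing table for computing cost_usd before reporting it.
# The gpt-4 rates match the article; the table and helper are an
# illustrative sketch - extend with whatever your provider charges.
PRICING_PER_1K = {
    # model: (usd per 1K prompt tokens, usd per 1K completion tokens)
    "gpt-4": (0.03, 0.06),
}

def llm_cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    rate_in, rate_out = PRICING_PER_1K[model]
    return (tokens_in * rate_in + tokens_out * rate_out) / 1000

print(llm_cost_usd("gpt-4", tokens_in=1000, tokens_out=500))  # 0.06
```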

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;Working example with cost reporting, budget limits, and multi-model tracking:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/AxmeAI/ai-agent-cost-monitoring" rel="noopener noreferrer"&gt;github.com/AxmeAI/ai-agent-cost-monitoring&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built with &lt;a href="https://github.com/AxmeAI/axme" rel="noopener noreferrer"&gt;AXME&lt;/a&gt; - agent coordination with durable lifecycle and cost controls. Alpha - feedback welcome.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>llm</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
