You have 20 agents running across 4 machines. Order processing, refunds, inventory sync, email notifications. They've been running fine for weeks.
Monday afternoon, the order-processor agent on machine-3 gets OOM killed. Process gone. No error. No alert. The refund-agent that depended on it starts hanging too.
You find out at 5:45 PM when a customer emails: "My refund has been pending for 3 hours."
The Monitoring Gap Nobody Talks About
Traditional services have health checks. Kubernetes has liveness probes. Load balancers have health endpoints. When a web server dies, something notices within seconds.
AI agents have none of this.
- **LangGraph**: no health monitoring. The agent runs or it doesn't.
- **CrewAI**: no heartbeat, no fleet visibility.
- **AutoGen**: no built-in health checks across agents.
- **Raw Python**: hope someone checks the process list.
Your agent is a Python process. When it dies, it's just a missing PID. No health endpoint. No heartbeat. No dashboard showing 19/20 agents healthy.
The standard answer is "use Kubernetes" or "use systemd." Those track process liveness. They don't track agent health. An agent can be alive but stuck - processing zero tasks, blocked on a downstream dependency, spinning in an infinite retry loop. Process is running. Agent is useless.
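To make "alive but stuck" concrete: the fix is to have the heartbeat carry a progress signal, not just a timestamp. A minimal sketch (the function name and thresholds are illustrative, not from any framework) that classifies an agent from its last heartbeat and the last time its task counter actually changed:

```python
def agent_state(now, last_heartbeat, last_progress,
                heartbeat_timeout=90, progress_timeout=600):
    """Classify an agent from its last heartbeat time and the last time
    its tasks-processed counter changed (all values in epoch seconds)."""
    if now - last_heartbeat > heartbeat_timeout:
        return "down"      # no heartbeat at all: process likely dead
    if now - last_progress > progress_timeout:
        return "stuck"     # heartbeating, but completing zero tasks
    return "working"
```

A process supervisor only ever sees the first case; the second is the one that quietly burns hours.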
What You End Up Building
Every team that runs agents at scale builds the same thing:
```python
# heartbeat_sender.py - added to every agent
import os
import threading
import time

import redis

AGENT_ID = os.environ["AGENT_ID"]  # e.g. "order-processor@machine-3"
r = redis.Redis()
r.sadd("registered_agents", AGENT_ID)  # so the checker knows this agent exists

def heartbeat_loop():
    while True:
        r.set(f"heartbeat:{AGENT_ID}", time.time())
        time.sleep(30)

threading.Thread(target=heartbeat_loop, daemon=True).start()
```
Plus the checker:
```python
# health_checker.py - separate process
import time

import redis

r = redis.Redis()

def check_agents():
    agents = r.smembers("registered_agents")
    for agent_id in agents:
        agent_id = agent_id.decode()  # redis-py returns bytes
        last_ping = r.get(f"heartbeat:{agent_id}")
        if last_ping is None:
            continue  # registered but never pinged - silently skipped
        elapsed = time.time() - float(last_ping)
        if elapsed > 90:
            send_pagerduty_alert(f"{agent_id} unreachable")
        elif elapsed > 60:
            send_slack_alert(f"{agent_id} degraded")
```
Plus Redis infrastructure. Plus Slack webhooks. Plus PagerDuty integration. Plus a dashboard. Plus agent registration. Plus cleanup for agents that were intentionally stopped vs ones that crashed.
Every team builds this. Every team maintains it. Every team's version has slightly different bugs.
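The "intentionally stopped vs crashed" cleanup is the part most homegrown versions get wrong. One common approach is a tombstone key: a graceful shutdown writes it before exiting, and the checker treats a missing heartbeat without a tombstone as a crash. A sketch using a plain dict as a stand-in for Redis (the key names and function names here are illustrative):

```python
store = {}  # stand-in for Redis

def graceful_shutdown(agent_id):
    store[f"stopped:{agent_id}"] = True      # tombstone: we meant to stop
    store.pop(f"heartbeat:{agent_id}", None)

def classify_missing(agent_id):
    if store.get(f"heartbeat:{agent_id}") is not None:
        return "alive"
    if store.get(f"stopped:{agent_id}"):
        return "killed"    # intentional - no alert
    return "crashed"       # missing with no tombstone - page someone
```

Without the tombstone, every planned restart pages the on-call engineer, and the alerts get muted, which defeats the whole system.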
What This Should Look Like
```python
from axme import AxmeClient, AxmeClientConfig
import os

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

# Start heartbeat (background thread, every 30s)
client.mesh.start_heartbeat(interval_seconds=30)

# Agent does its normal work
while True:
    task = get_next_task()
    result = process(task)
    client.mesh.report_metric(success=True, latency_ms=result.duration_ms, cost_usd=result.cost)
```
Three lines of setup. The platform handles heartbeat tracking, status transitions, alerting, and the dashboard.
From any monitoring service:
```python
result = client.mesh.list_agents()
for agent in result["agents"]:
    print(f"{agent['display_name']}: {agent['health_status']} (last: {agent['last_heartbeat_at']})")

# order-processor: healthy (last: 2026-04-01T14:30:02+00:00)
# refund-agent: healthy (last: 2026-04-01T14:30:05+00:00)
# inventory-sync: degraded (last: 2026-04-01T14:29:32+00:00)
# email-sender: unreachable (last: 2026-04-01T14:27:00+00:00)
```
Four Health States
| Status | What It Means | How It's Triggered |
|---|---|---|
| HEALTHY | Running, reporting normally | Heartbeat received on time |
| DEGRADED | Running, but heartbeat is late | No heartbeat for 90-300 seconds |
| UNREACHABLE | Stopped sending heartbeats | No heartbeat for 300+ seconds |
| KILLED | Intentionally terminated | Explicit shutdown or kill command |
The key distinction: DEGRADED vs UNREACHABLE.
DEGRADED means the heartbeat is late (90-300 seconds). The agent might be stuck or overloaded.
UNREACHABLE means no heartbeat for over 5 minutes. The agent is likely down.
This distinction matters because the response is different. Degraded - investigate. Unreachable - restart immediately.
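The transitions in the table reduce to a simple threshold function. A sketch using the 90-second and 300-second windows from the table (the function name is illustrative):

```python
def health_status(seconds_since_heartbeat, killed=False):
    """Map heartbeat age to the four states:
    <90s HEALTHY, 90-300s DEGRADED, 300s+ UNREACHABLE,
    plus KILLED for an explicit shutdown."""
    if killed:
        return "KILLED"
    if seconds_since_heartbeat < 90:
        return "HEALTHY"
    if seconds_since_heartbeat < 300:
        return "DEGRADED"
    return "UNREACHABLE"
```

The two thresholds are the whole policy: tighten them and you page people for GC pauses; loosen them and a dead agent sits unnoticed.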
Timeline: Monday With vs Without
Without health monitoring:
14:30 order-processor OOM killed
14:30 No alert
15:00 refund-agent hangs (downstream dep gone)
15:00 No alert
17:45 Customer: "My refund has been pending for 3 hours"
17:50 Engineer SSHs into machine-3
17:55 "Oh. It's been dead since 2:30."
18:10 Restart. Begin processing backlog.
3 hours 15 minutes of silent failure. Customer-reported.
With AXME mesh:
14:30 order-processor OOM killed; heartbeats stop
14:32 Status: HEALTHY -> DEGRADED (heartbeat 90 seconds late)
14:32 Slack alert: "order-processor degraded"
14:35 Status: DEGRADED -> UNREACHABLE (no heartbeat for 5 minutes)
14:35 Page: "order-processor on machine-3 unreachable"
14:36 Engineer checks dashboard; refund-agent: DEGRADED (downstream timeout)
14:38 Restart order-processor. Both agents recover.
8 minutes. No customer impact.
The Pattern: Observability for Agents
Web services have been doing this for 20 years. Health checks, readiness probes, metrics endpoints, dashboards. The tooling is mature.
AI agents are running the same way we ran web services in 2005 - deploy it, hope it works, find out when users complain.
The monitoring patterns are the same:
- Heartbeat - periodic "I'm alive" signal
- Status reporting - "I'm alive AND here's how I'm doing"
- Fleet view - see all agents in one place
- Alerting - notify when something changes
- History - when did it go down? How long was it out?
The difference is where these run. Web services have infrastructure that assumes health checks exist. Agent frameworks assume agents are ephemeral scripts that run and exit. Long-running agents fall through the gap.
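The history pattern in particular is easy to underestimate. Given a log of status transitions, "when did it go down and for how long?" is a small fold over the log. A sketch (function and field names are illustrative; timestamps are epoch seconds):

```python
def outages(transitions):
    """Given [(timestamp, status), ...] in time order, return
    (went_down_at, duration_seconds) for each completed outage."""
    result, down_since = [], None
    for ts, status in transitions:
        if status == "UNREACHABLE" and down_since is None:
            down_since = ts
        elif status == "HEALTHY" and down_since is not None:
            result.append((down_since, ts - down_since))
            down_since = None
    return result
```

This is the query that answers "has this agent been flapping all week?" rather than just "is it up right now?".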
Beyond Liveness: Application-Level Metrics
Process monitoring tells you the PID exists. Application-level metrics tell you the agent is actually doing useful work.
```python
# Report metrics with each processed task
client.mesh.report_metric(success=True, latency_ms=230, cost_usd=0.03)

# Failed task
client.mesh.report_metric(success=False, latency_ms=4500, cost_usd=0.01)
```
Metrics are buffered and sent with the next heartbeat. The dashboard shows intents processed, success rate, latency, and cost per agent.
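If you were building the buffering yourself, the shape is roughly this: samples accumulate in memory and are collapsed into one summary when the heartbeat flushes. A sketch (the class and its field names are illustrative, not the AXME internals):

```python
class MetricBuffer:
    """Accumulate per-task samples; aggregate on each heartbeat tick."""

    def __init__(self):
        self.samples = []  # (success, latency_ms, cost_usd) tuples

    def report(self, success, latency_ms, cost_usd):
        self.samples.append((success, latency_ms, cost_usd))

    def flush(self):
        """Return one summary dict and clear the buffer; None if empty."""
        if not self.samples:
            return None
        n = len(self.samples)
        summary = {
            "count": n,
            "success_rate": sum(1 for s, _, _ in self.samples if s) / n,
            "avg_latency_ms": sum(l for _, l, _ in self.samples) / n,
            "total_cost_usd": sum(c for _, _, c in self.samples),
        }
        self.samples.clear()
        return summary
```

Batching with the heartbeat means one network call per interval instead of one per task, which matters when an agent processes thousands of tasks an hour.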
The Dashboard
The AXME Mesh Dashboard at mesh.axme.ai shows the health of your entire fleet in real time - status, last heartbeat, cost, and alerts in one view.
No log diving. No Grafana setup. No custom alerting pipelines.
Try It
Working example - register an agent, start heartbeat, kill it, watch the status change to UNREACHABLE:
github.com/AxmeAI/ai-agent-health-monitoring
Built with AXME - health monitoring, heartbeat, and fleet visibility for AI agents. Alpha - feedback welcome.
