DEV Community

George Belsky

3 of Your AI Agents Crashed and You Found Out From Customers

You have 20 agents running across 4 machines. Order processing, refunds, inventory sync, email notifications. They've been running fine for weeks.

Monday afternoon, the order-processor agent on machine-3 gets OOM killed. Process gone. No error. No alert. The refund-agent that depended on it starts hanging too.

You find out at 5:45 PM when a customer emails: "My refund has been pending for 3 hours."

The Monitoring Gap Nobody Talks About

Traditional services have health checks. Kubernetes has liveness probes. Load balancers have health endpoints. When a web server dies, something notices within seconds.

AI agents have none of this.

LangGraph:  No health monitoring. Agent runs or doesn't.
CrewAI:     No heartbeat. No fleet visibility.
AutoGen:    No built-in health checks across agents.
Raw Python: Hope someone checks the process list.

Your agent is a Python process. When it dies, it's just a missing PID. No health endpoint. No heartbeat. No dashboard showing 19/20 agents healthy.

The standard answer is "use Kubernetes" or "use systemd." Those track process liveness. They don't track agent health. An agent can be alive but stuck - processing zero tasks, blocked on a downstream dependency, spinning in an infinite retry loop. Process is running. Agent is useless.
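The gap is easy to demonstrate. Here is a minimal sketch (names are illustrative, not any framework's API) of the two different questions being asked:

```python
import os
import time

def process_is_alive(pid: int) -> bool:
    """The question systemd/Kubernetes answers: does the PID exist?"""
    try:
        os.kill(pid, 0)  # signal 0 probes the process without touching it
        return True
    except OSError:
        return False

def agent_is_healthy(last_task_completed_at: float, max_idle_s: float = 300.0) -> bool:
    """The question that actually matters: did the agent finish work recently?
    An agent spinning in a retry loop passes the first check and fails this one."""
    return time.time() - last_task_completed_at < max_idle_s
```

Both checks can disagree, and that disagreement is exactly the stuck-but-alive case.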

What You End Up Building

Every team that runs agents at scale builds the same thing:

# heartbeat_sender.py - added to every agent
import os
import threading
import time

import redis

AGENT_ID = os.environ["AGENT_ID"]  # e.g. "order-processor"
r = redis.Redis()

def heartbeat_loop():
    while True:
        r.set(f"heartbeat:{AGENT_ID}", time.time())
        time.sleep(30)

threading.Thread(target=heartbeat_loop, daemon=True).start()

Plus the checker:

# health_checker.py - separate process
import time

import redis

r = redis.Redis()

def check_agents():
    agents = r.smembers("registered_agents")
    for raw_id in agents:
        agent_id = raw_id.decode()  # redis-py returns bytes by default
        last_ping = r.get(f"heartbeat:{agent_id}")
        if last_ping is None:
            continue  # registered but never pinged
        elapsed = time.time() - float(last_ping)
        if elapsed > 90:
            send_pagerduty_alert(f"{agent_id} unreachable")  # wired up separately
        elif elapsed > 60:
            send_slack_alert(f"{agent_id} degraded")

Plus Redis infrastructure. Plus Slack webhooks. Plus PagerDuty integration. Plus a dashboard. Plus agent registration. Plus cleanup for agents that were intentionally stopped vs ones that crashed.
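The registration and cleanup piece is the subtlest part: an agent you stopped on purpose must not page anyone. A sketch of that logic, with an in-memory store standing in for the Redis keys above (hypothetical helpers, not a library API):

```python
import time

# In-memory stand-ins for the Redis keys used by the checker (illustrative only).
registered_agents: set[str] = set()
heartbeats: dict[str, float] = {}

def register(agent_id: str) -> None:
    """Called at agent startup so the checker knows to watch it."""
    registered_agents.add(agent_id)
    heartbeats[agent_id] = time.time()

def deregister(agent_id: str) -> None:
    """Called on intentional shutdown. Removing the agent here is what
    separates 'we stopped it' from 'it crashed' - no key, no alert."""
    registered_agents.discard(agent_id)
    heartbeats.pop(agent_id, None)
```

Miss the `deregister` call in one shutdown path and you get phantom alerts for agents that were retired weeks ago.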

Every team builds this. Every team maintains it. Every team's version has slightly different bugs.

What This Should Look Like

from axme import AxmeClient, AxmeClientConfig
import os

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

# Start heartbeat (background thread, every 30s)
client.mesh.start_heartbeat(interval_seconds=30)

# Agent does its normal work
while True:
    task = get_next_task()
    result = process(task)
    client.mesh.report_metric(success=True, latency_ms=result.duration_ms, cost_usd=result.cost)

Three lines of setup. The platform handles heartbeat tracking, status transitions, alerting, and the dashboard.

From any monitoring service:

result = client.mesh.list_agents()
for agent in result["agents"]:
    print(f"{agent['display_name']}: {agent['health_status']} (last: {agent['last_heartbeat_at']})")

# order-processor:  healthy      (last: 2026-04-01T14:30:02+00:00)
# refund-agent:     healthy      (last: 2026-04-01T14:30:05+00:00)
# inventory-sync:   degraded     (last: 2026-04-01T14:29:32+00:00)
# email-sender:     unreachable  (last: 2026-04-01T14:27:00+00:00)

Four Health States

| Status | What It Means | How It's Triggered |
| --- | --- | --- |
| HEALTHY | Running, reporting normally | Heartbeat received on time |
| DEGRADED | Running, but heartbeat is late | No heartbeat for 90-300 seconds |
| UNREACHABLE | Stopped sending heartbeats | No heartbeat for 300+ seconds |
| KILLED | Intentionally terminated | Explicit shutdown or kill command |
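The transitions fall out of a single elapsed-time calculation. A sketch using the thresholds from the table (the cutoffs come from the table above; the function itself is illustrative):

```python
def health_status(elapsed_s: float, killed: bool = False) -> str:
    """Map time since last heartbeat to one of the four states.
    KILLED is set explicitly on shutdown, never inferred from silence."""
    if killed:
        return "KILLED"
    if elapsed_s < 90:
        return "HEALTHY"
    if elapsed_s < 300:
        return "DEGRADED"
    return "UNREACHABLE"
```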

The key distinction: DEGRADED vs UNREACHABLE.

DEGRADED means the heartbeat is late (90-300 seconds). The agent might be stuck or overloaded.

UNREACHABLE means no heartbeat for over 5 minutes. The agent is likely down.

This distinction matters because the response is different. Degraded - investigate. Unreachable - restart immediately.
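If you automate the first response, that distinction becomes a small policy table. A hypothetical sketch:

```python
from typing import Optional

# Hypothetical escalation policy: the status alone decides the first response.
AUTO_ACTIONS: dict[str, Optional[str]] = {
    "HEALTHY": None,          # nothing to do
    "DEGRADED": "alert",      # might be load, might be a stuck loop - a human should look
    "UNREACHABLE": "restart", # process is almost certainly gone; restart first, dig later
    "KILLED": None,           # intentional shutdown - never page on these
}

def action_for(status: str) -> Optional[str]:
    return AUTO_ACTIONS.get(status)
```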

Timeline: Monday With vs Without

Without health monitoring:

14:30  order-processor OOM killed
14:30  No alert
15:00  refund-agent hangs (downstream dep gone)
15:00  No alert
17:45  Customer: "My refund has been pending for 3 hours"
17:50  Engineer SSHs into machine-3
17:55  "Oh. It's been dead since 2:30."
18:10  Restart. Begin processing backlog.

3 hours 15 minutes of silent failure. Customer-reported.

With AXME mesh:

14:30  order-processor OOM killed, heartbeats stop
14:32  Status: HEALTHY -> DEGRADED
14:35  Status: DEGRADED -> UNREACHABLE
14:35  Alert: "order-processor on machine-3 unreachable"
14:36  Engineer sees alert, checks dashboard
14:37  refund-agent status: DEGRADED (downstream timeout)
14:40  Restart order-processor. Both agents recover.

10 minutes from failure to recovery. No customer impact.

The Pattern: Observability for Agents

Web services have been doing this for 20 years. Health checks, readiness probes, metrics endpoints, dashboards. The tooling is mature.

AI agents are running the same way we ran web services in 2005 - deploy it, hope it works, find out when users complain.

The monitoring patterns are the same:

  1. Heartbeat - periodic "I'm alive" signal
  2. Status reporting - "I'm alive AND here's how I'm doing"
  3. Fleet view - see all agents in one place
  4. Alerting - notify when something changes
  5. History - when did it go down? How long was it out?

The difference is where these run. Web services have infrastructure that assumes health checks exist. Agent frameworks assume agents are ephemeral scripts that run and exit. Long-running agents fall through the gap.
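Patterns 1 and 2 usually collapse into one message: the heartbeat carries the status. A sketch of such a payload (field names are illustrative, not any platform's schema):

```python
import json
import time

def heartbeat_payload(agent_id: str, tasks_done: int, errors: int) -> str:
    """One message that says 'I'm alive' AND 'here's how I'm doing'.
    A checker reading this can flag an agent whose counter stopped moving."""
    return json.dumps({
        "agent_id": agent_id,
        "ts": time.time(),
        "tasks_done": tasks_done,
        "errors": errors,
    })
```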

Beyond Liveness: Application-Level Metrics

Process monitoring tells you the PID exists. Application-level metrics tell you the agent is actually doing useful work.

# Report metrics with each processed task
client.mesh.report_metric(success=True, latency_ms=230, cost_usd=0.03)

# Failed task
client.mesh.report_metric(success=False, latency_ms=4500, cost_usd=0.01)

Metrics are buffered and sent with the next heartbeat. The dashboard shows intents processed, success rate, latency, and cost per agent.
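The buffer-and-flush pattern is what keeps per-task overhead near zero: nothing leaves the process until the next heartbeat. A minimal sketch of the idea (not the AXME SDK's internals):

```python
class MetricsBuffer:
    """Metrics accumulate locally and ride along with the next heartbeat
    instead of making a network call per task."""

    def __init__(self):
        self.pending = []

    def report(self, success: bool, latency_ms: float, cost_usd: float) -> None:
        self.pending.append(
            {"success": success, "latency_ms": latency_ms, "cost_usd": cost_usd}
        )

    def flush(self) -> list:
        """Drain the buffer; the returned batch is attached to the heartbeat."""
        batch, self.pending = self.pending, []
        return batch
```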

The Dashboard

The AXME Mesh Dashboard at mesh.axme.ai shows your entire fleet health in real time - status, last heartbeat, cost, and alerts in one view:

[Screenshot: Agent Mesh Dashboard]

No log diving. No Grafana setup. No custom alerting pipelines.

Try It

Working example - register an agent, start heartbeat, kill it, watch the status change to UNREACHABLE:

github.com/AxmeAI/ai-agent-health-monitoring

Built with AXME - health monitoring, heartbeat, and fleet visibility for AI agents. Alpha - feedback welcome.
