DEV Community

George Belsky

3 of Your AI Agents Crashed and You Found Out From Customers

You have 20 agents running across 4 machines. Order processing, refunds, inventory sync, email notifications. They've been running fine for weeks.

Monday afternoon, the order-processor agent on machine-3 gets OOM killed. Process gone. No error. No alert. The refund-agent that depended on it starts hanging too.

You find out at 5:45 PM when a customer emails: "My refund has been pending for 3 hours."

The Monitoring Gap Nobody Talks About

Traditional services have health checks. Kubernetes has liveness probes. Load balancers have health endpoints. When a web server dies, something notices within seconds.

AI agents have none of this.

LangGraph:  No health monitoring. Agent runs or doesn't.
CrewAI:     No heartbeat. No fleet visibility.
AutoGen:    No built-in health checks across agents.
Raw Python: Hope someone checks the process list.

Your agent is a Python process. When it dies, it's just a missing PID. No health endpoint. No heartbeat. No dashboard showing 19/20 agents healthy.

The standard answer is "use Kubernetes" or "use systemd." Those track process liveness. They don't track agent health. An agent can be alive but stuck - processing zero tasks, blocked on a downstream dependency, spinning in an infinite retry loop. Process is running. Agent is useless.
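The gap is easy to demonstrate. Here is a minimal sketch (names are illustrative, not any framework's API) of the two different questions being asked:

```python
import os
import time

def process_is_alive(pid: int) -> bool:
    """The question systemd/Kubernetes answers: does the PID exist?"""
    try:
        os.kill(pid, 0)  # signal 0 probes the process without touching it
        return True
    except OSError:
        return False

def agent_is_healthy(last_task_completed_at: float, max_idle_s: float = 300.0) -> bool:
    """The question that actually matters: did the agent finish work recently?
    An agent spinning in a retry loop passes the first check and fails this one."""
    return time.time() - last_task_completed_at < max_idle_s
```

Both checks can disagree, and that disagreement is exactly the stuck-but-alive case.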

What You End Up Building

Every team that runs agents at scale builds the same thing:

# heartbeat_sender.py - added to every agent
import os
import threading
import time

import redis

AGENT_ID = os.environ["AGENT_ID"]  # e.g. "order-processor"
r = redis.Redis()

def heartbeat_loop():
    while True:
        r.set(f"heartbeat:{AGENT_ID}", time.time())
        time.sleep(30)

threading.Thread(target=heartbeat_loop, daemon=True).start()

Plus the checker:

# health_checker.py - separate process
import time

import redis

r = redis.Redis()

def check_agents():
    agents = r.smembers("registered_agents")
    for raw_id in agents:
        agent_id = raw_id.decode()  # redis-py returns bytes by default
        last_ping = r.get(f"heartbeat:{agent_id}")
        if last_ping is None:
            continue  # registered but never pinged
        elapsed = time.time() - float(last_ping)
        if elapsed > 90:
            send_pagerduty_alert(f"{agent_id} unreachable")  # wired up separately
        elif elapsed > 60:
            send_slack_alert(f"{agent_id} degraded")

Plus Redis infrastructure. Plus Slack webhooks. Plus PagerDuty integration. Plus a dashboard. Plus agent registration. Plus cleanup for agents that were intentionally stopped vs ones that crashed.
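The registration and cleanup piece is the subtlest part: an agent you stopped on purpose must not page anyone. A sketch of that logic, with an in-memory store standing in for the Redis keys above (hypothetical helpers, not a library API):

```python
import time

# In-memory stand-ins for the Redis keys used by the checker (illustrative only).
registered_agents: set[str] = set()
heartbeats: dict[str, float] = {}

def register(agent_id: str) -> None:
    """Called at agent startup so the checker knows to watch it."""
    registered_agents.add(agent_id)
    heartbeats[agent_id] = time.time()

def deregister(agent_id: str) -> None:
    """Called on intentional shutdown. Removing the agent here is what
    separates 'we stopped it' from 'it crashed' - no key, no alert."""
    registered_agents.discard(agent_id)
    heartbeats.pop(agent_id, None)
```

Miss the `deregister` call in one shutdown path and you get phantom alerts for agents that were retired weeks ago.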

Every team builds this. Every team maintains it. Every team's version has slightly different bugs.

What This Should Look Like

from axme import AxmeClient, AxmeClientConfig
import os

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

# Start heartbeat (background thread, every 30s)
client.mesh.start_heartbeat(interval_seconds=30)

# Agent does its normal work
while True:
    task = get_next_task()
    result = process(task)
    client.mesh.report_metric(success=True, latency_ms=result.duration_ms, cost_usd=result.cost)

Three lines of setup. The platform handles heartbeat tracking, status transitions, alerting, and the dashboard.

From any monitoring service:

result = client.mesh.list_agents()
for agent in result["agents"]:
    print(f"{agent['display_name']}: {agent['health_status']} (last: {agent['last_heartbeat_at']})")

# order-processor:  healthy      (last: 2026-04-01T14:30:02+00:00)
# refund-agent:     healthy      (last: 2026-04-01T14:30:05+00:00)
# inventory-sync:   degraded     (last: 2026-04-01T14:29:32+00:00)
# email-sender:     unreachable  (last: 2026-04-01T14:27:00+00:00)

Four Health States

| Status | What It Means | How It's Triggered |
| --- | --- | --- |
| HEALTHY | Running, reporting normally | Heartbeat received on time |
| DEGRADED | Running, but heartbeat is late | No heartbeat for 90-300 seconds |
| UNREACHABLE | Stopped sending heartbeats | No heartbeat for 300+ seconds |
| KILLED | Intentionally terminated | Explicit shutdown or kill command |
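The transitions fall out of a single elapsed-time calculation. A sketch using the thresholds from the table (the cutoffs come from the table above; the function itself is illustrative):

```python
def health_status(elapsed_s: float, killed: bool = False) -> str:
    """Map time since last heartbeat to one of the four states.
    KILLED is set explicitly on shutdown, never inferred from silence."""
    if killed:
        return "KILLED"
    if elapsed_s < 90:
        return "HEALTHY"
    if elapsed_s < 300:
        return "DEGRADED"
    return "UNREACHABLE"
```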

The key distinction: DEGRADED vs UNREACHABLE.

DEGRADED means the heartbeat is late (90-300 seconds). The agent might be stuck or overloaded.

UNREACHABLE means no heartbeat for over 5 minutes. The agent is likely down.

This distinction matters because the response is different. Degraded - investigate. Unreachable - restart immediately.
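If you automate the first response, that distinction becomes a small policy table. A hypothetical sketch:

```python
from typing import Optional

# Hypothetical escalation policy: the status alone decides the first response.
AUTO_ACTIONS: dict[str, Optional[str]] = {
    "HEALTHY": None,          # nothing to do
    "DEGRADED": "alert",      # might be load, might be a stuck loop - a human should look
    "UNREACHABLE": "restart", # process is almost certainly gone; restart first, dig later
    "KILLED": None,           # intentional shutdown - never page on these
}

def action_for(status: str) -> Optional[str]:
    return AUTO_ACTIONS.get(status)
```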

Timeline: Monday With vs Without

Without health monitoring:

14:30  order-processor OOM killed
14:30  No alert
15:00  refund-agent hangs (downstream dep gone)
15:00  No alert
17:45  Customer: "My refund has been pending for 3 hours"
17:50  Engineer SSHs into machine-3
17:55  "Oh. It's been dead since 2:30."
18:10  Restart. Begin processing backlog.

3 hours 15 minutes of silent failure. Customer-reported.

With AXME mesh:

14:30  order-processor OOM killed, heartbeats stop
14:32  Status: HEALTHY -> DEGRADED
14:35  Status: DEGRADED -> UNREACHABLE
14:35  Alert: "order-processor on machine-3 unreachable"
14:36  Engineer sees alert, checks dashboard
14:37  refund-agent status: DEGRADED (downstream timeout)
14:40  Restart order-processor. Both agents recover.

10 minutes from failure to recovery. No customer impact.

The Pattern: Observability for Agents

Web services have been doing this for 20 years. Health checks, readiness probes, metrics endpoints, dashboards. The tooling is mature.

AI agents are running the same way we ran web services in 2005 - deploy it, hope it works, find out when users complain.

The monitoring patterns are the same:

  1. Heartbeat - periodic "I'm alive" signal
  2. Status reporting - "I'm alive AND here's how I'm doing"
  3. Fleet view - see all agents in one place
  4. Alerting - notify when something changes
  5. History - when did it go down? How long was it out?

The difference is where these run. Web services have infrastructure that assumes health checks exist. Agent frameworks assume agents are ephemeral scripts that run and exit. Long-running agents fall through the gap.
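Patterns 1 and 2 usually collapse into one message: the heartbeat carries the status. A sketch of such a payload (field names are illustrative, not any platform's schema):

```python
import json
import time

def heartbeat_payload(agent_id: str, tasks_done: int, errors: int) -> str:
    """One message that says 'I'm alive' AND 'here's how I'm doing'.
    A checker reading this can flag an agent whose counter stopped moving."""
    return json.dumps({
        "agent_id": agent_id,
        "ts": time.time(),
        "tasks_done": tasks_done,
        "errors": errors,
    })
```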

Beyond Liveness: Application-Level Metrics

Process monitoring tells you the PID exists. Application-level metrics tell you the agent is actually doing useful work.

# Report metrics with each processed task
client.mesh.report_metric(success=True, latency_ms=230, cost_usd=0.03)

# Failed task
client.mesh.report_metric(success=False, latency_ms=4500, cost_usd=0.01)

Metrics are buffered and sent with the next heartbeat. The dashboard shows intents processed, success rate, latency, and cost per agent.
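The buffer-and-flush pattern is what keeps per-task overhead near zero: nothing leaves the process until the next heartbeat. A minimal sketch of the idea (not the AXME SDK's internals):

```python
class MetricsBuffer:
    """Metrics accumulate locally and ride along with the next heartbeat
    instead of making a network call per task."""

    def __init__(self):
        self.pending = []

    def report(self, success: bool, latency_ms: float, cost_usd: float) -> None:
        self.pending.append(
            {"success": success, "latency_ms": latency_ms, "cost_usd": cost_usd}
        )

    def flush(self) -> list:
        """Drain the buffer; the returned batch is attached to the heartbeat."""
        batch, self.pending = self.pending, []
        return batch
```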

The Dashboard

The AXME Mesh Dashboard at mesh.axme.ai shows your entire fleet health in real time - status, last heartbeat, cost, and alerts in one view:

[Screenshot: Agent Mesh Dashboard]

No log diving. No Grafana setup. No custom alerting pipelines.

Try It

Working example - register an agent, start heartbeat, kill it, watch the status change to UNREACHABLE:

github.com/AxmeAI/ai-agent-health-monitoring

Built with AXME - health monitoring, heartbeat, and fleet visibility for AI agents. Alpha - feedback welcome.
