
George Belsky
Your AI Agent Stopped Responding 2 Hours Ago. Nobody Noticed.

Your agent is deployed. Pod is running. Container passes liveness probes. Grafana shows a flat green line. Everything looks fine.

Except the agent stopped processing work 2 hours ago. It's alive - the process is there - but it's stuck. Deadlocked on a thread. Blocked on a full queue. Spinning in a retry loop that will never succeed. Silently swallowing exceptions in a `while True` loop.

Nobody knows until a customer reports it. Or until someone opens a dashboard at 5 PM and wonders why the task queue has been growing all afternoon.

Why Container Health Checks Don't Work for Agents

Kubernetes liveness probes check one thing: is the process responding to HTTP? If your agent serves a /healthz endpoint, the probe passes. The agent is "healthy."

But responding to /healthz and processing work are two different things. An agent can:

  • Deadlock on an internal lock while still serving HTTP
  • Lose a worker thread to an out-of-memory error while the main thread stays alive
  • Enter an infinite retry loop on a broken downstream API
  • Silently fall into an `except: pass` branch and stop doing anything

The process is running. The container is green. The agent is useless.

Container health check:  "Is the process alive?"       YES
What you actually need:  "Is the agent doing work?"    NO

This gap exists because container orchestration was designed for stateless web servers, not for long-running agents that hold state, maintain connections, and process work asynchronously.
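To make the gap concrete, here's a minimal Python sketch (illustrative only, no framework involved): the worker thread deadlocks on a lock, yet the check an orchestrator would call still reports success.

```python
import threading
import time

lock = threading.Lock()
work_done = 0

def worker():
    """Worker that deadlocks: it waits on a lock that is never released."""
    global work_done
    lock.acquire()          # blocks forever - the main thread holds this lock
    work_done += 1          # never reached

def health_check():
    """What a liveness probe effectively asks: is the process responsive?"""
    return "ok"             # the main thread is fine, so this always succeeds

lock.acquire()              # simulate a lock held elsewhere and never released
t = threading.Thread(target=worker, daemon=True)
t.start()
time.sleep(0.2)             # give the worker time to block

print(health_check())       # -> "ok": the probe passes
print(work_done)            # -> 0: no work is happening
```

The probe and the work live in different threads, so one can fail silently while the other keeps answering.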

The Heartbeat Pattern

The fix is old. Web services solved this 15 years ago with heartbeat monitoring. The idea is simple: the agent periodically reports "I am alive and working." If the report stops, something is wrong.

The difference between a health check and a heartbeat: health checks are passive (something pings you), heartbeats are active (you report out). A stuck agent can often still answer a passive ping from its main thread, but an agent whose work loop has stopped also stops producing heartbeats. That's the point.

But building heartbeat infrastructure for agents means:

# 1. Heartbeat sender (added to every agent)
import threading, time, requests

def heartbeat_loop(agent_id, interval=30):
    while True:
        try:
            requests.post(
                "https://monitoring.internal/heartbeat",
                json={"agent_id": agent_id, "ts": time.time()},
                timeout=5,
            )
        except Exception:
            pass
        time.sleep(interval)

threading.Thread(target=heartbeat_loop, args=("my-agent",), daemon=True).start()

# 2. Heartbeat checker (separate cron process)
# 3. Redis/Postgres for heartbeat storage
# 4. Alerting rules (Slack, PagerDuty)
# 5. Dashboard showing last-seen times
# 6. Logic to distinguish "stopped intentionally" from "crashed"
# 7. Cleanup for deregistered agents

That's an entire monitoring system - rebuilt for each agent framework you use, for each deployment environment, and maintained forever.
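Even item 2 on that list is real code you'd own. A hypothetical checker - illustrative agent names, and a plain dict standing in for the Redis/Postgres store - might look like:

```python
import time

# Hypothetical in-memory store; in the DIY setup this would be Redis/Postgres.
last_seen = {
    "billing-agent": time.time() - 10,     # reported 10 seconds ago
    "report-agent": time.time() - 400,     # silent for 400 seconds
}

def find_stale_agents(store, max_silence=90):
    """Return agents whose last heartbeat is older than max_silence seconds."""
    now = time.time()
    return [agent for agent, ts in store.items() if now - ts > max_silence]

print(find_stale_agents(last_seen))   # -> ['report-agent']
```

And that's before alert routing, deduplication, or cleanup of deregistered agents.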

One Line Instead

from axme import AxmeClient, AxmeClientConfig
import os

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))
client.mesh.start_heartbeat()

That's it. A daemon thread wakes up every 30 seconds, sends a heartbeat to the platform, and goes back to sleep. When the agent stops - crash, deadlock, OOM, network partition - the heartbeats stop. The platform notices.

No Redis. No cron. No Prometheus. No webhook integrations. No alerting rules to maintain.

How Health Is Computed

The platform tracks the timestamp of each heartbeat and computes health automatically:

| Time Since Last Heartbeat | Status | What It Means |
| --- | --- | --- |
| < 90 seconds | healthy | Agent is alive and reporting |
| 90-300 seconds | degraded | Agent may be stuck or overloaded |
| > 300 seconds | unreachable | Agent is down or not reporting |
| Manual kill | killed | Operator explicitly blocked this agent |

The thresholds are designed around the 30-second default interval. A healthy agent with interval_seconds=30 sends a heartbeat every 30 seconds. If the platform hasn't heard from it in 90 seconds (3 missed heartbeats), something is probably wrong. If 5 minutes pass, it's gone.

The degraded state is the useful one. It's the early warning. The agent isn't dead yet, but it's missed a couple of beats. Maybe the event loop is under load. Maybe a GC pause ate 45 seconds. Maybe the network is flaky. You have a window to investigate before the agent goes fully unreachable.
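Those thresholds map to a small state function. Here's a sketch using the values from the table (the function itself is illustrative, not the platform's actual implementation):

```python
def compute_status(seconds_since_heartbeat, killed=False):
    """Map time-since-last-heartbeat to the health states from the table."""
    if killed:
        return "killed"        # a manual kill overrides heartbeat freshness
    if seconds_since_heartbeat < 90:
        return "healthy"
    if seconds_since_heartbeat <= 300:
        return "degraded"
    return "unreachable"

print(compute_status(30))               # -> healthy
print(compute_status(120))              # -> degraded
print(compute_status(600))              # -> unreachable
print(compute_status(30, killed=True))  # -> killed
```

The whole health model fits in a dozen lines because the hard part - collecting the heartbeats reliably - is handled elsewhere.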

What Happens When an Agent Goes Down

Here's the timeline with heartbeat monitoring (timestamps are minutes:seconds):

00:00  Agent starts. Heartbeat begins.
00:30  Heartbeat sent. Status: healthy.
01:00  Heartbeat sent. Status: healthy.
01:15  Agent deadlocks on a database connection pool.
01:30  No heartbeat. (Agent is stuck, can't send.)
02:30  No heartbeat for 90s. Status: healthy -> degraded.
02:30  Platform logs state transition.
06:00  No heartbeat for 300s. Status: degraded -> unreachable.
06:00  Platform blocks new intent delivery to this agent.

Without heartbeat monitoring (timestamps are hours:minutes):

00:00  Agent starts.
01:15  Agent deadlocks.
...
...
03:15  Someone notices the task queue growing.
03:30  Engineer SSHs in. "The process is running."
03:45  "The container is green. Logs look... wait, no new logs since 1:15."
04:00  Engineer restarts the agent.

The difference: under 2 minutes to detection vs 2.75 hours. And the first scenario is automatic - no human needs to notice anything.

Heartbeat with Metrics

The heartbeat isn't just a ping. It can carry operational metrics, flushed automatically with each beat:

client.mesh.start_heartbeat(include_metrics=True)

# As the agent processes work, report metrics
client.mesh.report_metric(success=True, latency_ms=234.5, cost_usd=0.003)
client.mesh.report_metric(success=False, latency_ms=5012.0)

# Metrics are buffered in memory and sent with the next heartbeat
# No separate metrics pipeline needed

Every 30 seconds, the heartbeat sends both "I'm alive" and "here's how I'm doing" - success rate, average latency, cost accumulation. The platform aggregates per agent and exposes it through the CLI and dashboard.

This turns the heartbeat from a binary alive/dead signal into a continuous health signal. An agent that's alive but processing tasks at 20x normal latency shows up before it becomes a problem.
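The buffer-and-flush behavior described above can be sketched like this (a simplified illustration of the idea, not the SDK's internals):

```python
class MetricsBuffer:
    """Accumulate per-task metrics in memory; flush aggregates with each heartbeat."""

    def __init__(self):
        self._samples = []

    def report(self, success, latency_ms, cost_usd=0.0):
        self._samples.append((success, latency_ms, cost_usd))

    def flush(self):
        """Aggregate buffered samples and clear the buffer."""
        if not self._samples:
            return None
        n = len(self._samples)
        summary = {
            "count": n,
            "success_rate": sum(1 for ok, _, _ in self._samples if ok) / n,
            "avg_latency_ms": sum(lat for _, lat, _ in self._samples) / n,
            "total_cost_usd": sum(cost for _, _, cost in self._samples),
        }
        self._samples.clear()
        return summary

buf = MetricsBuffer()
buf.report(success=True, latency_ms=234.5, cost_usd=0.003)
buf.report(success=False, latency_ms=5012.0)
print(buf.flush())   # aggregated summary, sent alongside the next heartbeat
```

Because metrics piggyback on a network call that's happening anyway, there's no extra pipeline to stand up or fail.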

Kill and Resume

Sometimes an agent needs to be stopped. Not crashed - intentionally blocked. Maybe it's misbehaving. Maybe you're doing maintenance. Maybe it's burning through your API budget.

# From code (address_id from list_agents)
client.mesh.kill("addr_abc123")

A killed agent enters the killed state. Even if its heartbeat thread is still running, the gateway keeps it killed. No intents are delivered. It stays killed until explicitly resumed:

client.mesh.resume("addr_abc123")

Or kill/resume from the dashboard at mesh.axme.ai with one click.

This is different from the agent crashing. A crash leads to unreachable. A kill is deliberate. The distinction matters for alerting - you don't want to page on-call for an agent you intentionally stopped.
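The gateway rule - a kill persists even while heartbeats keep arriving - can be sketched as a tiny state holder (illustrative only, not the gateway's actual code):

```python
class AgentState:
    """Track whether an agent may receive intents; a kill overrides liveness."""

    def __init__(self):
        self.killed = False
        self.last_heartbeat = None

    def heartbeat(self, ts):
        self.last_heartbeat = ts   # recorded, but does NOT clear a kill

    def kill(self):
        self.killed = True

    def resume(self):
        self.killed = False

    def can_receive_intents(self):
        return not self.killed

state = AgentState()
state.kill()
state.heartbeat(ts=1700000000.0)      # a heartbeat still arrives...
print(state.can_receive_intents())    # -> False: ...but the agent stays killed
state.resume()
print(state.can_receive_intents())    # -> True
```

Keeping the kill flag separate from heartbeat freshness is what lets alerting treat "killed" and "unreachable" differently.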

Fleet Visibility

When you have 20 agents across 4 machines, the dashboard matters more than any individual heartbeat.

The AXME Mesh Dashboard at mesh.axme.ai shows complete fleet health in real time:

Agent Mesh Dashboard

Open it with:

axme mesh dashboard

...
report-generator         killed        (manual)

Summary: 2 healthy, 1 degraded, 1 unreachable, 1 killed

One command. Complete fleet health. No SSH. No Grafana. No log aggregation pipeline.
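The summary line in that output is just an aggregation over per-agent statuses. A sketch, with hypothetical agent names:

```python
from collections import Counter

def fleet_summary(statuses):
    """Format a fleet-health summary line like the CLI output above."""
    counts = Counter(statuses.values())
    order = ["healthy", "degraded", "unreachable", "killed"]
    return ", ".join(f"{counts.get(s, 0)} {s}" for s in order)

# Hypothetical fleet; only report-generator appears in the output above.
agents = {
    "billing-agent": "healthy",
    "search-agent": "healthy",
    "etl-agent": "degraded",
    "scraper-agent": "unreachable",
    "report-generator": "killed",
}
print("Summary:", fleet_summary(agents))
```

Once every agent heartbeats to one place, fleet-wide views like this come essentially for free.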

The Real Cost of Silent Failures

Every team running agents at scale has the same story. An agent went down on Friday afternoon. Nobody noticed until Monday morning. 60 hours of missed processing. Customer complaints. Backlog that took another 8 hours to clear.

The fix isn't complicated. It's one function call. The hard part is remembering that containers passing health checks is not the same as agents doing work.

client.mesh.start_heartbeat()

That's the whole fix.

Try It

Working example - start an agent with heartbeat, kill the process, watch the status transition from healthy to degraded to unreachable:

github.com/AxmeAI/ai-agent-heartbeat-monitoring

Built with AXME - heartbeat, health detection, and fleet monitoring for AI agents. Alpha - feedback welcome.
