Your monitoring dashboard shows green across the board. Process running. Port responding. CPU normal. Memory stable.
But your AI agent hasn't done anything useful in four hours.
## The problem with traditional health checks
Traditional health checks answer one question: "Is the process alive?" For web servers, that's usually enough. If Nginx is running and responding on port 80, it's probably serving pages.
AI agents are different. An agent can be alive without being productive. The process is running, but the main work loop is stuck on a hung HTTP call, waiting on a deadlocked mutex, or spinning in a retry loop that will never succeed.
## Three ways health checks lie
### 1. PID exists ≠ working
`systemctl status my-agent` says "active (running)". But the agent's main loop has been blocked on `requests.get()` for three hours because an upstream API rotated its TLS certificate and the connection is hanging without a timeout.
The health check thread runs independently and reports "I'm fine" every 30 seconds.
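This failure mode is easy to reproduce. The toy below (all names are illustrative, and the hang is simulated with a blocking wait standing in for a `requests.get()` with no timeout) shows a health thread happily ticking while the work loop never completes a single cycle:

```python
import threading
import time

reports = []

def health_reporter(stop):
    # Background thread: reports "alive" regardless of the work loop.
    while not stop.is_set():
        reports.append("OK")        # a real agent would POST this to a monitor
        stop.wait(0.1)              # report every 100 ms for the demo

def work_loop():
    # Simulate an HTTP call hanging on a dead connection with no timeout:
    threading.Event().wait(10)      # blocks "forever" relative to the demo
    reports.append("work done")     # never reached

stop = threading.Event()
threading.Thread(target=health_reporter, args=(stop,), daemon=True).start()
threading.Thread(target=work_loop, daemon=True).start()

time.sleep(0.5)
stop.set()
# reports now contains several "OK" entries and zero "work done" entries.
```

The monitor sees a steady stream of "OK" and has no way to know the work loop is stuck.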
### 2. Port responds ≠ working
Many agents expose an HTTP health endpoint. A load balancer pings /health, gets 200 OK, and assumes everything is fine.
But the /health handler runs on a different thread from the agent's work loop. The work loop is dead. The health endpoint is alive. Two completely different things.
### 3. No errors ≠ working
Your error tracking shows zero exceptions. Must be healthy, right?
Except the agent is caught in a logic loop: parse response → ask LLM to fix → get the same malformed response → repeat. Every request succeeds. Every response is valid. The agent just isn't making progress, and it's burning through API credits at 200x the normal rate.
## What actually works
There are two levels of heartbeat protection, and they catch different failures.
Level 1 — Liveness heartbeat (background thread or sidecar). This proves the process is alive. It catches crashes, OOM kills, and clean exits. But it doesn't catch zombies — the health-check thread keeps ticking even when the work loop is stuck on a hung API call.
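A minimal Level 1 sketch: touch a file on a timer so an external watchdog (a cron job, systemd timer, or sidecar) can check its age. The file path, interval, and the ~3× staleness threshold are illustrative choices, not a fixed convention:

```python
import os
import tempfile
import threading
import time

# Hypothetical heartbeat file location for this sketch.
HEARTBEAT_FILE = os.path.join(tempfile.gettempdir(), "agent.heartbeat")

def liveness_beat(interval=1.0):
    # Runs in a background thread: proves the process is alive, nothing more.
    while True:
        with open(HEARTBEAT_FILE, "w") as f:
            f.write(str(time.time()))
        time.sleep(interval)

threading.Thread(target=liveness_beat, daemon=True).start()

# The watchdog side: alert if the file is older than ~3 intervals.
def is_alive(max_age=3.0):
    try:
        return time.time() - os.path.getmtime(HEARTBEAT_FILE) < max_age
    except FileNotFoundError:
        return False
```

If the process crashes or is OOM-killed, the file stops being touched and `is_alive()` flips to false within a few intervals. A zombie work loop, though, sails right past this check.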
Level 2 — Work-progress heartbeat (inside the work loop). This proves the agent is doing useful work:
```python
while True:
    data = fetch_data()     # If this hangs...
    result = process(data)
    heartbeat()             # ...this never fires
    sleep(interval)
```
If `heartbeat()` doesn't fire within the expected interval, something is wrong. You don't need to know what — you need to know when.
A background-thread heartbeat is better than nothing because it solves the silent-exit problem. But for zombie failures, the heartbeat needs to come from inside the loop that does the actual work. For full coverage, use both.
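One way to wire the two levels together (a sketch; the class name and the 3×-interval staleness threshold are my assumptions) is a shared timestamp that only the work loop updates and only the watchdog reads:

```python
import threading
import time

class Heartbeat:
    """Work loop calls beat(); a watchdog thread polls is_stale()."""

    def __init__(self, interval):
        self.interval = interval
        self.last_beat = time.monotonic()
        self._lock = threading.Lock()

    def beat(self):
        # Called from inside the work loop, after real work completes.
        with self._lock:
            self.last_beat = time.monotonic()

    def is_stale(self):
        # Called from the watchdog side; 3x interval is an arbitrary margin.
        with self._lock:
            return time.monotonic() - self.last_beat > 3 * self.interval
```

Using `time.monotonic()` rather than `time.time()` matters here: a wall-clock adjustment (NTP step, DST) shouldn't be able to fake or mask a stale heartbeat.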
## Adding cost as a health signal
For LLM-backed agents, there's a third dimension: cost per cycle. A runaway loop doesn't spike CPU because LLM calls are I/O-bound. But it does spike token usage.
Track tokens per heartbeat cycle. If it jumps 10-100x above baseline, you have a loop even if every other metric says "healthy."
## The monitoring stack for AI agents
| Signal | Web server | AI agent |
|---|---|---|
| Is it alive? | Process check | Positive heartbeat |
| Is it working? | Request latency | Heartbeat from work loop |
| Is it healthy? | Error rate | Cost per cycle |
The minimum version is simple: put a heartbeat inside your main loop, include token count, and alert on silence and cost spikes. That catches most AI agent failures that traditional monitoring misses.
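The whole minimum version fits in a few lines. Here `fetch_task` and `run_llm_step` are hypothetical stand-ins for your agent's real work; the only structural point is that each cycle emits one heartbeat record carrying a timestamp and a token count, and an external alerter (not shown) fires on silence or on a spike:

```python
import time

def fetch_task():
    return "summarize inbox"      # stub standing in for real task intake

def run_llm_step(task):
    return 1200                   # stub: tokens consumed by the LLM call

def run_agent(cycles, emit):
    for _ in range(cycles):
        task = fetch_task()
        tokens = run_llm_step(task)
        # The heartbeat: fires only after a full cycle of real work.
        emit({"ts": time.time(), "tokens": tokens})

records = []
run_agent(3, records.append)
```

In production, `emit` would ship the record to your metrics pipeline; silence beyond a few intervals means a stuck loop, and a token spike means a runaway one.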
I originally wrote this pattern up after debugging long-running agent failures in production. If you want the fuller walkthrough, the canonical version lives on the ClevAgent blog.