I've been running long-lived AI agents in production for a while now. The specific workloads changed over time, but the operational failures were surprisingly consistent.
What follows is the setup I wish I had from day one. None of it depends on a specific framework. If you run an LLM in a loop, poll external APIs, or make decisions on a schedule, these patterns matter.
1. Your agent will die and not tell you
The first time one of my agents crashed overnight, I lost hours before I noticed. There was no error log because it was not an application error in the usual sense — the OS killed the process for memory.
What I do now: Every agent sends a heartbeat from inside its main loop. Not a separate health-check thread. Not a sidecar process. From the actual loop.
That distinction matters. If the main loop is stuck on I/O, deadlocked on a lock, or wedged inside a retry path, an external "process is up" check tells you very little.
Here is the minimal pattern:
```python
import time

def heartbeat(agent: str, *, status: str = "ok", tokens_used: int = 0) -> None:
    # Send this to whatever monitoring system you use.
    print(
        {
            "agent": agent,
            "status": status,
            "tokens_used": tokens_used,
            "ts": time.time(),
        }
    )

# run_agent_cycle() is your agent's work function.
while True:
    result = run_agent_cycle()
    heartbeat("my-agent", status="ok", tokens_used=result.tokens_used)
    time.sleep(60)
```
If the heartbeat stops, something is wrong. I usually check every 60 seconds and alert after 2 missed beats.
2. Auto-restart is harder than you think
"Just restart it" sounds simple until you hit edge cases:
- Restart loops: A bad config causes the agent to crash immediately after starting. Without a cooldown, you get crash → restart → crash → restart forever.
- Platform differences: Docker restart policies work well. `launchd` on macOS silently fails if the service domain is wrong. `systemd` needs a `RestartSec` or it can spin.
- State corruption: If your agent crashed mid-write to a state file, restarting puts it in an inconsistent state.
What I do now: 5-minute cooldown between restarts. After 3 failed restarts, stop trying and alert me. On restart, the agent validates its state before resuming.
A good restart policy is less like "always restart" and more like:
```
missed heartbeats  -> mark unhealthy
restart once       -> wait 5 minutes
restart again      -> wait 5 minutes
restart third time -> stop auto-restarting, escalate to human
```
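That escalation ladder fits in a small supervisor loop. A sketch, assuming the agent is an external process and that `print` stands in for your event log:

```python
import subprocess
import time

COOLDOWN_SECONDS = 5 * 60
MAX_RESTARTS = 3

def supervise(cmd: list[str], *, cooldown: int = COOLDOWN_SECONDS,
              max_restarts: int = MAX_RESTARTS) -> str:
    """Run cmd, restarting on failure with a cooldown; give up after max_restarts."""
    failures = 0
    while True:
        if subprocess.call(cmd) == 0:
            return "clean"  # clean exit: nothing to restart
        failures += 1
        print({"event": "restart", "attempt": failures})
        if failures >= max_restarts:
            return "escalated"  # stop auto-restarting, page a human
        time.sleep(cooldown)  # the cooldown is what prevents a tight crash loop
```

In a real setup you would also run the state-validation step before relaunching; this sketch only covers the cooldown and escalation logic.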
3. LLM cost is a health metric
This was my biggest insight. For traditional services, you monitor CPU, memory, and latency. For LLM agents, token cost per cycle is often the metric that catches problems first.
A runaway loop doesn't spike CPU (API calls are I/O bound). It doesn't spike memory. But token usage goes from 200/min to 40,000/min instantly. If you're not tracking cost per cycle, you'll find out from your API bill.
The simplest version of this is a moving baseline:
```python
baseline = rolling_average(tokens_per_cycle[-50:])
if tokens_used_this_cycle > baseline * 10:
    alert("possible loop", tokens_used=tokens_used_this_cycle)
```
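A runnable version of that check, using a fixed-size window (the 50-cycle window and 10x threshold match the pseudocode above; the class name is mine):

```python
from collections import deque

class LoopDetector:
    """Flags any cycle whose token usage far exceeds the recent average."""

    def __init__(self, window: int = 50, factor: float = 10.0) -> None:
        self.history: deque[int] = deque(maxlen=window)
        self.factor = factor

    def record(self, tokens_used: int) -> bool:
        """Record one cycle's usage; return True if it looks like a runaway."""
        runaway = (
            len(self.history) > 0
            and tokens_used > self.factor * (sum(self.history) / len(self.history))
        )
        self.history.append(tokens_used)
        return runaway
```

Feeding it the numbers from above: fifty cycles at ~200 tokens establish the baseline, and the first 40,000-token cycle trips the detector immediately.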
4. Graceful shutdown is not optional
One of my agents sends a burst of API calls during shutdown to finish cleanup safely. The first time I added loop detection, it flagged every graceful shutdown as a runaway.
What I do now: The agent signals "shutting down" before cleanup. The monitoring system knows to expect a burst and does not flag it.
5. Daily reports catch the slow problems
Alerts catch emergencies. Daily reports catch slower drift that alerts miss — an agent that is gradually using more tokens per cycle, or one that restarts once a day at the same time because of a cron conflict.
I review a daily summary of each agent's health, cost, and event history. Most of my operational improvements came from patterns in that report, not from real-time alerts.
The basic report I want every morning is:
- Was the agent alive the whole day?
- How many restart events happened?
- Did token cost per cycle move outside baseline?
- Were there loop-detection or cooldown events?
- Did anything get auto-recovered, or does it need a human?
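A report answering those questions can be assembled from the same event stream the agents already emit. A rough sketch, assuming events are dicts with `event` and `tokens_used` keys (these field names are mine, matching the heartbeat sketch above):

```python
def daily_report(events: list[dict]) -> dict:
    """Summarize one day of agent events into a morning-review dict."""
    restarts = sum(1 for e in events if e.get("event") == "restart")
    loops = sum(1 for e in events if e.get("event") == "possible_loop")
    tokens = [e["tokens_used"] for e in events if "tokens_used" in e]
    return {
        "restarts": restarts,
        "loop_detections": loops,
        "avg_tokens_per_cycle": sum(tokens) / len(tokens) if tokens else 0,
        "needs_human": restarts >= 3 or loops > 0,
    }
```

The slow-drift problems show up as a gradually rising `avg_tokens_per_cycle`, which no single-cycle alert will ever catch.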
These patterns aren't complicated, but I didn't find them written down anywhere when I started. Hopefully this saves someone a few "learning experiences."
If you want to see what my setup looks like, I built these ideas into ClevAgent. But honestly, even a homegrown heartbeat plus cost-per-cycle tracker gets you most of the way there.