I ran the same agent for thirty straight days. It died five times. Four of them did not show up in any log I had set up ahead of time, which is the part that bothers me most.
By the end I had a checklist of things that take an agent down at 2am while you're asleep, and none of them are the dramatic failures that get blog posts. They are all dull.
Here is the list.
1. OOM during a long tool-call loop
The agent is happily looping through 200 tool calls in one task. Each call returns a response. The agent appends every response to its working context plus an internal trace it writes to disk. Around call 150, memory allocation starts outpacing anything the runtime frees. By call 180, the kernel OOM killer wakes up and ends the process.
In the log: nothing. The agent's stdout cuts off mid-sentence. The supervisor logs say "process exited 137", which is 128 plus SIGKILL, the exit code the OOM killer leaves behind, but very few people read it that way the first time.
The boring fix: cgroup memory limits with a soft warning at 80%, plus a tool-call counter that flushes the working trace to disk every 25 calls and resets the in-memory copy. Not exotic. Just remembering that long agent loops are basically a memory leak unless you actively flush.
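A minimal sketch of that flush pattern, in Python. The agent loop API here (agent.plan, agent.execute) and the trace file path are hypothetical stand-ins for whatever your framework actually exposes:

```python
import json

FLUSH_EVERY = 25                    # flush the in-memory trace every N tool calls
TRACE_PATH = "agent_trace.jsonl"    # placeholder path for the on-disk trace

def run_loop(agent, task):
    trace = []
    for call_no, tool_call in enumerate(agent.plan(task), start=1):  # hypothetical agent API
        result = agent.execute(tool_call)                            # hypothetical agent API
        trace.append({"call": call_no, "tool": tool_call.name, "result": result})

        # Every FLUSH_EVERY calls, persist the trace and drop the in-memory
        # copy so the loop's footprint stays flat instead of growing per call.
        if call_no % FLUSH_EVERY == 0:
            with open(TRACE_PATH, "a") as f:
                for entry in trace:
                    f.write(json.dumps(entry, default=str) + "\n")
            trace.clear()
```

The point is just that the in-memory trace has a hard ceiling, no matter how many calls the loop runs.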
2. File descriptor exhaustion
Day eleven. The agent had been making API calls all day. A new tool call started and immediately got OSError: too many open files. The agent caught the exception, tried to retry, got the same error, gave up, returned an error to the user.
The agent itself didn't crash. It just stopped being useful. The supervisor process saw "agent returned an error" and moved on. Nothing alerted.
What actually happened: the agent's HTTP client was reusing a session pool that didn't close idle sockets, and over 11 days it had accumulated about 950 open FDs against the per-process default of 1024. Every new HTTP call added to the pool. Eventually it ran out.
The boring fix: explicit session lifecycle with a timeout, a daily restart of the agent process, and ulimit -n raised to something sane (16384 on the runtimes I cared about). The daily restart is the cheap one. People resist it because it feels primitive, but every long-running daemon I have ever shipped survives on a daily restart somewhere in the stack.
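For illustration, a rough sketch of the session-scoping and FD-watch halves, assuming requests as the HTTP client and a Linux /proc filesystem. The threshold and the URL handling are placeholders:

```python
import os
import requests   # assumes the tools use requests; swap in your own HTTP client

FD_WARN_THRESHOLD = 800   # warn well before the default 1024 hard cap

def fd_count():
    # Linux-only: every entry in /proc/self/fd is one open descriptor.
    return len(os.listdir("/proc/self/fd"))

def call_api(url, payload):
    # Scope the session to the call (or to one task) so idle sockets get
    # closed instead of piling up in a long-lived shared pool.
    with requests.Session() as session:
        resp = session.post(url, json=payload, timeout=10)
        resp.raise_for_status()
        return resp.json()

def check_fds():
    n = fd_count()
    if n > FD_WARN_THRESHOLD:
        print(f"warning: {n} open file descriptors, restart or investigate")
```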
3. Context window bloat
This one I had read about, but it still got me. The agent's working context grew to about 180,000 tokens by hour 60 of a multi-day task. Each new tool call cost more than the last because the model was paying to re-read the entire history. By hour 65 a single tool call was taking 90 seconds and burning through the per-minute rate limit, which the agent interpreted as "the API is down" and went into a backoff loop.
The agent didn't crash. It just got slower and slower until it was producing nothing, and the bill kept going up.
The boring fix: a context summarizer that runs every N tool calls, replaces the oldest K turns with a one-paragraph summary, and keeps the most recent 5 turns verbatim. This is well-trodden ground in the literature, but the surprising part is how rarely small teams actually wire it up. Most agent codebases I have looked at assume the conversation will end in a few turns. Long-running agents need garbage collection on their own conversation history.
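Here is roughly what that garbage collection looks like as a sketch. The turn format and the summarize call are placeholders for whatever your agent actually uses; you would call this every N tool calls:

```python
KEEP_VERBATIM = 5   # most recent turns kept word for word

def compact_history(history, summarize):
    """Replace everything older than the last KEEP_VERBATIM turns with one summary turn.

    `history` is a list of turn dicts; `summarize` is whatever call you use to
    ask the model for a one-paragraph summary (hypothetical here).
    """
    if len(history) <= KEEP_VERBATIM:
        return history
    old, recent = history[:-KEEP_VERBATIM], history[-KEEP_VERBATIM:]
    summary_text = summarize(old)  # one paragraph covering the older turns
    return [{"role": "system", "content": f"Summary of earlier work: {summary_text}"}] + recent
```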
If you want a longer treatment of why AI agent hosting is mostly about boring problems like this one rather than the model itself, the full write-up linked at the end is worth a skim. The summary: the model is the easy part now. Everything around it is where the failures live.
4. ulimit walls (max user processes)
Day nineteen. The agent had spawned a background process for a long-running task and then gone on to do other work. Background tasks accumulated. By midnight there were 287 zombie processes attached to the agent's user, the per-user max user processes limit was somewhere around 1024 in this environment, and at 03:14 a new spawn failed with Resource temporarily unavailable.
In the log: a single line saying the spawn failed. The agent caught it as a generic exception and continued. The user-facing behavior was "this task takes forever." Three days later, when I finally noticed, I had to manually reap the zombies.
The boring fix: a process supervisor that owns the lifecycle of every spawned task, kills anything that has been alive longer than its declared TTL, and treats child processes as a resource that needs to be tracked. setsid and prctl(PR_SET_PDEATHSIG) are your friends. Also raise ulimit -u to something generous, but the real fix is killing things on schedule.
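A sketch of the spawn-side half of that, assuming Linux and Python's subprocess module. The TTL bookkeeping here is deliberately minimal:

```python
import ctypes
import signal
import subprocess
import time

PR_SET_PDEATHSIG = 1
libc = ctypes.CDLL("libc.so.6", use_errno=True)

def _die_with_parent():
    # Runs in the child just before exec: ask the kernel to SIGTERM this
    # child if the agent process dies first.
    libc.prctl(PR_SET_PDEATHSIG, signal.SIGTERM)

running = []   # (process, deadline) pairs the supervisor owns

def spawn(cmd, ttl_seconds):
    # start_new_session=True gives the child its own session (setsid), so a
    # later kill targets the whole group instead of orphaning grandchildren.
    proc = subprocess.Popen(cmd, preexec_fn=_die_with_parent, start_new_session=True)
    running.append((proc, time.monotonic() + ttl_seconds))
    return proc

def reap():
    # Kill anything past its declared TTL, and wait() on finished children so
    # they don't linger as zombies counted against the per-user process limit.
    for proc, deadline in list(running):
        if proc.poll() is None and time.monotonic() > deadline:
            proc.kill()    # past its TTL: terminate it
            proc.wait()    # collect the exit status, no zombie left behind
            running.remove((proc, deadline))
        elif proc.poll() is not None:
            proc.wait()    # already exited on its own; reap it
            running.remove((proc, deadline))
```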
5. Webhook timeouts that look like success
Last one, and the meanest. The agent finished a task and called a webhook to notify a downstream system. The webhook took 31 seconds to respond. The HTTP client had a 30-second timeout. The client raised a timeout error. The agent's wrapper caught the timeout and logged a success because the wrapper had been written assuming "timeout means delivered, the receiver was just slow."
This is true for some kinds of fire-and-forget delivery. It is catastrophic for any kind of state-changing call. The downstream system never received the call. The agent thought it had. The user-facing system had two views of the world that did not agree.
In the log: a success line. No error. Nothing wrong.
The boring fix: idempotency keys on every state-changing webhook, a status check after every call that crosses the timeout threshold, and a rule that a timeout is never treated as success without a separate confirmation. A timeout tells you the status is unknown. It does not tell you the call was delivered.
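Sketched out with requests and a made-up payload shape (the Idempotency-Key header is a common convention, and the follow-up status check is left to whatever the receiver exposes), the timeout branch is the part that matters:

```python
import uuid
import requests

def notify_webhook(url, payload, timeout=30):
    # Same key on every retry, so the receiver can deduplicate if the first
    # attempt actually landed and only the response was lost.
    key = payload.setdefault("idempotency_key", str(uuid.uuid4()))
    try:
        resp = requests.post(url, json=payload,
                             headers={"Idempotency-Key": key}, timeout=timeout)
        resp.raise_for_status()
        return "delivered"
    except requests.Timeout:
        # A timeout means the status is UNKNOWN, never success. Hand the key
        # to a separate status check (whatever the receiver offers) before
        # marking the task done.
        return "unknown"
```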
The pattern across all five
Every one of these failures had the same shape: a long-running agent ran into a resource limit or a state assumption that was fine for short tasks and broken for multi-day ones. The agent itself did not crash in four of the five cases. It just stopped being useful, and the supervisor was not watching for that.
The hosting layer needs to do three things that aren't sexy:
- Memory and FD limits with warnings before the hard cap, not at the hard cap
- Process lineage tracking so spawned tasks can't outlive their parent's intention
- State-changing call confirmation, not just transport-level success
If you are running an agent on your laptop for an hour, none of this matters. If you are hosting OpenClaw agents in production for paying customers, all of this matters more than the model you picked.
What I now log on day one of any agent project
The thing that would have saved me the most pain on this run is just better logging from the start. None of the metrics below are exotic. None of them require an APM vendor. They are just the things I now scrape from any agent process before letting it run for more than 24 hours.
Per agent loop I track: RSS memory, FD count, child process count, total tool calls in this loop, total context tokens, time since last tool call returned, and the result of the last 10 tool calls (success or specific error). Per host I track: free memory, total FDs in use, load average, and the pid count for the agent's user. Both go to a flat file with a timestamp. No dashboard. Just a thing I can grep when something is weird.
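For what it's worth, the per-loop half of that is a couple of /proc reads plus counters the loop already has in hand. A sketch, assuming Linux; the function names and the flat-file path are placeholders:

```python
import json
import os
import time

def loop_metrics(tool_calls, context_tokens, last_results):
    # Per-loop sample: RSS and FD count scraped from /proc (Linux-only),
    # the rest passed in by the loop itself.
    with open("/proc/self/status") as f:
        status = f.read()
    rss_kb = int(status.split("VmRSS:")[1].split()[0])
    return {
        "ts": time.time(),
        "rss_kb": rss_kb,
        "fd_count": len(os.listdir("/proc/self/fd")),
        "tool_calls": tool_calls,
        "context_tokens": context_tokens,
        "last_results": last_results[-10:],   # last 10 tool-call outcomes
    }

def log_metrics(path, metrics):
    # One JSON line per sample, appended to a flat file you can grep later.
    with open(path, "a") as f:
        f.write(json.dumps(metrics) + "\n")
```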
Five of those metrics would have caught four of the five failures I described, hours or days before they actually broke things. The fifth (the webhook timeout) needs application-level logging, not host-level. That one is on the developer of the wrapper.
I have a longer guide on the hosting end of this at https://rapidclaw.dev/blog/ai-agent-hosting-complete-guide, but if you read nothing else, read this: the most expensive failure mode is the one that doesn't crash the process. Crashes get noticed. Slow-degrading agents do not.


