Silent Resource Exhaustion: When Your AI Agent Eats All the Memory

#ai #agents #debugging #reliability

Your AI agent is running fine. Responses are crisp, tasks get done, everything looks healthy.

Then one day it starts... slowing down. Responses take longer. Context gets stale. And you can't figure out why, because nothing broke.

Welcome to silent resource exhaustion — the class of bugs where your system degrades gradually and never tells you about it.

Two Bugs, One Pattern

Two recent OpenClaw bug reports caught my eye because they're the same disease with different symptoms.

Bug #1: Transcripts That Never Stop Growing

#50613 — transcript files grow without bound for regular sessions. Heartbeat sessions have pruning. Normal sessions? They just keep growing.

The cascade: bigger transcript → bigger context → slower LLM calls → compaction fires more often.

Bug #2: The Flush Threshold That Can't Math

#50611 — when reserveTokensFloor equals contextWindow (both 200K), the flush threshold goes negative. Compaction never triggers. The agent fills the entire context window.

A perfect example of configuration state space explosion: each setting is valid on its own, and the bug lives only in the combination.
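To make the arithmetic concrete, here is a minimal sketch. The formula is an assumption on my part, not taken from the OpenClaw source: it supposes the flush threshold is the context window minus the reserve floor minus a small safety margin. `soft_margin` is a hypothetical name; only `reserveTokensFloor` and `contextWindow` come from the bug report.

```python
def flush_threshold(context_window: int, reserve_tokens_floor: int,
                    soft_margin: int = 4_000) -> int:
    """Token count at which a flush would be requested.

    ASSUMPTION: this formula is illustrative, not the project's real one.
    soft_margin is a hypothetical safety buffer.
    """
    return context_window - reserve_tokens_floor - soft_margin


# A sane pairing: plenty of room between the reserve and the window.
healthy = flush_threshold(context_window=200_000, reserve_tokens_floor=20_000)

# The reported pairing: both values are 200K, so nothing is left over.
broken = flush_threshold(context_window=200_000, reserve_tokens_floor=200_000)

print(healthy)  # 176000
print(broken)   # -4000
```

Under this assumed formula the broken pairing yields a negative threshold. How the downstream check reacts to a negative value is not modeled here; the point is only that two plausible numbers produce a value with no sensible meaning.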

Why "Silent" Is the Dangerous Part

Both bugs share a critical property: no error, no warning, no indication anything is wrong.

This pattern keeps appearing in agent systems:

Positive feedback loops that degrade silently. More context → slower responses → more queued work → even more context.
Configuration edge cases with no validation. Two reasonable values combine to create an impossible state.
Missing observability for resource consumption. We instrument errors religiously. We almost never instrument growth.

The Boiling Frog Problem

AI agents are particularly susceptible to silent degradation:

LLM latency is inherently variable. A 200ms slowdown hides inside normal variance.
Agents are long-running. A web request that leaks memory is gone within 30 seconds and takes the leak with it. An agent session runs for hours, so the leak accumulates.
Users blame "the AI", not the infrastructure. When responses get worse, the model gets the blame, not the 90%-full context window.

What Agent Builders Should Do

1. Instrument growth, not just errors. Track transcript size, context utilization, compaction frequency. Alert on trends.
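One way to alert on a trend rather than a level is to fit a line through recent samples and estimate when the metric will hit its limit. This is a rough sketch, not a monitoring library: `GrowthTracker`, its fields, and the units (seconds and bytes) are all hypothetical choices for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GrowthTracker:
    """Tracks (timestamp, value) samples for one resource, e.g. transcript bytes."""
    limit: float
    samples: list[tuple[float, float]] = field(default_factory=list)

    def record(self, t: float, value: float) -> None:
        self.samples.append((t, value))

    def slope(self) -> float:
        """Least-squares growth rate in value units per second."""
        n = len(self.samples)
        if n < 2:
            return 0.0
        mean_t = sum(t for t, _ in self.samples) / n
        mean_v = sum(v for _, v in self.samples) / n
        var_t = sum((t - mean_t) ** 2 for t, _ in self.samples)
        if var_t == 0:
            return 0.0
        cov = sum((t - mean_t) * (v - mean_v) for t, v in self.samples)
        return cov / var_t

    def seconds_until_limit(self) -> Optional[float]:
        """Projected time until the limit is reached, or None if not growing."""
        rate = self.slope()
        if rate <= 0 or not self.samples:
            return None
        latest = self.samples[-1][1]
        return max(0.0, (self.limit - latest) / rate)


tracker = GrowthTracker(limit=9_000)
for t, size in [(0, 1_000), (60, 2_000), (120, 3_000)]:
    tracker.record(t, size)

eta = tracker.seconds_until_limit()
```

An alert rule can then be phrased as "page me if `eta` drops below an hour", which fires long before any hard limit does.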

2. Validate configuration state spaces. If two config values interact, validate their relationship. reserveTokensFloor >= contextWindow should be a startup error.
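A relationship check like this can live in the config object itself, so an impossible pairing fails at startup instead of hours into a session. The sketch below is an assumption about shape only: `CompactionConfig` and `ConfigError` are hypothetical names, and the fields are snake_case stand-ins for `contextWindow` and `reserveTokensFloor`.

```python
from dataclasses import dataclass


class ConfigError(ValueError):
    """Raised when config values are individually valid but jointly impossible."""


@dataclass(frozen=True)
class CompactionConfig:
    context_window: int
    reserve_tokens_floor: int

    def __post_init__(self) -> None:
        # Per-field checks first.
        if self.context_window <= 0:
            raise ConfigError("context_window must be positive")
        if self.reserve_tokens_floor < 0:
            raise ConfigError("reserve_tokens_floor must not be negative")
        # Cross-field check: the reserve must leave usable room in the window.
        if self.reserve_tokens_floor >= self.context_window:
            raise ConfigError(
                f"reserve_tokens_floor ({self.reserve_tokens_floor}) must be "
                f"smaller than context_window ({self.context_window})"
            )


ok = CompactionConfig(context_window=200_000, reserve_tokens_floor=20_000)

try:
    CompactionConfig(context_window=200_000, reserve_tokens_floor=200_000)
except ConfigError as exc:
    startup_error = str(exc)
```

The error message names both values on purpose: whoever hits it needs to see the pairing, not just the one field that happened to be checked last.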

3. Default to bounded. Every buffer, log, and transcript should have a default max size. Unbounded growth should be opt-in.
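Here is roughly what "bounded by default, unbounded by opt-in" could look like for a transcript. Everything in it is hypothetical: the class name, the 50K default, and the idea that the caller supplies a token count per entry. Dropping the oldest entries is just one possible pruning policy.

```python
from collections import deque
from typing import Optional


class BoundedTranscript:
    """Keeps the most recent entries within a token budget.

    Pass max_tokens=None to opt in to unbounded growth explicitly.
    """

    def __init__(self, max_tokens: Optional[int] = 50_000) -> None:
        self.max_tokens = max_tokens
        self._entries: deque[tuple[str, int]] = deque()
        self.total_tokens = 0
        self.dropped = 0  # exposed so pruning itself is observable

    def append(self, text: str, tokens: int) -> None:
        self._entries.append((text, tokens))
        self.total_tokens += tokens
        if self.max_tokens is None:
            return
        # Drop oldest entries until we fit, but always keep the newest one.
        while self.total_tokens > self.max_tokens and len(self._entries) > 1:
            _, old_tokens = self._entries.popleft()
            self.total_tokens -= old_tokens
            self.dropped += 1

    def __len__(self) -> int:
        return len(self._entries)


bounded = BoundedTranscript(max_tokens=100)
for i in range(4):
    bounded.append(f"turn {i}", tokens=40)

unbounded = BoundedTranscript(max_tokens=None)
for i in range(4):
    unbounded.append(f"turn {i}", tokens=40)
```

After four 40-token turns, the bounded transcript holds the last two and reports two drops, while the opt-in unbounded one holds all four.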

4. Make degradation visible. If response latency increases 50% over a session's lifetime, that should show up as a metric.
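A simple way to surface that number is to compare recent latency against the session's own early baseline. The sketch below uses medians so a single slow LLM call doesn't swing the result. `LatencyDrift` and its window size are hypothetical, and a real setup would export the ratio to whatever metrics system you already use.

```python
from collections import deque
from statistics import median
from typing import Optional


class LatencyDrift:
    """Compares recent response latency to the session's first `window` calls."""

    def __init__(self, window: int = 20) -> None:
        self.window = window
        self._baseline: list[float] = []
        self._recent: deque[float] = deque(maxlen=window)

    def record(self, seconds: float) -> None:
        # The first `window` samples define the baseline and never change.
        if len(self._baseline) < self.window:
            self._baseline.append(seconds)
        self._recent.append(seconds)

    def drift_ratio(self) -> Optional[float]:
        """Recent median / baseline median; None until the baseline is full."""
        if len(self._baseline) < self.window:
            return None
        base = median(self._baseline)
        if base <= 0:
            return None
        return median(self._recent) / base


drift = LatencyDrift(window=3)
for latency in [1.0, 1.0, 1.0]:
    drift.record(latency)
early = drift.drift_ratio()

for latency in [1.6, 1.6, 1.6]:
    drift.record(latency)
late = drift.drift_ratio()
```

With these samples the ratio starts at 1.0 and ends at 1.6, so a "50% slower than baseline" rule (ratio above 1.5) would fire.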

Two bugs, one blog post, zero error messages. The most dangerous bugs are the ones that don't tell you they exist.

DEV Community