This afternoon, OpenClaw stopped responding.
Not crashed. Not throwing errors. Just silence. Then a tool call 30 seconds later. Then more silence. A response that came too late.
Root cause: context overflow at 119% of the model's limit — 156,000 tokens against a 131,072-token ceiling.
## What is a context window, exactly?
When you interact with an LLM, everything the model "knows" during a conversation lives in a fixed-size buffer called the context window, measured in tokens (~0.75 words per token).
For `venice/claude-sonnet-4-6`, the hard limit is 131,072 tokens.
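To make the budget concrete, here is a minimal sketch using the ~0.75 words-per-token heuristic mentioned above. The function names are my own; real token counts depend on the model's tokenizer, so this is an estimate, not a measurement.

```python
CONTEXT_LIMIT = 131_072  # hard ceiling for the model in this post

def estimate_tokens(text: str) -> int:
    """Rough token count: ~0.75 words per token."""
    words = len(text.split())
    return int(words / 0.75)

def context_usage(token_count: int, limit: int = CONTEXT_LIMIT) -> float:
    """Fraction of the window consumed; exceeds 1.0 on overflow."""
    return token_count / limit

print(f"{context_usage(156_000):.0%}")  # the overflow in this incident: 119%
```

Running it on the incident's 156,000 tokens reproduces the 119% figure from the logs.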
On every single message, my agent injects:
- `MEMORY.md` — long-term curated memory (~2,000 tokens)
- `USER.md`, `SOUL.md`, `AGENTS.md` — behavioral config (~3,500 tokens)
- System prompt and skill definitions (~11,000 tokens)
- Tool schemas — JSON definitions of every callable tool (~5,500 tokens)
- The full conversation history — this is the variable that grows unbounded
After a few hours of active conversation — especially with multiple tool calls (each generating input/output pairs in the context) — the history alone can hit 80,000+ tokens; in this incident it had grown to roughly 134,000. Add the ~22,000 tokens of fixed overhead and you're at 119%.
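The arithmetic, using the approximate figures listed above (the history size is the one implied by the 156k total, not a measured value):

```python
# Approximate fixed per-message overhead, from the list above.
FIXED_OVERHEAD = {
    "MEMORY.md": 2_000,
    "behavioral config": 3_500,
    "system prompt + skills": 11_000,
    "tool schemas": 5_500,
}
LIMIT = 131_072

fixed = sum(FIXED_OVERHEAD.values())  # ~22,000 tokens on every message
history = 134_000                     # implied by the 156k total observed
total = fixed + history

print(fixed)                    # 22000
print(f"{total / LIMIT:.0%}")   # 119%
```

The fixed overhead is constant; the history term is the only one that grows, and it grows on every exchange.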
## Why compaction failed
Compaction is the automatic process of summarizing old conversation to free up space. Think of it as the agent writing cliff notes of what was discussed, replacing the verbatim transcript with a compact summary.
The failure was a cascade:
⚙️ Compaction failed: Compaction cancelled • Context 156k/131k (119%)
⚙️ Compaction failed: Compaction cancelled • Context ?/1.0m
⚙️ Compaction failed: Compaction cancelled • Context 61k/131k (47%)
- Context was already over the limit when compaction was triggered
- Compaction itself requires tokens to generate the summary — no headroom
- The system retried with progressively smaller targets, each time failing
- Tool calls during recovery created additional context entries, making it worse
Classic resource exhaustion under load: the recovery mechanism needs the same resource it's trying to free.
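The deadlock can be sketched as a simple guard condition. This is illustrative, not OpenClaw's actual code, and the summary cost is an assumed figure: compaction has to emit its summary into the same window it is trying to shrink, so once usage is over the limit there is no headroom left to do it.

```python
LIMIT = 131_072
SUMMARY_COST = 4_000  # assumed tokens needed to generate the summary

def can_compact(used: int, limit: int = LIMIT, cost: int = SUMMARY_COST) -> bool:
    # Compaction itself must fit in the remaining headroom.
    return limit - used >= cost

print(can_compact(100_000))  # True: ~31k of headroom left
print(can_compact(156_000))  # False: already ~25k over the limit
```

Retrying with smaller targets doesn't help, because the guard fails on headroom, not on target size — hence the cascade in the logs above.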
## Real-world impact
- ~30 minutes of degraded responsiveness (30–60s per response)
- Repeated heartbeat failures — automated health checks timed out
- Partial memory loss — any learnings not written to disk were gone
- Interrupted workflows — active monitoring tasks broke mid-execution
The bots themselves were unaffected. They run on cron, independently of the AI agent. This separation of concerns was validated in the worst possible way — and it held.
## What I'm changing
Short term:
- Default to `venice/llama-3.3-70b` for routine tasks and heartbeats
- Reserve heavier models (claude-sonnet, gemini-pro) for complex reasoning only
- Truncate verbose tool outputs before they enter the context
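A minimal sketch of the truncation idea, with a hypothetical helper name and an assumed character budget. Keeping the head and tail of a tool output preserves the parts most likely to matter (the command echo and the final result) while dropping the bulky middle.

```python
def truncate_tool_output(output: str, max_chars: int = 4_000) -> str:
    """Cap a tool output before it enters the context.

    Keeps the head and tail; replaces the middle with a marker.
    """
    if len(output) <= max_chars:
        return output
    half = max_chars // 2
    omitted = len(output) - max_chars
    return output[:half] + f"\n… [{omitted} chars truncated] …\n" + output[-half:]

# Usage: a 10,000-char output is cut down before injection.
capped = truncate_tool_output("x" * 10_000)
```

A character budget is a crude proxy for tokens, but it's cheap to enforce at the point where tool results are appended to the history.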
Medium term:
- Trigger compaction at 60% capacity, not 100%
- Move periodic monitoring (bot health checks) to isolated cron sessions — they shouldn't pollute the main conversation context
- Daily memory flush routine: write key context to disk before the session grows too large
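The 60% trigger from the medium-term list reduces to a one-line threshold check. A sketch, not the actual implementation:

```python
LIMIT = 131_072
COMPACT_AT = 0.60  # trigger well below the hard ceiling, while headroom remains

def should_compact(used_tokens: int) -> bool:
    return used_tokens / LIMIT >= COMPACT_AT

print(should_compact(61_000))  # False: ~47%, still under threshold
print(should_compact(80_000))  # True: ~61%, summarize now
```

The point of the early threshold is that compaction still has tokens to work with — the exact failure mode above never arises if the trigger fires before the window fills.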
## The mental model shift
The AI context window is not free RAM.
Every injected token has a cost: latency, API cost, and eventual overflow risk. Treating it like unlimited working memory is the mistake I made here.
The right mental model is closer to a real-time database buffer with a hard flush deadline — you manage it proactively, or it manages you.
AI agents are not chatbots with longer conversations. They are stateful systems with real resource constraints. Context overflow is not an edge case — it's an expected failure mode in long-running agent sessions, and it needs to be engineered around like any other system resource.
Part of my build log — a public record of things I'm building, breaking, and learning at the intersection of AI, infrastructure, and Web3.