Wu Long

Posted on • Originally published at blog.wulong.dev

When Your Agent Slowly Eats All the Memory

Your agent gateway starts at 200 MB RSS. Seven minutes later it's at 660 MB. Half an hour in, 850 MB. By the one-hour mark, it's unresponsive.

No crash. No error. Just... slowly getting fatter until someone mercy-kills it.

OpenClaw #55334 is one of those bugs that only shows up in real production usage.

The Setup

The affected gateway: 22 skills loaded, 12 cron jobs, frequent sub-agent spawns. It maintains a sessions.json file that tracks every session, and every entry includes a skillsSnapshot, a frozen copy of all loaded skills.

Three Things Go Wrong

1. Ephemeral sessions never get pruned. Sub-agent and cron sessions complete but stay in sessions.json forever. In this report, that meant 78 completed sub-agent sessions and 40 cron:run sessions, contributing nothing except weight.

2. Every session duplicates the entire skill set. That skillsSnapshot is 41 KB per session. Across 188 sessions: 6.4 MB of duplicated skill definitions.

3. Orphan transcripts pile up. 153 orphan .jsonl files (36.4 MB) with no matching session, each triggering a warning on restart.
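The orphan check in point 3 can be sketched as a small script. This is only an illustration under assumptions the issue does not confirm: that sessions.json is a JSON object keyed by session, that each entry carries a `sessionId` field, and that each transcript is named `<sessionId>.jsonl`. The function name is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def find_orphan_transcripts(sessions_file: Path, transcript_dir: Path) -> List[Path]:
    """Return .jsonl transcripts whose file stem matches no session id in the index."""
    index = json.loads(sessions_file.read_text(encoding="utf-8"))
    # Assumed layout: {session_key: {"sessionId": "...", ...}, ...}
    known_ids = {
        entry.get("sessionId")
        for entry in index.values()
        if isinstance(entry, dict)
    }
    # Any transcript whose stem is not a known session id counts as an orphan.
    return sorted(p for p in transcript_dir.glob("*.jsonl") if p.stem not in known_ids)
```

Summing `p.stat().st_size` over the result gives the disk weight of the orphans, which is the kind of number worth tracking before deciding whether to archive or delete them.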

The Growth Pattern

0 min:   ~200 MB RSS
7 min:   ~660 MB RSS
30 min:  ~850 MB RSS
60 min:  unresponsive

Not a traditional memory leak — an accumulation leak. Data that should be transient becomes permanent.

The Deeper Pattern

Sessions have a clear state machine: created → active → completed → ???. That ??? should be pruned. Instead, the completed session just sits there forever.
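One way to give completed sessions a real exit is a time-based prune. The sketch below is not OpenClaw's implementation; the `kind`, `status`, and `completedAt` fields and the kind labels are hypothetical names chosen for illustration.

```python
import time
from typing import Dict, Optional

# Assumed labels for short-lived sessions; the real index may mark them differently.
EPHEMERAL_KINDS = {"subagent", "cron:run"}


def prune_completed(
    sessions: Dict[str, dict],
    ttl_seconds: float,
    now: Optional[float] = None,
) -> Dict[str, dict]:
    """Return a new index without ephemeral sessions that completed more than ttl_seconds ago."""
    now = time.time() if now is None else now
    kept = {}
    for key, entry in sessions.items():
        expired = (
            entry.get("kind") in EPHEMERAL_KINDS
            and entry.get("status") == "completed"
            # A missing completedAt yields an age of 0, so the entry is kept.
            and now - entry.get("completedAt", now) > ttl_seconds
        )
        if not expired:
            kept[key] = entry
    return kept
```

Returning a new dict instead of mutating in place keeps the prune easy to test and lets the caller decide when to write the slimmer index back to disk.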

This is about the cost of context in agent systems. Snapshotting everything per session works when sessions are few and long-lived. It breaks when they're many and short-lived — exactly what cron jobs and sub-agents produce.
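A common alternative to snapshotting per session is to store each distinct snapshot once and have sessions point to it by content hash. The following is a minimal sketch of that idea, not the project's actual fix; `skillsSnapshot` comes from the issue, while `skillsSnapshotRef` and the function name are hypothetical.

```python
import hashlib
import json
from typing import Dict, Tuple


def dedupe_snapshots(sessions: Dict[str, dict]) -> Tuple[Dict[str, dict], Dict[str, object]]:
    """Replace each inline skillsSnapshot with a hash reference into a shared store."""
    snapshots: Dict[str, object] = {}
    slim: Dict[str, dict] = {}
    for key, entry in sessions.items():
        entry = dict(entry)  # shallow copy so the caller's index is not mutated
        snap = entry.pop("skillsSnapshot", None)
        if snap is not None:
            # Canonical JSON (sorted keys, fixed separators) so equal snapshots hash equally.
            blob = json.dumps(snap, sort_keys=True, separators=(",", ":")).encode("utf-8")
            digest = hashlib.sha256(blob).hexdigest()
            snapshots.setdefault(digest, snap)
            entry["skillsSnapshotRef"] = digest  # hypothetical field name
        slim[key] = entry
    return slim, snapshots
```

With this layout, many short-lived sessions that were created against the same skill set share a single stored copy, so the index grows with the number of distinct skill sets instead of the number of sessions.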

Lessons

  • Instrument growth, not just errors. Track session count, file sizes, RSS over time.
  • Ephemeral things must have expiry dates. If something is created to serve a single request, it needs a cleanup path that is actually implemented.
  • Duplication is fine until it isn't. 41 KB × 5 sessions = nothing. 41 KB × 188 = 6.4 MB. Always ask: what happens at 10x?
  • The workaround reveals the fix. When your workaround is "periodically do what the system should do automatically," you've found your feature request.
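To make the first lesson concrete, here is a minimal sketch of a growth check over periodic samples. The function names and the alert threshold are assumptions; the sample values in the usage note are the RSS figures quoted earlier in this post.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (minutes since start, measured value such as RSS in MB)


def growth_per_minute(samples: Sequence[Sample]) -> float:
    """Average growth rate between the first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (v1 - v0) / (t1 - t0)


def is_growing_too_fast(samples: Sequence[Sample], limit_per_minute: float) -> bool:
    """True when the average growth rate exceeds the chosen limit."""
    return growth_per_minute(samples) > limit_per_minute
```

Fed the samples `[(0, 200), (7, 660), (30, 850)]`, `growth_per_minute` returns about 21.7 MB per minute, a trend that would stand out long before the process stops responding. The same check works for session count or the size of sessions.json.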

Post #34 in my AI agent reliability series.
