Claude Code blocks the agent while compacting. LangGraph runs compaction in the background and silently drops messages. Aider spawns a background thread and hopes for the best. Async compaction sounds like the obvious optimization — until you try to build it.
We surveyed how major frameworks handle context compaction timing — synchronous, asynchronous, or not at all — and catalogued the concurrency hazards that emerge when you move compaction off the critical path. Here's what we found.
Why compaction blocks
Most frameworks run compaction synchronously. The agent stops, the LLM summarizes, the agent continues with a shorter context. It's slow but safe.
| Framework | Approach | Agent blocked | Race risk |
|---|---|---|---|
| Claude Code | Sync at 95% capacity | Yes | None |
| LangChain | Sync after turn | Yes | None |
| AutoGen | Sync between chats | Yes | None |
| Cursor | None (manual reset) | N/A | N/A |
| ChatGPT | None (manual) | N/A | N/A |
| Aider | Background thread | No | Medium |
| Google ADK | Async event-based | No | Medium |
| LangGraph | Async background | No | High |
Five of eight frameworks either block or don't compact at all. The industry has voted with its implementations: synchronous compaction is the safe default.
The cost is real. LLM summarization takes 2–10 seconds depending on context size and model. During that window, the agent can't respond. For interactive use cases (coding assistants, chatbots), that's a noticeable hang. For background automation, it barely matters.
The five concurrency hazards
Moving compaction to a background task introduces five categories of concurrency bugs. We found evidence of all five in production frameworks.
1. Stale snapshot
Compaction reads the current message history, sends it to an LLM for summarization, and waits for the result. During that wait, new messages arrive. The compacted summary doesn't include them.
When the summary replaces the original history, the new messages are silently lost.
LangGraph's documented race: history is rebuilt from a stale snapshot then fully replaced, dropping items recorded during the compaction window. The proposed fix — version counters and generation IDs — is not yet implemented.
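The race is easy to reproduce in a few lines of asyncio. This toy sketch (all names are ours, not LangGraph's) snapshots the history, "summarizes" it during a simulated LLM delay, then replaces the history wholesale:

```python
import asyncio

async def summarize(history: list[str]) -> str:
    # Stand-in for a slow LLM call.
    await asyncio.sleep(0.05)
    return f"summary of {len(history)} messages"

async def compact_naive(session: dict) -> None:
    snapshot = list(session["history"])   # 1. snapshot the history
    summary = await summarize(snapshot)   # 2. slow summarization
    session["history"] = [summary]        # 3. full replace; drops anything newer

async def main() -> dict:
    session = {"history": [f"msg {i}" for i in range(50)]}
    task = asyncio.create_task(compact_naive(session))
    await asyncio.sleep(0.01)             # let the snapshot happen first
    session["history"].append("use pnpm instead of yarn")
    await task
    return session

session = asyncio.run(main())
# The correction arrived mid-compaction and is silently gone.
assert "use pnpm instead of yarn" not in session["history"]
```

Any message appended between the snapshot and the replace is dropped with no error, which is exactly the failure mode described above.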
2. Silent message drop
This is the consequence of stale snapshots, but it deserves its own category because of how it manifests: the agent simply "forgets" recent context with no error, no warning, no log entry.
The user says "actually, use pnpm instead of yarn." Compaction starts. The compacted summary captures the pre-change state. The user's correction vanishes.
LangGraph's three-step async operation (snapshot → summarize → replace) can fail mid-way, leaving memory and disk out of sync. A partial failure means the summary was written but the old history wasn't fully removed — or vice versa.
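The disk half of that problem has a standard mitigation: write-then-rename. A sketch of the idea (ours, not LangGraph's actual persistence code) that makes the on-disk replace step atomic:

```python
import json
import os
import tempfile

def persist_history_atomic(path: str, history: list[str]) -> None:
    # Write the new history to a temp file in the same directory,
    # then rename over the old file. os.replace is atomic, so a
    # reader sees either the complete old state or the complete new
    # state, never a half-written summary.
    dir_ = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(history, f)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

This doesn't fix the memory/disk split by itself, but it removes the window in which a crash leaves a partially written file behind.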
3. Ordering violation
If multiple WHS services or agents compact in parallel, results arrive out of order. Service A compacts messages 1–50 while Service B compacts messages 30–60 with overlapping coverage. Which result wins? How do you merge overlapping compactions?
In single-service systems this is less likely. But in walrus — where memory, search, and channels are all WHS services that may declare the Compact capability — parallel compaction is a real scenario.
4. Failed rollback
Compaction produces a bad summary — it drops a critical fact, mischaracterizes a decision, or generalizes away an edge case. In synchronous compaction, you can validate before continuing. In async compaction, the agent has already acted on the pre-compaction context. By the time you detect the bad summary, the damage is done.
No framework we surveyed implements compaction rollback. The summary is treated as authoritative the moment it's produced.
5. Double compaction
Token threshold crossed → compaction starts in background → more messages arrive → threshold crossed again → second compaction starts. Two concurrent compactions now race on the same history.
LangGraph has no `max_compact_attempts` counter — infinite compaction retries are theoretically possible. The proposed fix includes a maximum attempt limit, but it's unimplemented.
[Interactive chart — see original post]
The chart tells a clear story: synchronous compaction (walrus's current approach) has zero concurrency risk. Every async implementation introduces hazards. LangGraph's are the most severe because its async design was retrofitted onto a system that assumed sequential execution.
How three frameworks handle async
Aider: background thread with weak model
Aider runs recursive summarization in a background thread using a cheaper "weak model" — a smaller, faster LLM that handles compression while the main model continues reasoning.
What works: the main agent is never blocked. Compaction cost is reduced by using a cheaper model. Recursive summarization (summary of summaries) keeps context compact over long sessions.
What's missing: no documented handling of what happens when the agent queries content that's currently being compacted. If the background thread hasn't finished and the agent needs the old context, it reads stale data or waits — defeating the purpose of async.
Google ADK: event-based async summarization
Google ADK triggers compaction via events and runs summarization asynchronously. The result is written back as a new event. A sliding window with overlap preserves the most recent messages.
What works: the event-based architecture means compaction is just another event in the stream. The overlap window (keeping the last N messages uncompacted) prevents the worst stale-snapshot problems — recent context always survives.
What's missing: ordering guarantees when events arrive during compaction are not documented. If the compaction event completes after several new user events, the insertion point matters. Google ADK doesn't specify whether the summary event is inserted at the position where compaction started or at the current head.
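The overlap window itself is simple to express. A sketch of the idea (parameter names are ours, not Google ADK's API):

```python
def split_for_compaction(history: list[str], keep_last: int = 10):
    """Split history into a compactable head and a protected tail.

    The last `keep_last` messages are never summarized, so recent
    context survives even if the summary lands late.
    """
    if len(history) <= keep_last:
        return [], list(history)
    return history[:-keep_last], history[-keep_last:]
```

Only the head is sent to the summarizer; the tail stays in the context verbatim, which is why the worst stale-snapshot cases can't touch recent messages.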
LangGraph: async with known race conditions
LangGraph attempts true async compaction but has documented concurrency bugs:
- Silent drop: items recorded during the compaction window are lost when history is fully replaced
- Partial failure: memory and disk can get out of sync if the three-step operation (snapshot → summarize → replace) fails mid-way
- Unbounded retries: no maximum compaction attempt counter
The proposed fixes are sound — version counters, atomic replacement, max attempts — but none are implemented as of March 2026. LangGraph is the clearest evidence that async compaction is harder than it looks.
[Interactive chart — see original post]
What MemGPT got right: don't compact in the background
MemGPT (now Letta) takes a radically different approach: the agent controls its own memory tiers, like an operating system managing physical and virtual memory. The LLM context window is "physical memory." External storage is "virtual memory." The agent explicitly moves information between tiers via function calls.
No background compaction. No race conditions. The agent decides what to archive and what to recall. This is the only framework we surveyed with zero concurrency hazards.
The trade is cognitive overhead: the agent spends tokens reasoning about memory management instead of the actual task. MemGPT's approach is elegant but expensive in a different currency — model attention rather than infrastructure complexity.
The walrus problem
Walrus currently compacts synchronously. The `on_compact()` hook blocks the agent loop while WHS services return compacted context — `tokio::task::block_in_place()` bridges the async/sync gap. Each service has a 10-second timeout. Safe, but the agent hangs.
Moving to async compaction would look like this:
1. Agent loop detects context threshold → fires `CompactSession` event
2. Background tokio task dispatches to all `Compact`-capable WHS services
3. Services return compacted prompt additions
4. Results stored in session as "pending compaction"
5. Next `on_before_run()` injects pending compaction into the prompt
6. Agent continues immediately after step 1
This design uses walrus's existing event infrastructure — `DaemonEvent` variants, `tokio::spawn()`, the task watcher pattern in the task registry. No `Hook` trait changes required.
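The shape of the flow can be sketched in Python, even though walrus itself is Rust and tokio; this is an illustration of the steps above, not the implementation, and every name in it is ours:

```python
import asyncio

class Session:
    def __init__(self) -> None:
        self.history: list[str] = []
        self.pending_compaction: str | None = None
        self.compaction_task: asyncio.Task | None = None

async def compact_services(snapshot: list[str]) -> str:
    # Stand-in for steps 2-3: dispatch to Compact-capable services.
    await asyncio.sleep(0.05)
    return f"[compacted: {len(snapshot)} messages]"

def maybe_fire_compaction(session: Session, threshold: int = 40) -> None:
    # Step 1: threshold check; allow at most one compaction in flight.
    if len(session.history) < threshold or session.compaction_task is not None:
        return
    snapshot = list(session.history)

    async def run() -> None:
        # Step 4: park the result; never touch history from here.
        session.pending_compaction = await compact_services(snapshot)
        session.compaction_task = None

    session.compaction_task = asyncio.create_task(run())
    # Step 6: the caller returns immediately and keeps working.

def on_before_run(session: Session, keep_last: int = 10) -> None:
    # Step 5: inject the pending summary at a turn boundary, keeping
    # the most recent messages so mid-compaction arrivals survive.
    if session.pending_compaction is not None:
        kept = session.history[-keep_last:]
        session.history = [session.pending_compaction] + kept
        session.pending_compaction = None
```

Note the deliberate split: the background task only parks its result, and only `on_before_run()` mutates history, at a turn boundary rather than mid-turn.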
But all five hazards apply:
Stale snapshot: messages arrive between event fire (step 1) and result injection (step 5). The compacted summary doesn't include them. Fix: keep a generation counter on the session history. Reject compaction results if the generation has advanced beyond a threshold.
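A minimal version of that generation-counter guard (a sketch of the proposed fix, not shipped code):

```python
class VersionedHistory:
    """History with a generation counter.

    Every mutation bumps `generation`; a compaction result is applied
    only if the history hasn't advanced past a tolerance since the
    snapshot was taken.
    """

    def __init__(self) -> None:
        self.messages: list[str] = []
        self.generation = 0

    def append(self, msg: str) -> None:
        self.messages.append(msg)
        self.generation += 1

    def snapshot(self) -> tuple[int, list[str]]:
        return self.generation, list(self.messages)

    def try_apply_compaction(self, snap_gen: int, summary: str,
                             tolerance: int = 0) -> bool:
        if self.generation - snap_gen > tolerance:
            return False        # history moved on; reject (or merge)
        self.messages = [summary]
        self.generation += 1
        return True
```

A nonzero tolerance trades strictness for throughput: a few messages may slip past the snapshot, at which point the merge strategy below takes over.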
Silent drop: if pending compaction replaces history naively, messages from steps 2–4 vanish. Fix: merge, don't replace. Append the compaction summary alongside messages received during the compaction window, not instead of them.
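The merge rule fits in one function, assuming the compactor records how many messages its snapshot covered:

```python
def merge_compaction(history: list[str], snapshot_len: int,
                     summary: str) -> list[str]:
    # The summary covers only the first `snapshot_len` messages, the
    # ones the compactor actually saw. Anything after that index
    # arrived during the compaction window and must be kept.
    during_window = history[snapshot_len:]
    return [summary] + during_window
```
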
Ordering: multiple WHS services may compact in parallel. Their results must be serialized. Fix: the existing RPC mutex on ServiceRegistry (already used for tool dispatch) can serialize compaction results. Alternatively, sequence compaction responses by service priority.
Failed rollback: a bad summary from a WHS service corrupts context. Fix: store pre-compaction history snapshot. If the agent detects degraded quality (a heuristic, not foolproof), restore from snapshot.
Double compact: threshold crossed again before first compaction completes. Fix: at most one compaction in flight per session. New threshold crossings set a "compact pending" flag but don't spawn another task.
[Interactive chart — see original post]
The timing chart shows why async is appealing: the agent is blocked for ~50ms (event dispatch) instead of ~5,000ms (full summarization). Total wall time is similar — the LLM still takes 5+ seconds — but the agent can work during that time.
Patterns that work
From surveying frameworks and Anthropic's context engineering guide, four patterns emerge:
1. Version counters — Track a generation ID on session history. When compaction starts, record the current generation. When results arrive, check if the generation has advanced. If it has, either reject the compaction or merge it with the new messages. Proposed for LangGraph but not yet implemented.
2. Overlapping windows — Never compact the last N messages. Google ADK uses this with its sliding window. Anthropic recommends raw context over compaction over summarization — keep as much original context as possible, especially recent messages.
3. Optimistic apply with validation — Apply the async compaction result, then run a quick validation: are key facts preserved? Does the summary mention the current task? If validation fails, roll back to pre-compaction history. This adds one more LLM call but catches the worst failures.
4. Throttled compaction — At most one compaction in flight per session. New threshold crossings queue, don't spawn. This prevents double compaction entirely and simplifies the state machine. Walrus's task registry already implements similar concurrency control with its queue-and-promote pattern.
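Pattern 3 can be sketched with a cheap string-matching validator standing in for the extra LLM call (a heuristic, as noted above; all names are ours):

```python
def validate_summary(summary: str, key_facts: list[str]) -> bool:
    # Cheap check: every key entity must survive compaction verbatim.
    # A real validator might use embeddings or an LLM call instead.
    return all(fact.lower() in summary.lower() for fact in key_facts)

def apply_with_rollback(session: dict, summary: str,
                        key_facts: list[str]) -> bool:
    backup = list(session["history"])       # pre-compaction snapshot
    session["history"] = [summary]          # optimistic apply
    if not validate_summary(summary, key_facts):
        session["history"] = backup         # roll back
        return False
    return True
```
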
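Pattern 4 reduces to a tiny state machine. A sketch of the queue-and-promote idea (names are ours, not walrus's task registry API):

```python
class CompactionThrottle:
    """At most one compaction in flight; later triggers set a flag.

    When the in-flight compaction finishes, a pending trigger is
    promoted to a fresh run instead of racing the first one.
    """

    def __init__(self) -> None:
        self.in_flight = False
        self.pending = False

    def trigger(self) -> bool:
        """Returns True if the caller should start a compaction now."""
        if self.in_flight:
            self.pending = True   # remember the crossing, don't spawn
            return False
        self.in_flight = True
        return True

    def finish(self) -> bool:
        """Returns True if a queued trigger should start a new run."""
        self.in_flight = False
        if self.pending:
            self.pending = False
            self.in_flight = True
            return True
        return False
```

Because `trigger()` never spawns while a run is in flight, double compaction is impossible by construction rather than by timing luck.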
Open questions
Is the latency savings worth the complexity? Sync compaction blocks for 2–10 seconds. For interactive agents, that's annoying. For background automation, it's irrelevant. How often does compaction actually happen in practice — once per session? Once per hundred turns? If it's rare, the engineering cost of async may not pay off.
Should results be applied immediately or at a natural break? Injecting compaction results mid-turn could confuse the agent. Waiting for a natural break (tool response, user message) is safer but adds latency. Where's the right insertion point?
Can you validate a compaction summary without another LLM call? Embedding similarity between pre- and post-compaction context could catch gross information loss. String matching for key entities could catch fact drops. Neither is as reliable as LLM-based validation, but both are cheaper.
How should async compaction appear in the task registry? Walrus's task registry tracks agent tasks as a live tree visible via `walrus ps`. Should background compaction appear as a task? A session annotation? Invisible infrastructure? Observability matters for debugging.

Does MemGPT's approach eliminate the need for async compaction entirely? If the agent controls its own memory paging, there's nothing to run in the background. The trade is cognitive overhead — but with capable models, that overhead shrinks. Is agent-controlled paging the endgame, making async compaction a transitional pattern?
Further reading
- Anthropic: Effective context engineering for AI agents
- MemGPT: Towards LLMs as Operating Systems — virtual context management
- LangGraph race conditions — documented concurrency bugs
- LangChain async memory issue — the original feature request
- ACON: Optimizing Context Compression — failure-driven compression for long-horizon agents
- Claude Code compaction docs — sync approach with automatic trigger
- Aider repository map — background summarization architecture
- Our context compaction survey covers the eight frameworks at an architectural level. This post goes deeper on the async-specific challenges.
- The persistent agent memory survey covers the broader memory architecture that compaction interacts with. Mem0's extraction pipeline faces similar async challenges. Hermes's FTS5 layer must also handle concurrent writes.
Originally published at OpenWalrus.