Seeing what your agents are doing: the task registry problem

#ai #architecture #opensource #openwalrus

The previous post settled the planning question:
plan mode is a prompt, not a runtime primitive. Plans belong in skills.

That left one problem open:

"When an agent dispatches a subagent, neither the parent agent nor the user
has visibility into what the subagent is doing."

This post covers that problem and walrus's solution.

The problem

Modern coding agents regularly spawn multiple subagents — a research agent,
a planning agent, several parallel implementation agents. Each runs its own
context window, makes its own tool calls, and maintains its own state.

From the outside, they're black boxes.

Claude Code emits two hook events: SubagentStart and SubagentStop.
That's lifecycle signaling — you know a subagent started and stopped, but
not what it's working on, whether it's stuck, or how much context it's consumed.

GitHub issue #24537, a request for an agent hierarchy dashboard, remains
open with no response.
The issue captures the real scenario precisely:

"The Claude Code conversation view was designed for a human talking to one
agent. It was never meant to be a control plane for 7 concurrent subagents
across multiple sessions."

A concrete example: you request a feature implementation across five files.
Claude spawns three parallel subagents — Explore, Plan, Bash — each making
30+ tool calls over ten minutes. In the current interface, you see
interleaved tool summaries with no way to answer: "Which agent is stuck?"
or "How much context has each consumed?"

How current systems handle it

[Interactive chart: see original post]

Claude Code gets partial credit for live task tracking — the TodoWrite
convention lets an agent report what it's doing, but only within a single
agent context. A parent agent can't read a subagent's todo list. The hook
events (SubagentStart/SubagentStop) are lifecycle signals, not task
state. Community workarounds exist (one project pipes hook events to an HTTP
server, stores them in SQLite, and streams them over WebSocket to a browser
dashboard), but nothing is built in.
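
To make the limitation concrete, here is a minimal sketch of what a lifecycle-only event log can answer. It is not the community project's code: the table layout, the record helper, and the agent_id field are all hypothetical, and the real hook payload may look different. Only the two event names come from the text above.

```python
import sqlite3

# Hypothetical schema: one row per hook event. Column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (agent_id TEXT, event TEXT, ts REAL)")


def record(agent_id: str, event: str, ts: float) -> None:
    """Store one lifecycle event as it arrives from a hook."""
    conn.execute("INSERT INTO events VALUES (?, ?, ?)", (agent_id, event, ts))


def running_agents() -> list:
    """Agents with more start events than stop events, sorted by id."""
    rows = conn.execute(
        "SELECT agent_id FROM events GROUP BY agent_id "
        "HAVING SUM(event = 'SubagentStart') > SUM(event = 'SubagentStop') "
        "ORDER BY agent_id"
    ).fetchall()
    return [row[0] for row in rows]


record("explore-1", "SubagentStart", 0.0)
record("plan-1", "SubagentStart", 1.0)
record("explore-1", "SubagentStop", 42.0)

still_running = running_agents()
```

The log can say that plan-1 is still running. It has nothing to say about what plan-1 is doing, because no event ever carried that information.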

Devin has good task visibility — a live progress list with the current
step, the ability to redirect mid-task, and an approval gate before
execution starts. But Devin is architecturally single-agent: the planning
and implementation happen in the same agent context. There's no subagent
hierarchy to expose.

Cursor runs background agents in Git worktrees with notifications on
completion. Worktree isolation is clean. What's missing: a unified view of
what each running agent is working on. You know agents are running; you
don't know what they're doing.

OpenTelemetry's GenAI SIG is standardizing spans and traces for AI
agent frameworks. That is useful for infrastructure teams, but not helpful
for a user asking "which subagent is stuck on this auth change?"

Why "just use a prompt" doesn't work

Plan mode works as a prompt because it's intra-agent behavioral guidance:
Claude is told "don't execute yet" and follows the instruction. The
instruction and the agent it affects are the same entity.

Cross-agent visibility is different. A subagent can be instructed to call
update_task("researching auth flow"). But who reads that call? The parent
agent isn't watching the subagent's context window. There's no shared
channel unless the runtime provides one.

This is the core asymmetry:

Prompts are intra-agent — they shape the behavior of the agent receiving them
Registries are inter-agent — they create a shared channel that parent agents and users can observe

You can't prompt your way to cross-agent observability. The runtime has to
provide the channel.

The walrus task registry

Walrus maintains an in-memory task registry as a first-class runtime
primitive — a concurrent hash map that lives in the walrus process.
Each entry records the agent's id, its parent, current status, a
plain-English summary, and timestamps.

The agent API is a single call: update_task(id, status, summary). An
agent calls this when it starts work, when it completes a step, and when
it hands off to a subagent. The registry records the parent-child
relationship automatically from the call context.
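
As an illustration only, the registry described above could be sketched as follows. This is not the walrus implementation: it uses a lock around a plain dict in place of a concurrent hash map, and TaskEntry, TaskRegistry, snapshot, and the dispatched_by context variable are hypothetical names. Only update_task(id, status, summary) and the recorded fields come from the text.

```python
import threading
import time
from contextvars import ContextVar
from dataclasses import dataclass, replace
from typing import Dict, Optional

# Hypothetical: id of the agent that dispatched the code now running.
# A runtime would set this when it spawns a subagent.
dispatched_by: ContextVar[Optional[str]] = ContextVar("dispatched_by", default=None)


@dataclass
class TaskEntry:
    agent_id: str
    parent_id: Optional[str]
    status: str
    summary: str
    created_at: float
    updated_at: float


class TaskRegistry:
    """Lock-protected, in-memory map from agent id to its latest task entry."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[str, TaskEntry] = {}

    def update_task(self, agent_id: str, status: str, summary: str) -> None:
        now = time.time()
        with self._lock:
            entry = self._entries.get(agent_id)
            if entry is None:
                # First report from this agent: take the parent from the call context.
                self._entries[agent_id] = TaskEntry(
                    agent_id, dispatched_by.get(), status, summary, now, now
                )
            else:
                entry.status, entry.summary, entry.updated_at = status, summary, now

    def snapshot(self) -> Dict[str, TaskEntry]:
        """Return copies so readers never hold references into the live map."""
        with self._lock:
            return {k: replace(v) for k, v in self._entries.items()}


registry = TaskRegistry()
registry.update_task("root", "running", "implementing feature")

token = dispatched_by.set("root")  # what a runtime might do when root spawns a subagent
registry.update_task("explore-1", "running", "researching auth flow")
registry.update_task("explore-1", "running", "reading session middleware")
dispatched_by.reset(token)

tasks = registry.snapshot()
```

The subagent never names its parent. The entry for explore-1 picks up "root" from the context the runtime set, which is one way to read "automatically from the call context".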

[Diagram: see original post]

walrus ps reads the registry and renders a live tree — agent id,
current summary, elapsed time, status — without any database round-trip.
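
A rough sketch of that rendering step, assuming the registry can be read as a mapping from agent id to its parent, status, summary, and elapsed time. The render_tree function, the Row shape, and the output layout are hypothetical and are not the actual walrus ps format.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical row shape: (parent id or None, status, summary, elapsed seconds).
Row = Tuple[Optional[str], str, str, float]


def render_tree(rows: Dict[str, Row]) -> str:
    """Render agents as an indented tree, children sorted by id."""
    children: Dict[Optional[str], List[str]] = {}
    for agent_id, (parent_id, _, _, _) in rows.items():
        # Agents whose parent is unknown are shown as roots rather than dropped.
        key = parent_id if parent_id in rows else None
        children.setdefault(key, []).append(agent_id)

    lines: List[str] = []

    def walk(agent_id: str, depth: int) -> None:
        _, status, summary, elapsed = rows[agent_id]
        lines.append(f"{'  ' * depth}{agent_id}  [{status}]  {elapsed:.0f}s  {summary}")
        for child in sorted(children.get(agent_id, [])):
            walk(child, depth + 1)

    for root in sorted(children.get(None, [])):
        walk(root, 0)
    return "\n".join(lines)


rows = {
    "root": (None, "running", "implementing feature", 612.0),
    "explore-1": ("root", "running", "researching auth flow", 540.2),
    "bash-1": ("root", "done", "ran test suite", 75.0),
}
output = render_tree(rows)
```

Everything the view needs is already in the map, so rendering is a single pass over in-memory entries.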

The registry is session-scoped and ephemeral. It lives in memory for the
duration of the session, which keeps reads microsecond-fast. When the
session ends, completed tasks flush to LanceDB as Episode nodes: durable,
queryable, part of the agent's history.
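
The end-of-session flush might look something like the sketch below. It deliberately avoids any LanceDB calls: write_episode is a hypothetical sink standing in for the durable store, and the "done" status value, the record shape, and the choice to discard unfinished entries are assumptions.

```python
from typing import Callable, Dict


def flush_completed(entries: Dict[str, dict], write_episode: Callable[[dict], None]) -> int:
    """Write each completed entry to the sink, then empty the registry.

    Returns the number of records written. Unfinished entries are discarded
    here, which is an assumption about what "ephemeral" means.
    """
    flushed = 0
    for agent_id, entry in entries.items():
        if entry["status"] == "done":  # "done" is a hypothetical status value
            write_episode({"kind": "Episode", "agent_id": agent_id, **entry})
            flushed += 1
    entries.clear()
    return flushed


episodes = []  # stand-in for the durable store
session = {
    "bash-1": {"status": "done", "summary": "ran test suite"},
    "explore-1": {"status": "running", "summary": "researching auth flow"},
}
count = flush_completed(session, episodes.append)
```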

This gives you three time horizons:

Now — walrus ps (live, in-memory, instant)
This session — full task tree available while the session runs
History — walrus memory show --episodes (graph, queryable across sessions)

What this enables

Live intervention. When a subagent is running in a loop or working on
the wrong thing, the user can identify it from walrus ps and cancel or
redirect. Without the registry, the only option is to kill the whole session.

Session replay. After a complex multi-agent run, the task tree (flushed
to the graph) is a structured record of what happened: which agent did what,
in what order, and what the outcome was. This is more useful than scrolling
through a conversation log.

Debugging stuck agents. Staleness is detectable from the last-updated
timestamp on each entry. walrus ps can surface warnings when an agent
hasn't reported progress for an unusually long time.
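
The staleness check reduces to comparing each running entry's last-updated timestamp against the current time. A minimal sketch, where find_stale, the "running" status value, and the threshold parameter are hypothetical:

```python
from typing import Dict, List, Tuple


def find_stale(
    entries: Dict[str, Tuple[str, float]], now: float, threshold_s: float
) -> List[Tuple[str, float]]:
    """Return (agent id, seconds silent) for running agents past the threshold.

    entries maps agent id to (status, last-updated timestamp in seconds).
    Results are ordered with the longest-silent agent first.
    """
    stale = [
        (agent_id, now - updated_at)
        for agent_id, (status, updated_at) in entries.items()
        if status == "running" and now - updated_at > threshold_s
    ]
    return sorted(stale, key=lambda item: item[1], reverse=True)


entries = {
    "explore-1": ("running", 990.0),
    "plan-1": ("running", 700.0),
    "bash-1": ("done", 100.0),
}
warnings = find_stale(entries, now=1000.0, threshold_s=120.0)
```

Finished agents are skipped, since an entry that stopped updating because the work is done is not stuck.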

Cost attribution. Per-agent context usage (a future registry field)
makes it possible to see which subagent consumed most of the session budget.
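
Since the usage field does not exist yet, the following is purely illustrative: given a hypothetical per-agent token count, attribution is a sort plus a share of the total.

```python
from typing import Dict, List, Tuple


def usage_share(tokens_by_agent: Dict[str, int]) -> List[Tuple[str, int, float]]:
    """Return (agent id, tokens, fraction of session total), largest first."""
    total = sum(tokens_by_agent.values())
    ranked = sorted(tokens_by_agent.items(), key=lambda item: item[1], reverse=True)
    # Guard against an empty or all-zero session to avoid dividing by zero.
    return [(agent, used, used / total if total else 0.0) for agent, used in ranked]


# Hypothetical numbers for a three-subagent session.
report = usage_share({"explore-1": 60_000, "plan-1": 15_000, "bash-1": 25_000})
```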

Open questions

Three design questions aren't settled yet and will be covered in follow-up posts:

Distributed sessions. If walrus runs multiple processes (client +
server, or multiple workers), the in-memory registry doesn't span processes.
This is a future problem — local single-process sessions are the target
for now — but the design should not make it hard to add a shared registry
later.

Approval gates. Should the task registry be the enforcement point for
user approval before destructive actions? When an agent reports it's about
to delete or modify something irreversible, should the runtime pause and
prompt? This connects to the permissions and sandboxing design, which is a
separate post.

Retention. How long should completed Episode nodes stay in the graph?
Forever is the obvious answer until you have a lot of sessions. Retention
policy, pruning, and export are part of the memory management design.