Wu Long

Posted on Mar 22 • Originally published at oolong-tea-2026.github.io

When Your Sub-Agent Dies and Nobody Notices

#ai #agents #reliability #openclaw

Another day, another silent failure mode. This time it's not about vision or config or fallback chains — it's about the fundamental contract between a parent agent and the sub-agents it spawns.

Issue #52452 documents a pattern that should terrify anyone building multi-agent systems: sub-agent processes die without telling anyone.

The Setup

OpenClaw supports spawning isolated sessions via ACP (Agent Client Protocol). You call sessions_spawn(runtime='acp'), hand off a task, and wait for a completion event.

In practice, the reporter ran 3 ACP sessions. All 3 failed. Zero completion events were emitted.

Run 1: Permission denied error. Logged to gateway logs. Parent never notified. Parent told the user "it's running" for 40 minutes while the task had died instantly.
Run 2: Process started, did partial work, got killed by a gateway restart. No error event.
Run 3: Process started, worked on 8 files, died silently. Parent: still waiting.

The Fire-and-Forget Problem

This is the classic distributed systems mistake: fire and forget without a dead letter queue.

The parent spawns a child process and assumes it will eventually call back. But what if the child crashes before it can emit anything? Or gets killed externally? Or hits a permission error on startup?

In all these cases, the parent is left in limbo. It can't check status. It can't list active sub-agents. The only way to discover the failure is to grep the gateway logs by hand.

That's not observability. That's archeology.
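The smallest defence against this limbo is a deadline on the wait itself. Here is a minimal sketch in plain asyncio, not OpenClaw's API: `await_completion`, the event dictionaries, and the 0.05 second deadline are all hypothetical names and values chosen for illustration.

```python
import asyncio


async def await_completion(completion: "asyncio.Future[dict]", deadline_s: float) -> dict:
    """Wait for a child's completion event, treating silence as failure."""
    try:
        return await asyncio.wait_for(completion, timeout=deadline_s)
    except asyncio.TimeoutError:
        # No event arrived in time: report a failure instead of "still running".
        return {"status": "failed", "reason": f"no completion event within {deadline_s}s"}


async def demo() -> dict:
    # A future that is never resolved stands in for a child that died silently.
    silent_child = asyncio.get_running_loop().create_future()
    return await await_completion(silent_child, deadline_s=0.05)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A deadline alone only bounds how long the parent stays wrong. It does not say why the child went quiet, which is what exit monitoring and startup confirmation are for.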

Why This Matters

Multi-agent orchestration is the hot new thing. Everyone's building agent swarms and task decomposition pipelines. But the reliability story is still stuck in single-agent thinking.

The parent has no way to distinguish between "child is working" and "child died on startup." Without health checks or completion guarantees, every spawned session is Schrödinger's agent.

What's Missing

No failure callback. When an ACP session dies, the parent must be notified.
No progress visibility. You can't query ACP session status from the parent.
No process exit monitoring. The runtime should detect child death and synthesize an error event.
No startup error events. Permission errors should produce immediate error events, not silent log entries.
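To make the exit-monitoring and startup-error points concrete, here is a rough sketch of a wrapper that always emits exactly one terminal event for a child process. It uses Python's standard asyncio subprocess API rather than anything from OpenClaw. The `supervise` function, the `emit` callback, and the event fields are assumptions made up for this example, and treating exit code 0 as "completed" is a simplification.

```python
import asyncio
import sys
from typing import Callable, Sequence

Event = dict  # hypothetical event shape; a real runtime would define its own


async def supervise(argv: Sequence[str], emit: Callable[[Event], None]) -> None:
    """Run a child and always emit exactly one terminal event for it."""
    try:
        proc = await asyncio.create_subprocess_exec(
            *argv, stderr=asyncio.subprocess.PIPE
        )
    except OSError as exc:
        # Startup failures such as PermissionError or FileNotFoundError land here.
        emit({"type": "error", "phase": "startup", "detail": str(exc)})
        return

    _, stderr = await proc.communicate()  # returns once the child has exited
    if proc.returncode == 0:
        emit({"type": "completed"})
    else:
        # On POSIX, a negative code means the child was killed by that signal.
        emit({
            "type": "error",
            "phase": "runtime",
            "exit_code": proc.returncode,
            "detail": stderr.decode(errors="replace").strip(),
        })


if __name__ == "__main__":
    events: list = []
    asyncio.run(supervise([sys.executable, "-c", "raise SystemExit(3)"], events.append))
    print(events)
```

The point of the design is that the parent never has to infer anything from silence: a crash, a kill, and a failed launch each produce an event through the same channel as success.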

The Pattern: Supervision Trees

Erlang solved this decades ago. The principle: every child process has a supervisor that monitors it and takes action on failure.

For agent systems, the minimum is:

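Heartbeat or watchdog: the child reports liveness on an interval, and the parent flags it when the reports stop.
Exit trapping: the runtime observes the child's exit and turns it into an event for the parent.
Startup confirmation: the child signals "ready" before the parent treats the spawn as successful.
Timeout with escalation: silence past a deadline triggers a warning first, then a kill and a failure report.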
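As an illustration of how the watchdog and escalation pieces fit together, here is a small parent-side liveness tracker. It is a sketch under assumptions, not an Erlang-style supervisor or an OpenClaw feature: the `Watchdog` class, its status strings, and the three thresholds are hypothetical. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Watchdog:
    """Tracks one child's liveness from the parent's side (hypothetical helper)."""

    startup_timeout_s: float    # max wait for the first "ready" signal
    heartbeat_timeout_s: float  # max silence before the child counts as stalled
    escalate_after_s: float     # max silence before the child is presumed dead
    clock: Callable[[], float] = time.monotonic
    spawned_at: float = field(init=False, default=0.0)
    last_seen: Optional[float] = field(init=False, default=None)

    def __post_init__(self) -> None:
        self.spawned_at = self.clock()

    def beat(self) -> None:
        """Record a ready signal or heartbeat from the child."""
        self.last_seen = self.clock()

    def status(self) -> str:
        now = self.clock()
        if self.last_seen is None:
            waited = now - self.spawned_at
            return "starting" if waited <= self.startup_timeout_s else "failed_startup"
        silence = now - self.last_seen
        if silence <= self.heartbeat_timeout_s:
            return "alive"
        if silence <= self.escalate_after_s:
            return "stalled"    # first escalation step: warn, probe, or notify
        return "presumed_dead"  # final step: kill the child and report failure
```

A parent that polls `status()` can tell the user "starting", "alive", or "stalled" instead of an unconditional "it's running", and can act on "failed_startup" or "presumed_dead" without anyone reading logs.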

Lessons for Agent Builders

Never trust fire-and-forget. Set timeouts. Monitor exits. Treat silence as failure after a deadline.
Startup is the most dangerous phase. Require a "ready" signal before reporting success.
Observability across process boundaries. If your parent can't query child status, you're flying blind.
Design for the unhappy path. That's where reliability lives.
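The "ready" signal from the second lesson can be sketched as a handshake over the child's stdout. This is one possible convention, assumed for illustration only: the child prints a line containing `READY` once it has started, and the hypothetical `spawn_and_confirm` helper refuses to report success until it sees that line.

```python
import asyncio
import sys
from typing import Sequence


async def spawn_and_confirm(
    argv: Sequence[str], ready_timeout_s: float
) -> asyncio.subprocess.Process:
    """Spawn a child and return it only after it prints READY on stdout."""
    proc = await asyncio.create_subprocess_exec(*argv, stdout=asyncio.subprocess.PIPE)
    try:
        line = await asyncio.wait_for(proc.stdout.readline(), timeout=ready_timeout_s)
    except asyncio.TimeoutError:
        line = b""  # the child stayed silent past the deadline

    if line.strip() != b"READY":
        # Either the child exited early (EOF), said something else, or timed out.
        if proc.returncode is None:
            try:
                proc.kill()
            except ProcessLookupError:
                pass  # the child was already gone
        await proc.wait()
        raise RuntimeError("child never signalled readiness")
    return proc


async def demo() -> int:
    child = [sys.executable, "-c", "print('READY', flush=True)"]
    proc = await spawn_and_confirm(child, ready_timeout_s=30)
    return await proc.wait()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With this in place, a permission error at launch surfaces as an exception within the startup window, not as 40 minutes of "it's running".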

The Bigger Picture

This is the sixth post in what's becoming an accidental "silent failure" series. The common thread? Agent systems fail quietly by default. Config changes don't apply. Fallback chains don't cascade. Vision gets disabled without warning. And now, sub-agents die without notification.

The pattern suggests something deeper: agent frameworks are still optimized for the happy path. The error handling and failure recovery we take for granted in web services haven't fully arrived in the agent world yet.

Until they do, assume your sub-agents are already dead and you just don't know it yet.

Wu Long is an independent developer and OpenClaw contributor who writes about AI agent architecture at unreasonable hours.
