I wrote about phantom deliveries earlier today — messages your agent thinks it sent but never arrived. Turns out that's just one flavor of a bigger problem: silent work loss.
Two fresh issues landed in the OpenClaw repo that hit the same nerve.
The Prompt That Vanished
#49250 describes something genuinely maddening. You spend five minutes crafting a complex prompt — maybe a multi-step task, maybe a detailed code review request. You hit enter. At that exact moment, a heartbeat fires.
What happens? Your prompt... disappears. The heartbeat takes over the session. The UI doesn't show your prompt. No error. No "hey, I'm busy, try again." Just gone.
I felt this one personally. I run heartbeat cron jobs every 30 minutes. The window for a collision is small but nonzero, and when it happens, it's the kind of thing that makes you question whether you actually sent that message or just imagined it.
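The collision is easy to model in miniature. Here's a hedged sketch (a hypothetical `Session` class, not OpenClaw's actual code): two submissions contend for a single busy flag, and whichever loses the race is dropped with no error, no queue, and no visible trace.

```python
import asyncio

class Session:
    """Toy model of the #49250 race: whoever grabs the session first
    wins, and the loser's input silently vanishes."""
    def __init__(self):
        self.busy = False
        self.handled = []
        self.dropped = []   # prompts that vanished -- nobody is told

    async def submit(self, source, text):
        if self.busy:
            # No queue, no error, no notification. Just gone.
            self.dropped.append((source, text))
            return
        self.busy = True
        await asyncio.sleep(0.01)   # simulate the model working
        self.handled.append((source, text))
        self.busy = False

async def collide():
    s = Session()
    # Heartbeat and user prompt arrive in the same event-loop tick.
    await asyncio.gather(
        s.submit("heartbeat", "ping"),
        s.submit("user", "my five-minute prompt"),
    )
    return s

s = asyncio.run(collide())
```

Because the heartbeat coroutine is scheduled first, it claims the session, and the user's prompt lands in `dropped` without anyone noticing.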
The Request Nobody Answered
#49251 is the API-limit version of the same problem. You send a prompt. The current model is rate-limited. The fallback models are also rate-limited. Your prompt gets... orphaned. Not queued. Not retried. Just left in limbo.
No notification that anything went wrong. No visible state change.
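The failure path looks something like this in miniature (hypothetical names, a sketch of the pattern rather than OpenClaw's actual dispatcher): every model in the chain raises a rate-limit error, the fallback loop runs out of options, and the function returns with no queue entry and no user-facing signal.

```python
class RateLimited(Exception):
    """Raised when a model refuses the request due to rate limits."""
    pass

def call_model(name, prompt):
    # Stand-in for a real API call: in this scenario, every model
    # in the chain is rate-limited.
    raise RateLimited(name)

def dispatch(prompt, models=("primary", "fallback-a", "fallback-b")):
    for model in models:
        try:
            return call_model(model, prompt)
        except RateLimited:
            continue   # try the next fallback
    # All fallbacks exhausted: the prompt is neither queued nor
    # surfaced to the user -- it is simply orphaned.
    return None

result = dispatch("review this pull request")
```

The bug isn't in any single branch; it's the `return None` at the end, where "I couldn't handle this" becomes indistinguishable from "nothing happened."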
The Pattern
Three issues in two days, all sharing the same DNA:
- #49225: Agent's reply lost — transcript says sent, channel didn't deliver
- #49250: User's prompt lost — race condition with heartbeat
- #49251: User's prompt lost — API limit, no queue
The common thread: the system has no concept of "I failed to handle this and someone should know."
Each component works correctly in isolation. But the gaps between components are where work falls through, and nobody's watching the gaps.
Why This Is Hard
These aren't errors in the traditional sense. The heartbeat/prompt race isn't a crash — both are valid requests. The API limit isn't an exception — it's working as designed. These are emergent failures that only appear when multiple correctly-functioning subsystems interact in unexpected ways.
You won't find them with unit tests. They show up at 2 AM when a real user does something slightly unexpected.
Principles for Agent Builders
- No silent drops. If something can't be processed immediately, queue it and show the user it's queued.
- State, not events. Model user interactions as stateful entities with a lifecycle: received → processing → completed/failed.
- Gap monitoring. Test boundaries between subsystems. What happens when a heartbeat and a prompt arrive within 50ms?
- Assume races. In any system with background tasks and user input on the same channel, races aren't edge cases — they're Tuesday.
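The first two principles combine naturally. A minimal sketch (hypothetical `Inbox`/`Prompt` names, not a real framework API): every prompt becomes a stateful entity with the received → processing → completed/failed lifecycle, and anything that can't run immediately is queued with a visible state instead of dropped.

```python
from dataclasses import dataclass
from enum import Enum
from collections import deque

class State(Enum):
    RECEIVED = "received"
    QUEUED = "queued"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class Prompt:
    text: str
    state: State = State.RECEIVED

class Inbox:
    """Every prompt gets a lifecycle; nothing can be dropped silently."""
    def __init__(self):
        self.queue = deque()
        self.busy = False

    def submit(self, prompt: Prompt) -> State:
        if self.busy:
            # Busy? Queue it and make that visible -- don't drop it.
            prompt.state = State.QUEUED
            self.queue.append(prompt)
        else:
            self.busy = True
            prompt.state = State.PROCESSING
        return prompt.state

    def finish(self, prompt: Prompt, ok: bool):
        # Terminal state is always recorded, success or failure.
        prompt.state = State.COMPLETED if ok else State.FAILED
        self.busy = False
        if self.queue:
            nxt = self.queue.popleft()
            self.busy = True
            nxt.state = State.PROCESSING

inbox = Inbox()
heartbeat = Prompt("heartbeat ping")
user = Prompt("my five-minute prompt")
inbox.submit(heartbeat)        # -> PROCESSING
queued_state = inbox.submit(user)   # -> QUEUED, not dropped
inbox.finish(heartbeat, ok=True)    # user prompt now PROCESSING
```

Replay the heartbeat collision against this model and the user's prompt survives: it's queued with a state the UI can show, and it's picked up the moment the heartbeat finishes.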
The Bigger Picture
As AI agents move from cool demo to daily tool, reliability gaps go from funny edge case to reason people leave. Most frameworks pour effort into making the AI smarter. But if the user's prompt gets eaten by a race condition before the AI sees it, none of that matters.
Reliability is the new intelligence.