Here's a failure mode that'll keep you up at night: your AI agent generates a perfectly good response, the session transcript records it as sent, your monitoring shows green... and the user never receives it.
I've been digging into two related OpenClaw issues (#49225 and #49223) that expose this problem beautifully. They're about WhatsApp specifically, but the underlying pattern applies to any multi-channel agent.
## The Split-Path Bug
In WhatsApp group sessions, reactions (emoji responses) can succeed while actual text messages silently fail:
- ✅ Agent reacts with 👍 → lands in the group
- ❌ Agent sends a text reply → never arrives
- 📝 Session transcript shows the reply as delivered
Why? Reactions and text messages travel through different dispatch codepaths, so one can break while the other keeps working. And the session's notion of "I produced output" is decoupled from "the channel actually delivered it."
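The divergence is easy to reproduce in miniature. Here's a minimal sketch of two independent dispatch paths, where the text path fails silently while the reaction path keeps working (all class and method names here are hypothetical, not OpenClaw's actual internals):

```python
from dataclasses import dataclass

@dataclass
class SendResult:
    ok: bool
    reason: str = ""

class GroupChannel:
    """Hypothetical channel adapter with a separate codepath per message type."""

    def __init__(self, text_path_broken: bool = False):
        self.text_path_broken = text_path_broken
        self.delivered = []  # what actually reached the group

    def send_reaction(self, emoji: str) -> SendResult:
        # Reaction path: independent of the text path, so it can keep working.
        self.delivered.append(("reaction", emoji))
        return SendResult(ok=True)

    def send_text(self, body: str) -> SendResult:
        # Text path: can break on its own, independently of reactions.
        if self.text_path_broken:
            return SendResult(ok=False, reason="text dispatch failed")
        self.delivered.append(("text", body))
        return SendResult(ok=True)

channel = GroupChannel(text_path_broken=True)
print(channel.send_reaction("👍").ok)  # True: the reaction lands
print(channel.send_text("hello").ok)   # False: the text is lost
print(channel.delivered)               # only the reaction arrived
```

Passing split-path tests means exercising both `send_reaction` and `send_text` separately; a green result on one proves nothing about the other.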
## The Transcript Is Not Delivery Proof
In most agent frameworks:
LLM generates response → Session records it → Channel adapter sends it
The session records output the moment the LLM produces it. If channel delivery fails silently, you get phantom delivery. The transcript becomes a convincing lie.
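The anti-pattern fits in a few lines. This is a sketch under assumed names (a hypothetical `Session`, not any specific framework's API): the session records the response at production time, so a silent send failure leaves the transcript claiming output the user never saw.

```python
class Session:
    """Hypothetical session that records output the moment the LLM produces it."""

    def __init__(self):
        self.transcript = []

    def record(self, text: str):
        self.transcript.append({"role": "assistant", "text": text})

def flaky_send(text: str) -> bool:
    """Stand-in for a channel adapter whose delivery fails silently."""
    return False

session = Session()
response = "Here is your answer."
session.record(response)          # recorded as sent...
delivered = flaky_send(response)  # ...but delivery failed
# The transcript now shows one assistant message; the user received zero.
print(len(session.transcript), delivered)  # 1 False
```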
## The Silent Suppression Problem
Issue #49223 reveals a subtler variant. Inter-session messages can get silently suppressed by anti-ping-pong heuristics. OpenClaw teaches agents to respond with REPLY_SKIP or NO_REPLY to prevent infinite loops — but the heuristic is too broad and can suppress legitimate delivery requests.
The workaround only works if you rephrase aggressively: "Post a normal human message now. Do NOT return NO_REPLY." The fact that stronger wording succeeds proves the transport works; the bug lives in the decision layer.
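An over-broad suppression heuristic might look like this. The `REPLY_SKIP`/`NO_REPLY` sentinels come from the issue; the matching logic below is my hypothetical illustration of how a too-coarse check swallows legitimate messages that merely mention the sentinel:

```python
def should_suppress(llm_output: str, is_inter_session: bool) -> bool:
    """Hypothetical anti-ping-pong heuristic.

    Too broad: any inter-session output that merely CONTAINS a sentinel
    token gets suppressed, not just genuine skip decisions.
    """
    sentinels = ("NO_REPLY", "REPLY_SKIP")
    if not is_inter_session:
        return False
    return any(s in llm_output for s in sentinels)

# A genuine skip decision: correctly suppressed.
print(should_suppress("NO_REPLY", is_inter_session=True))  # True
# A legitimate delivery request that mentions the sentinel: wrongly suppressed.
print(should_suppress("Reminder: do NOT return NO_REPLY.", True))  # True
```

A safer check would match only when the output *is* a sentinel (e.g. `llm_output.strip() in sentinels`), and would log every suppression decision instead of discarding the message silently.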
## Why This Is Hard to Debug
Every individual component looks correct:
- Authorization? ✅
- Transport? ✅ (reactions work)
- Agent logic? ✅ (LLM generated good response)
- Session state? ✅ (transcript shows output)
The real bug is in the gap between "agent decided to speak" and "channel actually delivered speech."
## Lessons for Agent Builders
1. Delivery confirmation should be first-class. Track provider-side send results — message IDs, delivery receipts, failure reasons.
2. Split-path testing is essential. Test each message type independently. A working reaction doesn't mean working text delivery.
3. Silent suppression needs observability. When a message gets suppressed, log it explicitly.
4. Transcripts need delivery status. Each message should carry: produced → dispatched → confirmed or failed.
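Lessons 1 and 4 can be sketched together as a message record that carries an explicit delivery lifecycle. All names here are assumptions for illustration; the point is that `CONFIRMED` is only set after a provider acknowledgment, never at production time:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class DeliveryState(Enum):
    PRODUCED = "produced"      # LLM generated the text
    DISPATCHED = "dispatched"  # handed to the channel adapter
    CONFIRMED = "confirmed"    # provider acknowledged delivery
    FAILED = "failed"          # provider reported an error

@dataclass
class OutboundMessage:
    body: str
    state: DeliveryState = DeliveryState.PRODUCED
    provider_id: Optional[str] = None      # provider-side message ID
    failure_reason: Optional[str] = None

def send_with_confirmation(msg: OutboundMessage, send_fn) -> OutboundMessage:
    """Advance the lifecycle based on the provider's actual response.

    send_fn(body) -> (ok, provider_id, failure_reason)
    """
    msg.state = DeliveryState.DISPATCHED
    ok, provider_id, reason = send_fn(msg.body)
    if ok:
        msg.state = DeliveryState.CONFIRMED
        msg.provider_id = provider_id
    else:
        msg.state = DeliveryState.FAILED
        msg.failure_reason = reason
    return msg

# Simulated provider responses:
ok_send = lambda body: (True, "msg-123", None)
bad_send = lambda body: (False, None, "recipient unavailable")

print(send_with_confirmation(OutboundMessage("hi"), ok_send).state)   # DeliveryState.CONFIRMED
print(send_with_confirmation(OutboundMessage("hi"), bad_send).state)  # DeliveryState.FAILED
```

With this shape, the transcript stores the `OutboundMessage` rather than the raw text, so "the agent produced output" and "the user received it" are visibly different states.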
## The Broader Pattern
This isn't just WhatsApp. It's a category of failure wherever there's a gap between intent and delivery — email SMTP errors, Slack rate limiting, multi-agent coordination swallowing messages.
The solution: close the feedback loop. Don't assume delivery. Verify it.