As an SDET testing AI applications, I recently encountered a bizarre issue in the OpenClaw Gateway UI. Instead of normal conversational text, the AI assistant started spitting out raw internal directive tags like `[[reply_to:<` and `[[reply` directly into the chat interface.
These tags are designed for internal message routing and should be silently stripped by the system before reaching the user. At first glance, it looked like a simple "dumb LLM" problem. But diving deeper, I uncovered a fascinating architectural trap: a perfect storm of three distinct bugs across the backend stream, the UI state logic, and the defense-in-depth strategy.
Here is how I debugged and fixed this cascading failure.
## The Investigation: Hunting the Leak
### Round 1: The Code Check
My first instinct was to check the stripping logic. The backend used a standard regex, `REPLY_TAG_RE`, to find and remove fully closed tags like `[[reply_to:123]]`. I wrote a quick test script, and the regex worked perfectly on complete tags. So why were they leaking?
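To illustrate, here is roughly what that check looked like. `REPLY_TAG_RE` below is my reconstruction, not the gateway's actual pattern, and `strip` is an illustrative helper:

```js
// Reconstruction of the closed-tag check — the real REPLY_TAG_RE may differ.
// This pattern only matches fully closed [[reply_to...]] tags.
const REPLY_TAG_RE = /\[\[\s*reply_to(?:_current|:\s*[^\]\n]*)\s*\]\]/gi;

const strip = (text) => text.replace(REPLY_TAG_RE, "");

console.log(strip("Hi [[reply_to:123]] there")); // closed tag removed
console.log(strip("Hi [[reply_to:"));            // partial tag passes untouched
```

On static, complete text this behaves exactly as expected, which is why the unit test gave it a clean bill of health.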
### Round 2: The Smoking Gun in the Session Logs
I bypassed the UI and checked the raw `.jsonl` session logs, which capture the LLM's unfiltered output. I found the exact payload for the failing prompt:
```json
{
  "role": "assistant",
  "content": [{"type": "text", "text": "[[reply_to_current]]\n\n[[reply_to:<id>]]\n\n...(repeats 100 times)...\n\n[[reply_to:<"}],
  "stopReason": "length"
}
```
Two massive clues here:

- The `stopReason` was `"length"` (truncated by `maxTokens`), not normal completion.
- The model had hallucinated and repeated the system prompt instructions until it ran out of tokens, leaving the final tag incomplete (`[[reply_to:<`).
### Round 3: The Streaming Epiphany
Then it hit me. LLMs don't output text all at once; they stream token by token.
When the model generates `[[reply_to:123]]`, the data flow looks like this:
- `Hello` (No tag -> Safe)
- `Hello [[re` (Regex fails to match -> LEAKED to UI)
- `Hello [[reply_to:` (Regex fails to match -> LEAKED to UI)
- `Hello [[reply_to:123]]` (Regex matches -> Stripped -> Safe)
The backend was broadcasting the "growing," incomplete tags to the frontend because the Regex only looked for fully closed brackets.
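The leak is easy to reproduce in a few lines. This is a minimal simulation, not the gateway's actual code; `REPLY_TAG_RE` and `stripClosed` are my own reconstructions:

```js
// Simulating the stream: the strip step only removes fully closed tags,
// so the "growing" intermediate buffers go out with the fragment intact.
const REPLY_TAG_RE = /\[\[\s*reply_to(?:_current|:\s*[^\]\n]*)\s*\]\]/gi;
const stripClosed = (s) => s.replace(REPLY_TAG_RE, "");

const chunks = ["Hello ", "[[re", "ply_to:", "123]]"];
let buffer = "";
const broadcast = [];
for (const chunk of chunks) {
  buffer += chunk;
  broadcast.push(stripClosed(buffer)); // what the UI actually receives
}
console.log(broadcast);
// Frames 2 and 3 still carry "[[re" and "[[reply_to:" — that's the leak.
```

Only the final frame, where the tag closes, comes out clean.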
## The Root Cause: A 3-Layered Trap
This wasn't just a regex failure. It was a combination of three isolated flaws that created an unrecoverable state:
**Layer 1: The Streaming Leak (Backend)**
The `stripInlineDirectiveTagsForDisplay()` function only removed closed tags. Intermediate fragments during the Server-Sent Events (SSE) stream slipped right through.

**Layer 2: The UI State Logic Trap (Frontend)**
Inside the frontend controller, the chat stream state update had a fatal assumption:
```js
if (!current || next.length >= current.length) {
  state.chatStream = next;
}
```
The UI assumed text only gets longer. So, when the backend leaked `Hello [[reply` (13 chars), the UI saved it. But when the backend finally received the full tag, stripped it, and sent the clean `Hello` (5 chars), the UI rejected the update because 5 < 13! The dirty tag became permanently stuck in the UI.
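The stuck state can be reproduced with a simplified sketch of that update logic. The guard mirrors the snippet above; the surrounding controller code is assumed:

```js
// Simplified sketch of the frontend state update and its
// monotonic-length assumption.
const state = { chatStream: "" };

function updateChatStream(next) {
  const current = state.chatStream;
  if (!current || next.length >= current.length) {
    state.chatStream = next; // only accepts same-length-or-longer text
  }
}

updateChatStream("Hello [[reply"); // leaked frame (13 chars) — accepted
updateChatStream("Hello");         // cleaned frame (5 chars) — silently dropped
console.log(state.chatStream);     // the dirty tag is now stuck
```

No later event ever shrinks the string back, so the leaked fragment survives until a full page reload.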
**Layer 3: No Defense in Depth (Frontend)**
The frontend completely trusted the backend to strip the tags and had no local sanitization function for assistant messages.
## The Fix
To solve this, I implemented fixes across the stack:
### 1. Catching Partial Streams (Backend)
I added a `PARTIAL_REPLY_TAG_RE` regex specifically to target unclosed tags at the very end of the streamed string:

```js
const PARTIAL_REPLY_TAG_RE = /\[\[\s*reply(?:_to(?:_current|:\s*[^\]\n]*)?)?\s*$/i;
```

Now, fragments like `[[reply` are stripped in real time before broadcasting.
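Running the pattern against the leaking fragments confirms the behavior. The `stripPartialTail` helper name is mine, not the gateway's:

```js
// Trims an unclosed [[reply... tag off the tail of the streamed buffer.
const PARTIAL_REPLY_TAG_RE = /\[\[\s*reply(?:_to(?:_current|:\s*[^\]\n]*)?)?\s*$/i;
const stripPartialTail = (s) => s.replace(PARTIAL_REPLY_TAG_RE, "");

console.log(stripPartialTail("Hello [[reply"));            // "Hello "
console.log(stripPartialTail("Hello [[reply_to:<"));       // "Hello "
console.log(stripPartialTail("Hello [[reply_to:123]] hi")); // unchanged: tag closed
```

Because the pattern is anchored with `$`, it only fires on a fragment at the very end of the buffer, so closed tags mid-string are left for the existing closed-tag regex.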
### 2. Implementing Defense in Depth (Frontend)
I updated the frontend `processMessageText()` function to independently run the stripping utility, ensuring that even if a dirty payload somehow bypasses the gateway, the UI sanitizes it before rendering.
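A sketch of what that client-side layer can look like, with both patterns mirrored on the frontend. The function body is an assumption; the real `processMessageText()` handles more than tag stripping:

```js
// Both regexes duplicated client-side so the UI never has to trust
// that the gateway already sanitized the text.
const REPLY_TAG_RE = /\[\[\s*reply_to(?:_current|:\s*[^\]\n]*)\s*\]\]/gi;
const PARTIAL_REPLY_TAG_RE = /\[\[\s*reply(?:_to(?:_current|:\s*[^\]\n]*)?)?\s*$/i;

function processMessageText(text) {
  return text
    .replace(REPLY_TAG_RE, "")          // closed tags anywhere in the string
    .replace(PARTIAL_REPLY_TAG_RE, "")  // an unclosed tag at the tail
    .trim();
}
```

Feeding it the truncated payload from the session logs, e.g. `processMessageText("[[reply_to_current]]\n\nHello [[reply_to:<")`, yields a clean `"Hello"` even though the input ends mid-tag.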
## The SDET Takeaway
- **Streaming architectures break traditional parsing:** When dealing with SSE or WebSockets in AI apps, you must account for intermediate "growing" states. A regex that works on static text will often fail on a stream.
- **Text length is a dangerous state metric:** Never assume an LLM's output string will monotonically increase in length. Formatting, redaction, or tag stripping will shrink the string, breaking `next.length >= current.length` update logic.
- **Session logs are your best friend:** When the UI misbehaves, don't guess. Go straight to the raw JSON logs. A `stopReason: "length"` is a massive red flag.
By applying defense-in-depth, we ensured that this UI bug is gone for good.