DEV Community: David Yan

When AI Leaks Internal Tags: Debugging a 3-Layer Streaming Architecture Bug

David Yan — Fri, 03 Apr 2026 09:17:32 +0000

As an SDET testing AI applications, I recently encountered a bizarre issue in the OpenClaw Gateway UI. Instead of normal conversational text, the AI assistant started spitting raw internal directive tags, such as "[[reply_to:<" and "[[reply", straight into the chat interface.

These tags are designed for internal message routing and should be silently stripped by the system before reaching the user. At first glance, it looked like a simple "dumb LLM" problem. But diving deeper, I uncovered a fascinating architectural trap: a perfect storm of three distinct bugs across the backend stream, the UI state logic, and the defense-in-depth strategy.

Here is how I debugged and fixed this cascading failure.

The Investigation: Hunting the Leak

Round 1: The Code Check

My first instinct was to check the stripping logic. The backend used a regular expression, REPLY_TAG_RE, to find and remove fully closed tags like [[reply_to:123]]. I wrote a quick test script, and the regex worked perfectly on complete tags. So why were they leaking?

Round 2: The Smoking Gun in the Session Logs

I bypassed the UI and checked the raw .jsonl session logs containing the LLM's raw output. I found the exact payload for the failing prompt:

{
  "role": "assistant",
  "content": [{"type": "text", "text": "[[reply_to_current]]\n\n[[reply_to:<id>]]\n\n...(repeats 100 times)...\n\n[[reply_to:<"}],
  "stopReason": "length"
}

Two massive clues here:

The stopReason was "length" (truncated by maxTokens), not normal completion.
The model had hallucinated and repeated the system prompt instructions until it ran out of tokens, leaving the final tag incomplete ([[reply_to:<).

Round 3: The Streaming Epiphany

Then it hit me. LLMs don't output text all at once; they stream token by token.
When the model generates [[reply_to:123]], the data flow looks like this:

Hello (No tag -> Safe)
Hello [[re (Regex fails to match -> LEAKED to UI)
Hello [[reply_to: (Regex fails to match -> LEAKED to UI)
Hello [[reply_to:123]] (Regex matches -> Stripped -> Safe)

The backend was broadcasting the "growing," incomplete tags to the frontend because the Regex only looked for fully closed brackets.
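To make the leak concrete, here is a minimal Python sketch of the data flow above. The pattern CLOSED_TAG_RE is a hypothetical stand-in for REPLY_TAG_RE (the real pattern is not reproduced in this post), and the snapshots are the four states listed earlier.

```python
import re

# Hypothetical stand-in for the gateway's REPLY_TAG_RE (the real pattern is not shown here).
# It only matches tags that are already closed with "]]".
CLOSED_TAG_RE = re.compile(r"\[\[\s*reply_to(?:_current|:[^\]\n]*)\]\]", re.IGNORECASE)


def strip_closed_tags(text: str) -> str:
    """Remove fully closed reply tags; partial tags are left untouched."""
    return CLOSED_TAG_RE.sub("", text)


# Each entry is the full text accumulated so far, as a streaming backend would see it.
snapshots = [
    "Hello ",
    "Hello [[re",
    "Hello [[reply_to:",
    "Hello [[reply_to:123]]",
]

displayed = [strip_closed_tags(s) for s in snapshots]
for raw, shown in zip(snapshots, displayed):
    status = "LEAKED" if "[[" in shown else "safe"
    print(f"{raw!r:28} -> {shown!r:22} {status}")
```

Only the first and last snapshots come out clean; the two intermediate ones pass through unchanged, which is exactly what the UI ended up rendering.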

The Root Cause: A 3-Layered Trap

This wasn't just a regex failure. It was a combination of three isolated flaws that created an unrecoverable state:

Layer 1: The Streaming Leak (Backend)
The stripInlineDirectiveTagsForDisplay() function only removed closed tags. Intermediate fragments during the Server-Sent Events (SSE) stream slipped right through.
Layer 2: The UI State Logic Trap (Frontend)
Inside the frontend controller, the chat stream state update had a fatal assumption:

   if (!current || next.length >= current.length) {
     state.chatStream = next;
   }

The UI assumed text only gets longer. So, when the backend leaked Hello [[reply (13 chars), the UI saved it. But when the backend finally received the full tag, stripped it, and sent the clean Hello (6 chars, counting the trailing space), the UI rejected the update because 6 < 13! The dirty tag became permanently stuck in the UI.
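The trap is easy to reproduce outside the browser. The sketch below mirrors the length guard in Python; guarded_update and replay are illustrative names, not functions from the OpenClaw codebase. The "always take the latest snapshot" variant is only one possible alternative, and it assumes every event carries the full sanitized text and that events arrive in order.

```python
def guarded_update(current: str, nxt: str) -> str:
    """Python mirror of the length guard: keep `nxt` only if it is not shorter."""
    if not current or len(nxt) >= len(current):
        return nxt
    return current


def replay(updates, apply):
    """Feed a sequence of full-text snapshots through an update function."""
    state = ""
    for text in updates:
        state = apply(state, text)
    return state


# Leaked partial tag first, then the cleaned text once the tag was closed and stripped.
updates = ["Hello ", "Hello [[reply", "Hello "]

stuck = replay(updates, guarded_update)
# Alternative under the stated assumption: always trust the latest snapshot.
latest = replay(updates, lambda current, nxt: nxt)

print(repr(stuck))
print(repr(latest))
```

With the guard in place the final state keeps the leaked fragment; without it the clean text wins. Whether dropping the guard is safe depends on why it was added in the first place, for example to protect against out-of-order events.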

Layer 3: No Defense in Depth (Frontend)
The frontend completely trusted the backend to strip the tags and had no localized sanitization function for assistant messages.

The Fix

To solve this, I implemented fixes across the stack:

1. Catching Partial Streams (Backend)
I added a PARTIAL_REPLY_TAG_RE regex specifically to target unclosed tags at the very end of the string stream:

const PARTIAL_REPLY_TAG_RE = /\[\[\s*reply(?:_to(?:_current|:\s*[^\]\n]*)?)?\s*$/i;

Now, strings like [[reply are stripped in real-time before broadcasting.
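A quick way to see what this pattern does and does not cover is to run it against a handful of stream tails. The following is a Python port of the JavaScript pattern, written for illustration; strip_partial_tail is a hypothetical helper name.

```python
import re

# Python port of the PARTIAL_REPLY_TAG_RE pattern quoted above (originally JavaScript).
PARTIAL_REPLY_TAG_RE = re.compile(
    r"\[\[\s*reply(?:_to(?:_current|:\s*[^\]\n]*)?)?\s*$",
    re.IGNORECASE,
)


def strip_partial_tail(text: str) -> str:
    """Drop an unclosed reply tag if one sits at the very end of the text."""
    return PARTIAL_REPLY_TAG_RE.sub("", text)


fragments = [
    "Hello [[reply",
    "Hello [[reply_to:",
    "Hello [[reply_to:<",
    "Hello [[reply_to_current",
    "Hello [[re",            # cut in the middle of "reply"
    "Hello [[reply_to_cur",  # cut in the middle of "_current"
    "Hello [[reply_to:123]",  # only one closing bracket so far
]

results = {f: strip_partial_tail(f) for f in fragments}
for fragment, cleaned in results.items():
    print(f"{fragment!r:30} -> {cleaned!r}")
```

The first four fragments are reduced to "Hello ". The last three pass through unchanged, so if the stream can be cut in the middle of a keyword or between the two closing brackets, the pattern as written would probably need to be widened to cover those tails too.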

2. Implementing Defense in Depth (Frontend)
I updated the frontend processMessageText() function to independently run the stripping utility, ensuring that even if a dirty payload somehow bypasses the gateway, the UI sanitizes it before rendering.

The SDET Takeaway

Streaming architectures break traditional parsing: When dealing with SSE or WebSockets in AI apps, you must account for intermediate "growing" states. A regex that works on static text will often fail on a stream.
Text length is a dangerous state metric: Never assume an LLM's output string will monotonically increase in length. Formatting, redaction, or tag stripping will shrink the string, breaking next.length >= current.length update logic.
Session Logs are your best friend: When the UI misbehaves, don't guess. Go straight to the raw JSON logs. A stopReason: "length" is a massive red flag.
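The last takeaway can be automated. Here is a small sketch that scans session-log lines for assistant entries that were cut off by the token limit. It assumes one JSON object per line with top-level role, content, and stopReason fields, as in the payload shown earlier; the "stop" value in the sample data is a placeholder for a normal completion.

```python
import json


def find_truncated_replies(lines):
    """Return (line_number, tail_of_text) for assistant entries with stopReason "length"."""
    hits = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("role") == "assistant" and entry.get("stopReason") == "length":
            text = "".join(
                block.get("text", "")
                for block in entry.get("content", [])
                if isinstance(block, dict) and block.get("type") == "text"
            )
            hits.append((number, text[-20:]))
    return hits


# Sample data shaped like the session log entry above ("stop" is an assumed value).
sample_log = [
    '{"role": "user", "content": [{"type": "text", "text": "hi"}]}',
    '{"role": "assistant", "content": [{"type": "text", "text": "Hello"}], "stopReason": "stop"}',
    '{"role": "assistant", "content": [{"type": "text", "text": "[[reply_to_current]]\\n\\n[[reply_to:<"}], "stopReason": "length"}',
]

hits = find_truncated_replies(sample_log)
print(hits)
```

Against a real log, passing an open file object (for example `with open(path, encoding="utf-8") as f: find_truncated_replies(f)`) works the same way, since the function only iterates over lines.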

By applying defense-in-depth, we ensured that this UI bug is gone for good.

Debugging a CLI Tool Blindspot: Why "Reload" Commands Don't Always Reload Everything

David Yan — Thu, 02 Apr 2026 09:36:05 +0000

As developers, we love automation tools and CLI wrappers. They save us time, abstract away complex configurations, and make our workflows smoother. But what happens when the tool designed to manage your configuration lies to you?

Recently, I encountered a fascinating edge case while managing API keys for my Claude Code environment using a third-party CLI helper (@z_ai/coding-helper).

This is a story about "split-brain" configurations, digging into node_modules to find the truth, and why you should never blindly trust a CLI's success message.

The Ghostly "429 Plan Expired" Error

My setup involved using Claude Code powered by a specific API plan (GLM Coding Plan). After my initial API key expired, I renewed my subscription, got a new key, and ran the standard command provided by the CLI helper to reload my credentials:

chelper auth reload claude

I even ran the built-in health check (chelper doctor), and everything returned a perfect string of green checkmarks. Path? Valid. API Key & Network? Valid. Plan? Active.

However, the moment I tried to invoke a Model Context Protocol (MCP) tool—specifically, the web-reader tool—the agent instantly crashed with this error:

Error: MCP error -429: {"error":{"code":1309,"message":"Your GLM Coding Plan has expired, please renew to restore access."}}

Wait a minute. The health check passed, standard chats worked perfectly, but MCP tools were screaming that my account was expired.

The Investigation: Spot the Difference

Since the UI and the CLI tool were giving me conflicting information, I bypassed them and went straight to the underlying configuration files.

I checked the two places where credentials could theoretically be stored:

The CLI helper's main config (~/.chelper/config.yaml): This contained my brand new, valid API key.
Claude Code's MCP config (~/.claude.json): Looking at the mcpServers object, I checked the HTTP headers.

Boom. The Authorization: Bearer header for the MCP tools was still using the old, expired API key.

We had a "split-brain" scenario. The MCP tools were reading from a stale configuration file.

Root Cause Analysis: Diving into node_modules

Why didn't chelper auth reload update the MCP config? To find out, I opened up the CLI's source code located in node_modules/@z_ai/coding-helper/dist/lib/claude-code-manager.js.

I tracked down the exact function responsible for reloading the configuration:

loadGLMConfig(plan, apiKey) {
    // 1. Ensure onboarding is completed
    this.ensureOnboardingCompleted();

    // 2. Clean up shell environment variables
    this.cleanupShellEnvVars();

    // 3. Update ANTHROPIC_AUTH_TOKEN in settings.json
    this.saveSettings(glmConfig);
}

The flaw in the architecture was immediately obvious.

Claude Code actually splits its configuration into two distinct files:

~/.claude/settings.json (Used for standard Claude API calls)
~/.claude.json (Used to register MCP servers and their auth headers)

The developer of the CLI tool only programmed this.saveSettings() to update the standard API token. The tool completely ignored the MCP server configurations. As a result, any HTTP-based MCP tools registered previously were left stranded with expired credentials.
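A split like this can be detected mechanically. The sketch below compares the token in the settings file with the Bearer tokens on registered MCP servers. It assumes the token lives under env.ANTHROPIC_AUTH_TOKEN in settings.json and that HTTP servers keep their credentials in headers.Authorization; find_stale_mcp_servers and the sample keys and URL are made up for illustration.

```python
def find_stale_mcp_servers(settings: dict, claude_json: dict) -> list:
    """List MCP servers whose Bearer token differs from ANTHROPIC_AUTH_TOKEN."""
    current = settings.get("env", {}).get("ANTHROPIC_AUTH_TOKEN")
    stale = []
    for name, server in claude_json.get("mcpServers", {}).items():
        auth = server.get("headers", {}).get("Authorization", "")
        # Servers without a Bearer header (e.g. local stdio tools) are skipped.
        if auth.startswith("Bearer ") and auth[len("Bearer "):] != current:
            stale.append(name)
    return stale


# Illustrative data only; real files would be loaded with json.load().
settings = {"env": {"ANTHROPIC_AUTH_TOKEN": "new-key-456"}}
claude_json = {
    "mcpServers": {
        "web-reader": {
            "type": "http",
            "url": "https://example.invalid/mcp",
            "headers": {"Authorization": "Bearer old-key-123"},
        },
        "local-tool": {"command": "npx", "args": ["some-mcp-server"]},
    }
}

stale = find_stale_mcp_servers(settings, claude_json)
print(stale)
```

Note that this flags every server whose token differs, including servers that legitimately use a key from another provider, so treat the result as a list of candidates to inspect.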

The Fix

If you are dealing with a similar misbehaving CLI wrapper, you can't rely on its reload command. You have two options:

Option 1: The Nuclear Option (Recommended)
Run the initialization command again from scratch. This usually forces the tool to overwrite all files completely.

npx @z_ai/coding-helper init

Option 2: The Python Automation Fix
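If you have heavily customized your mcpServers and don't want to reset everything, I wrote a quick Python script that finds the HTTP-based MCP tools in your config and updates their headers in place. Replace the placeholder with your new key before running it, and note that it overwrites the Authorization header on every server that defines headers, including any that belong to a different provider: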

import json
import os

config_path = os.path.expanduser('~/.claude.json')
new_key = "YOUR_NEW_VALID_API_KEY"  # replace with your renewed key

try:
    with open(config_path, 'r', encoding='utf-8') as f:
        config = json.load(f)

    updated = 0
    # Iterate through all configured MCP servers
    for mcp_name, mcp_config in config.get('mcpServers', {}).items():
        # Update only those that use HTTP headers for authentication
        if isinstance(mcp_config, dict) and 'headers' in mcp_config:
            mcp_config['headers']['Authorization'] = f'Bearer {new_key}'
            updated += 1
            print(f"Updated auth token for MCP tool: {mcp_name}")

    with open(config_path, 'w', encoding='utf-8') as f:
        json.dump(config, f, indent=2)
    print(f"Done: updated {updated} MCP server(s).")

except FileNotFoundError:
    print("Error: ~/.claude.json configuration file not found.")
except json.JSONDecodeError as exc:
    print(f"Error: ~/.claude.json is not valid JSON: {exc}")

The SDET Takeaway

This is why Software Development Engineers in Test (SDETs) don't just rely on UI green lights or standard CLI outputs.

When an automation tool tells you "Success!" but the system behaves as if it failed, there is almost always a state synchronization issue underneath. The abstraction layers designed to help us can quickly become blindspots.

The Golden Rule of Debugging Toolchains: When system behavior contradicts configuration state, stop trusting the management tools and start reading the raw .json and .yaml files.

Debugging a 400 Error: How a Silent API Gateway Update Broke My LLM Agent

David Yan — Wed, 01 Apr 2026 02:42:09 +0000

As an SDET (Software Development Engineer in Test), I spend a lot of time breaking things. But there is a special kind of frustration when an environment that worked perfectly yesterday suddenly throws a fatal error today, despite zero changes to your local code or configuration.

This is the story of how a silent, server-side API validation update completely broke my local AI agent workflow, and how I debugged it by diving into the raw JSON payloads.

If you are building AI agents using third-party LLM gateways or Model Context Protocol (MCP) tools, this debugging journey might save you hours of pulling your hair out.

The Setup & The Incident

I was using Claude Code CLI (v2.1.69), but instead of routing it to Anthropic’s official API, I pointed the base URL to Zhipu AI's Anthropic-compatible endpoint (https://open.bigmodel.cn/api/anthropic), powered by their glm-5.1 model. This is a common, cost-effective architecture for developers testing local AI agents.

Everything was running smoothly. I was using MCP tools for image analysis and web searching. Then, suddenly, the agent flatlined.

When I prompted the agent to do a simple task that required tools (e.g., "Check what files are in the current directory"), it instantly crashed with a 400 Bad Request:

API Error: 400
{"error":{"code":"1214","message":"messages[4].content[0].type类型错误"},"request_id":"20260330140839a4b83527b21b4c46"}

(Translation of the error: "Type error in messages[4].content[0].type")

The most bizarre part? Standard text chats worked perfectly. But the moment the agent tried to touch any MCP tool, it triggered the 1214 loop of death.

The Investigation: Hunting the Ghost in the Machine

My first instinct was to blame context limits or a corrupted local cache. I cleared the .claude hidden directories and reset the session. The error persisted.

It was time to stop guessing and look at the raw network traffic.

By inspecting the exact payload Claude Code was sending to the API right before the crash, I found the culprit hidden deep within the Anthropic protocol's message array:

{
  "role": "user",
  "content": [
    {
      "type": "tool_result",
      "tool_use_id": "call_xxx",
      "content": [
        {
          "type": "tool_reference",
          "tool_name": "Bash"
        }
      ]
    }
  ]
}

The "Aha!" Moment

Claude Code v2.x relies heavily on a feature called ToolSearch. Because it supports dynamic MCP tools, it doesn't load every tool into the context window at once (the unloaded tools are marked as deferred). Instead, it searches for the right tool and then alerts the LLM using a highly specific, internal Anthropic content block: "type": "tool_reference".

The Zhipu API gateway, built to mimic the Anthropic protocol, simply did not know what tool_reference was.
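When a gateway only reports an index path, it helps to walk the outgoing payload and list every content block type, including the ones nested inside tool_result. The sketch below does that; KNOWN_TYPES is an assumed allow-list for illustration (the gateway's real list is not documented here), and the first message is invented filler.

```python
def iter_block_types(messages):
    """Yield (path, type) for every content block, including nested tool_result content."""
    def walk(blocks, path):
        if not isinstance(blocks, list):
            return  # plain-string content has no typed blocks
        for i, block in enumerate(blocks):
            if not isinstance(block, dict):
                continue
            here = f"{path}.content[{i}]"
            yield here, block.get("type")
            yield from walk(block.get("content"), here)

    for m, message in enumerate(messages):
        yield from walk(message.get("content"), f"messages[{m}]")


# Assumed set of block types the gateway accepts; not taken from any official list.
KNOWN_TYPES = {"text", "image", "tool_use", "tool_result"}

messages = [
    {"role": "user", "content": "Check what files are in the current directory"},
    {
        "role": "user",
        "content": [
            {
                "type": "tool_result",
                "tool_use_id": "call_xxx",
                "content": [{"type": "tool_reference", "tool_name": "Bash"}],
            }
        ],
    },
]

unknown = [(path, t) for path, t in iter_block_types(messages) if t not in KNOWN_TYPES]
print(unknown)
```

Running this against a captured request narrows the search to the exact block the server is likely to reject, instead of reading the whole message array by eye.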

The Root Cause Analysis (RCA)

But wait—if the gateway didn't support it, why did it work perfectly just hours ago?

I checked my terminal logs and constructed a timeline:

March 30, 04:06 AM: Last successful ToolSearch invocation.
March 30, 06:05 AM: First appearance of the 1214 error.

The Conclusion: The API provider deployed a silent backend update during that 2-hour window.

Previously, their gateway likely had a lenient validation policy: If you see a JSON field or type you don't recognize, just ignore it. However, the new update introduced strict Schema Validation. The moment the gateway encountered the unsupported tool_reference type, it rejected the entire payload with a 400 error.
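The difference between the two policies fits in a few lines. This is a sketch of the behavior I am inferring, not the gateway's actual code: SUPPORTED_TYPES, validate_blocks, and UnsupportedBlockError are all hypothetical, and "ignore" is modeled here as dropping the unknown block.

```python
SUPPORTED_TYPES = {"text", "image", "tool_use", "tool_result"}  # assumed, for illustration


class UnsupportedBlockError(ValueError):
    """Raised by the strict policy; a gateway would map this to an HTTP 400."""


def validate_blocks(blocks, strict):
    """Return the blocks to forward, applying a lenient or strict policy."""
    accepted = []
    for index, block in enumerate(blocks):
        if block.get("type") in SUPPORTED_TYPES:
            accepted.append(block)
        elif strict:
            raise UnsupportedBlockError(f"content[{index}].type is not supported")
        # Lenient policy: silently drop the unknown block and carry on.
    return accepted


blocks = [
    {"type": "text", "text": "Looking for a tool"},
    {"type": "tool_reference", "tool_name": "Bash"},
]

lenient_result = validate_blocks(blocks, strict=False)

try:
    validate_blocks(blocks, strict=True)
    strict_error = None
except UnsupportedBlockError as exc:
    strict_error = str(exc)

print(lenient_result)
print(strict_error)
```

Under the lenient policy the request goes through with one block fewer; under the strict policy the same request fails outright, which matches the before-and-after behavior in the timeline.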

This single, undocumented API regression caused a catastrophic failure downstream:

ToolSearch failed completely.
All deferred MCP tools became undiscoverable.
Image analysis MCPs (mcp__zai-mcp-server__analyze_image) were rendered completely useless.

The Workaround

Until the API provider updates their compatibility layer to support the tool_reference block, the only way to unblock the workflow is to surgically disable the feature generating the payload.

If you are facing this exact issue, you can temporarily bypass it by modifying your ~/.claude/settings.json to disable tool searching:

{
  "env": {
    "ANTHROPIC_BASE_URL": "https://open.bigmodel.cn/api/anthropic",
    "ANTHROPIC_AUTH_TOKEN": "your_api_key",
    "ENABLE_TOOL_SEARCH": "0"
  }
}

Trade-off: This stops the 400 errors, but you will lose access to dynamic MCP tool discovery and image analysis capabilities.
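If your settings.json already contains other entries, it is safer to merge the flag in than to paste over the file. A minimal sketch, assuming the env layout shown above; disable_tool_search is an illustrative helper, and the sample values are placeholders.

```python
import copy


def disable_tool_search(settings: dict) -> dict:
    """Return a copy of settings with ENABLE_TOOL_SEARCH set to "0" under env."""
    patched = copy.deepcopy(settings)
    patched.setdefault("env", {})["ENABLE_TOOL_SEARCH"] = "0"
    return patched


# Placeholder settings; a real file would be read with json.load().
existing = {
    "env": {
        "ANTHROPIC_BASE_URL": "https://open.bigmodel.cn/api/anthropic",
        "ANTHROPIC_AUTH_TOKEN": "your_api_key",
    }
}

patched = disable_tool_search(existing)
print(patched["env"])
```

The original dictionary is left untouched, so it can double as a backup to write back once the provider supports tool_reference.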

The SDET Takeaway

This incident is a textbook example of why building applications on top of "API masquerading/compatibility layers" is inherently fragile.

When you sit between a rapidly iterating client (Claude Code) and a third-party gateway trying to reverse-engineer its protocol, you are at the mercy of both sides. A minor schema update on the client, or a strict validation patch on the server, will snap the integration in half.

The ultimate lesson? When the UI says "Unknown Error," don't just reboot. Grab your network inspector, dive into the raw JSON payloads, and follow the data. The truth is always in the Schema.

Have you encountered similar protocol mismatches while building LLM agents? Let me know in the comments!
