Johnny
Why Your API's Error Messages Fail When Called by an LLM (And How to Fix Them)

TL;DR: When you build tools that LLMs call autonomously—OpenAI functions, Claude tools, MCP servers, custom APIs—traditional error messages break the agent workflow. A human can ask "what's a valid reference?" An LLM can't. I rebuilt Verdex's error handling to give LLMs structured recovery plans instead of descriptions. The result: errors that were conversation-killers became recoverable, autonomous workflows stopped failing silently, and the LLM could fix issues without human intervention.

The Problem: When Your User Can't Ask Questions

Here's what happens when a human gets an error:

Error: Unknown element reference: e999

The human thinks:

  • "What's an element reference?"
  • "Where do I get valid ones?"
  • "Why did mine become invalid?"

Then they ask you, and you explain.

Here's what happens when an LLM gets that same error:

Error: Unknown element reference: e999

The LLM:

  1. Tries the same thing again (with the same invalid ref)
  2. Gets the same error
  3. Gives up or hallucinates a solution
  4. The entire agentic workflow fails

The LLM can't ask clarifying questions. It needs to autonomously recover or the task is dead.

What I Was Building

I'm working on Verdex, an MCP server for browser automation. It exposes tools like browser_click(ref), browser_type(ref, text), and browser_snapshot(). The LLM navigates pages, fills forms, and extracts data—all autonomously.

When I first deployed it, I saw this pattern constantly:

LLM: browser_click("e5")
Error: Unknown element reference: e5

LLM: browser_click("e5")  // Tries exact same thing
Error: Unknown element reference: e5

LLM: "I encountered an error. The element reference appears to be invalid."
[workflow ends]

The LLM had no idea that:

  • Element refs come from browser_snapshot()
  • Navigation invalidates old refs
  • It needed to call browser_snapshot() first to get fresh refs

It couldn't ask me. I wasn't there. The error message didn't tell it how to recover.

The Pattern: Structured Recovery Plans

I rebuilt every error message to follow this structure:

❌ [Error Type]

[What failed - with specific details]

[Why it failed - likely causes]

🔧 Action Required:
[Numbered steps to recover]
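As a sketch, this template can be rendered from structured fields by one small helper, so every error reads the same way to the model. The names below (`RecoveryPlan`, `renderRecoveryPlan`) are illustrative, not part of Verdex:

```typescript
// Illustrative helper that renders the recovery-plan template above.
// These names are hypothetical, not Verdex's actual API.
type RecoveryPlan = {
  title: string;   // [Error Type]
  what: string;    // What failed, with specific details
  causes: string[]; // Likely causes
  steps: string[]; // Numbered recovery steps
};

function renderRecoveryPlan(plan: RecoveryPlan): string {
  return [
    `❌ ${plan.title}`,
    ``,
    plan.what,
    ``,
    `Possible causes:`,
    ...plan.causes.map((c) => `• ${c}`),
    ``,
    `🔧 Action Required:`,
    ...plan.steps.map((s, i) => `${i + 1}. ${s}`),
  ].join("\n");
}

const msg = renderRecoveryPlan({
  title: "Unknown Element Reference",
  what: "Reference e999 doesn't exist in the current snapshot.",
  causes: ["Stale ref after navigation", "Typo in the ref name"],
  steps: ["Call browser_snapshot()", "Retry with a ref from the new snapshot"],
});
```

Centralizing the template in one function also makes it cheap to A/B-test the wording against your agent's recovery rate.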

Here's the before and after:

Before (Developer-Focused)

Error: Unknown element reference: e999

After (LLM-Focused)

❌ Unknown Element Reference

Reference: e999

This reference doesn't exist in the current snapshot.

Possible causes:
• Using a ref from an old snapshot (stale after navigation)
• Typo in the ref name
• Element not yet loaded or not interactive

🔧 Action Required:
1. Call browser_snapshot() to see currently available elements
2. Find the correct element reference in the new snapshot
3. Use the correct ref from the latest snapshot

The difference:

  • Context: What a "reference" is and where they come from
  • Diagnosis: Why this specific one is invalid
  • Recovery: Exact API calls needed to fix it (with names!)
  • Ordering: Numbered steps the LLM can follow sequentially

Implementation: Rich Error Classes

Every error type gets its own class with structured properties:

export class UnknownRefError extends Error {
  constructor(public ref: string) {
    super(
      `Unknown element reference: ${ref}. ` +
      `Ref may be stale after navigation. Take a new snapshot to get fresh refs.`
    );
    this.name = "UnknownRefError";
  }
}

export class StaleRefError extends Error {
  constructor(
    public ref: string,
    public elementInfo: { role: string; name: string; tagName: string }
  ) {
    super(
      `Element ${ref} (${elementInfo.role} "${elementInfo.name}") was removed from DOM. ` +
      `Take a new snapshot() to refresh refs.`
    );
    this.name = "StaleRefError";
  }
}

export class FrameDetachedError extends Error {
  constructor(public frameId: string, details?: string) {
    super(`Frame ${frameId} was detached${details ? `: ${details}` : ""}`);
    this.name = "FrameDetachedError";
  }
}

export class NavigationError extends Error {
  constructor(public url: string, public role: string, details: string) {
    super(`Navigation failed for role '${role}' to '${url}': ${details}`);
    this.name = "NavigationError";
  }
}

export class AuthenticationError extends Error {
  constructor(
    public role: string,
    public authPath: string,
    public reason: string
  ) {
    super(
      `Authentication required for role '${role}' but failed to load from ${authPath}: ${reason}`
    );
    this.name = "AuthenticationError";
  }
}

The properties (ref, elementInfo, frameId, etc.) enable the formatter to provide specific, actionable guidance.

The Formatter: Error Messages as Recovery Scripts

At the MCP server layer, I intercept all errors and format them for LLM consumption:

async callTool(name: string, args: any) {
  try {
    // Route to appropriate handler...
    return await handler(args);
  } catch (error) {
    return {
      content: [{
        type: "text",
        text: this.formatErrorForLLM(error),
      }],
    };
  }
}

private formatErrorForLLM(error: unknown): string {
  // Unknown reference - ref doesn't exist in snapshot
  if (error instanceof UnknownRefError) {
    return `❌ Unknown Element Reference

Reference: ${error.ref}

This reference doesn't exist in the current snapshot.

Possible causes:
• Using a ref from an old snapshot (stale after navigation)
• Typo in the ref name
• Element not yet loaded or not interactive

🔧 Action Required:
1. Call browser_snapshot() to see currently available elements
2. Find the correct element reference in the new snapshot
3. Use the correct ref from the latest snapshot`;
  }

  // Stale reference - element was removed from DOM
  if (error instanceof StaleRefError) {
    return `❌ Stale Element Reference

Element: ${error.ref}
Type: ${error.elementInfo.role}
Label: "${error.elementInfo.name}"
Tag: <${error.elementInfo.tagName}>

The element was removed from the DOM, likely due to:
• Page navigation or refresh
• Dynamic content update
• JavaScript manipulation

🔧 Action Required:
Call browser_snapshot() to get fresh element references, then retry your action.`;
  }

  // Frame detached - iframe removed during operation
  if (error instanceof FrameDetachedError) {
    return `❌ Frame Detached

Frame ID: ${error.frameId}

An iframe was removed or navigated during the operation.

This is often normal during:
• Navigation between pages
• Single-page app (SPA) route changes
• Dynamic iframe removal by JavaScript

🔧 Action Required:
Call browser_snapshot() to see the current page structure and available frames.`;
  }

  // Authentication failed - required auth cannot load
  if (error instanceof AuthenticationError) {
    return `❌ Authentication Required

Role: ${error.role}
Auth File: ${error.authPath}

Failed to load authentication data: ${error.reason}

This role requires authentication but the auth file couldn't be loaded.

Possible causes:
• Auth file doesn't exist at specified path
• Auth file has invalid JSON format
• Auth file permissions prevent reading
• Path specified incorrectly in configuration

🔧 Action Required:
1. Verify auth file exists: ${error.authPath}
2. Check file permissions (must be readable)
3. Validate JSON format in auth file
4. If auth is optional, set authRequired: false in role config
5. Run auth capture process if credentials expired`;
  }

  // Navigation failed - couldn't navigate to URL
  if (error instanceof NavigationError) {
    return `❌ Navigation Failed

URL: ${error.url}
Role: ${error.role}

${error.message}

Possible causes:
• Invalid or unreachable URL
• Network connectivity issues
• Server error (404, 500, etc.)
• Authentication required (check warnings in snapshot)
• Timeout (page took too long to load)
• Main frame injection failed

🔧 Action Required:
• Verify the URL is correct and accessible
• Check network connectivity
• Call browser_snapshot() to see current page state
• Check role authentication status via getFailures()
• Try a different URL or retry after a moment`;
  }

  // Generic error fallback
  if (error instanceof Error) {
    return `❌ Error

${error.message}

If this error persists, check:
• Your input parameters
• Current page state (call browser_snapshot())
• Network connectivity
• Browser logs for additional context`;
  }

  // Unknown error type
  return `❌ Unknown Error

${String(error)}

This is an unexpected error type. Please report this issue with context about what operation you were attempting.`;
}

Universal Principles for LLM Error Messages

After implementing this across 8 error types, here's what works:

1. Explicit Tool/Function Names

❌ Bad: "Get a new snapshot"

✅ Good: "Call browser_snapshot()"

The LLM needs the exact function name it should call. Don't make it guess.

2. Numbered Recovery Steps

❌ Bad: "You need to refresh the page and try again"

✅ Good:

1. Call browser_navigate(url)
2. Call browser_snapshot() to get new refs
3. Find the button in the new snapshot
4. Retry browser_click() with the new ref

LLMs follow numbered lists well. They struggle with prose instructions.

3. Explain the Why

❌ Bad: "Invalid ref"

✅ Good: "This reference doesn't exist because navigation invalidates old refs"

Understanding causation helps the LLM avoid the same mistake next time.

4. Multiple Diagnosis Options

Possible causes:
• Using a ref from an old snapshot (stale after navigation)
• Typo in the ref name
• Element not yet loaded or not interactive

The LLM can pattern-match against its recent actions to figure out which cause applies.

5. Include Structured Data

Element: e5
Type: button
Label: "Submit"
Tag: <button>

Structured info helps the LLM recognize what it was trying to interact with, making it easier to find the element in a fresh snapshot.

6. Distinguish Expected vs Unexpected

Some failures are normal:

  • Frame detachment during navigation
  • Cross-origin iframe access denied
  • Element not found (might load later)

Mark these explicitly: "This is often normal during..." vs "This is an unexpected error."

It prevents the LLM from treating every error as fatal.
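One lightweight way to encode the distinction is a predicate over your error classes. This is a sketch: `isExpectedFailure` and `prefixForLLM` are illustrative helpers, and the minimal class stubs stand in for the richer classes defined earlier:

```typescript
// Minimal stand-ins for the error classes defined earlier in the post.
class FrameDetachedError extends Error {}
class StaleRefError extends Error {}
class AuthenticationError extends Error {}

// Sketch: frames detaching and refs going stale are routine in SPAs,
// so mark them as expected; everything else is flagged as unexpected.
function isExpectedFailure(error: unknown): boolean {
  return error instanceof FrameDetachedError || error instanceof StaleRefError;
}

function prefixForLLM(error: unknown): string {
  return isExpectedFailure(error)
    ? "This is often normal during navigation."
    : "This is an unexpected error.";
}
```

The formatter can prepend this line so the model immediately knows whether to retry calmly or escalate.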

Pattern 2: Track, Then Decide

For operations with partial failures, I separate tracking from policy enforcement.

The Problem

// ❌ Don't decide criticality at failure site
try {
  await injectIntoFrame(frameId);
} catch (error) {
  if (frameId === mainFrameId) {
    throw error; // Critical!
  }
  // Otherwise ignore?
}

This couples failure handling with business logic. Every injection site needs to know what's critical.

The Solution: FailureLog

type FailureLog = {
  frameInjectionFailures: Array<{
    frameId: string;
    error: string;
    reason: "cross-origin" | "detached" | "timeout" | "unknown";
    isMainFrame: boolean; // Track criticality as metadata
    timestamp: number;
  }>;
  frameExpansionFailures: Array<{
    ref: string;
    error: string;
    detached: boolean;
    timestamp: number;
  }>;
  authLoadError?: {
    error: string;
    authPath: string;
    timestamp: number;
  };
};

Step 1: Operations track all failures

async injectFrameTreeRecursive(
  context: RoleContext,
  frameTree: any,
  isMainFrame: boolean = false
): Promise<void> {
  try {
    await context.bridgeInjector.ensureFrameState(
      context.cdpSession,
      frameTree.frame.id
    );
  } catch (error) {
    // Track in FailureLog (don't throw yet)
    const failures = this.ensureFailureLog(context);
    failures.frameInjectionFailures.push({
      frameId: frameTree.frame.id,
      error: error instanceof Error ? error.message : String(error),
      reason: this.classifyFrameError(error),
      isMainFrame, // Metadata, not decision
      timestamp: Date.now(),
    });
    return;
  }

  // Recurse into child frames (CDP FrameTree shape: { frame, childFrames? })
  for (const child of frameTree.childFrames ?? []) {
    await this.injectFrameTreeRecursive(context, child, false);
  }
}
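The `ensureFailureLog` and `classifyFrameError` helpers referenced above aren't shown in the post; a minimal sketch might look like this. The empty-log shape follows the `FailureLog` type, while classifying by matching on the error message text is an assumption about what the real implementation does:

```typescript
type FailureReason = "cross-origin" | "detached" | "timeout" | "unknown";

// Sketch: lazily create the FailureLog on the context the first time
// something fails. The minimal context shape here is an assumption.
function ensureFailureLog(context: { failures?: any }): any {
  if (!context.failures) {
    context.failures = {
      frameInjectionFailures: [],
      frameExpansionFailures: [],
    };
  }
  return context.failures;
}

// Sketch: bucket a frame error into a coarse reason by message text.
function classifyFrameError(error: Error): FailureReason {
  const msg = error.message.toLowerCase();
  if (msg.includes("cross-origin") || msg.includes("permission")) return "cross-origin";
  if (msg.includes("detached")) return "detached";
  if (msg.includes("timeout")) return "timeout";
  return "unknown";
}
```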

Step 2: Decision points check FailureLog

async navigate(url: string): Promise<Snapshot> {
  // ... navigation logic ...

  await this.discoverAndInjectFrames(context);

  // DECISION POINT: Check for critical failures
  const mainFrameFailed = context.failures?.frameInjectionFailures
    .some(f => f.isMainFrame);

  if (mainFrameFailed) {
    throw new Error('Main frame injection failed - page cannot be automated');
  }

  // Non-critical failures become warnings
  snapshot.warnings = this.buildWarningsFromFailureLog(context);
  return snapshot;
}

Step 3: Warnings expose non-critical failures

private buildWarningsFromFailureLog(context: RoleContext) {
  const failures = context.failures;
  if (!failures) return undefined;

  const warnings: any = {};

  // Check for inaccessible frames (non-main frames that failed)
  const inaccessibleFrames = failures.frameInjectionFailures
    .filter(f => !f.isMainFrame);

  if (inaccessibleFrames.length > 0) {
    warnings.inaccessibleFrames = inaccessibleFrames.length;
    warnings.details = inaccessibleFrames.map(f => 
      `Frame ${f.frameId}: ${f.reason}`
    );
  }

  // Check for auth failures
  if (failures.authLoadError) {
    warnings.authStatus = "unauthenticated";
    warnings.details = warnings.details || [];
    warnings.details.push(`Auth failed: ${failures.authLoadError.error}`);
  }

  return Object.keys(warnings).length > 0 ? warnings : undefined;
}

This pattern appears in the snapshot output:

{
  "text": "- button \"Submit\" [ref=e1]\n...",
  "elementCount": 15,
  "warnings": {
    "inaccessibleFrames": 2,
    "details": [
      "Frame abc123: cross-origin",
      "Frame def456: detached"
    ]
  }
}

Why This Works for LLMs

The LLM sees:

  • ✅ Operation succeeded (got a snapshot)
  • ✅ Partial failures are transparent (warnings)
  • ✅ Clear reason for each failure
  • ✅ Can proceed with main content

Without warnings, the LLM doesn't know if missing content is a problem or expected behavior.

What Changed

Before: LLMs retried the same failed operation repeatedly

LLM: browser_click("e5")
Error: Unknown element reference: e5

LLM: browser_click("e5")  
Error: Unknown element reference: e5

LLM: browser_click("e5")
Error: Unknown element reference: e5

[workflow fails]

After: LLMs autonomously recover

LLM: browser_click("e5")
Error: Unknown element reference: e5
[Error includes recovery steps mentioning browser_snapshot()]

LLM: browser_snapshot()
[Gets fresh refs, sees e7 is the submit button]

LLM: browser_click("e7")
Success!

Error recovery rate went from ~20% to ~95%. Most failures are now self-healing.

Conversation length decreased. Errors that required human intervention ("what ref should I use?") now resolve autonomously.

Workflow reliability improved. Multi-step tasks that would fail on the first error now complete end-to-end.

Adapting This to Your Stack

This pattern works for any LLM tool interface:

OpenAI Function Calling

# Handler registered with the model via a JSON schema in the `tools`
# parameter; it returns text either way, so the model always gets
# something to read.
def click_element(ref: str) -> str:
    """Click an interactive element."""
    try:
        element = get_element(ref)
        element.click()
        return "Clicked successfully"
    except UnknownRefError as e:
        return format_error_for_llm(e)

Return structured error text instead of raising exceptions.

Anthropic Claude Tools

server.tool("click_element", 
  { ref: { type: "string" } },
  async ({ ref }) => {
    try {
      await clickElement(ref);
      return { content: [{ type: "text", text: "Clicked successfully" }] };
    } catch (error) {
      return { 
        content: [{ 
          type: "text", 
          text: formatErrorForLLM(error) 
        }] 
      };
    }
  }
);

LangChain Tools

class ClickElementTool(BaseTool):
    name = "click_element"
    description = "Click an interactive element"

    def _run(self, ref: str) -> str:
        try:
            click_element(ref)
            return "Clicked successfully"
        except Exception as e:
            return format_error_for_llm(e)

REST APIs for Agents

app.post('/api/click', async (req, res) => {
  try {
    await clickElement(req.body.ref);
    res.json({ success: true, message: "Clicked successfully" });
  } catch (error) {
    // Don't use HTTP error codes - return 200 with formatted error
    res.json({ 
      success: false, 
      error: formatErrorForLLM(error) 
    });
  }
});

The key: Return errors as structured text, not as exceptions or HTTP error codes. The LLM needs to read the error, not catch it.
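If you want this boundary behavior everywhere, you can wrap handlers once instead of repeating try/catch in every tool. This is a sketch: `asLLMTool` is an illustrative name, and `formatErrorForLLM` here is a trivial stand-in for whatever formatter your stack uses:

```typescript
// Stand-in formatter; swap in your real recovery-plan formatter.
function formatErrorForLLM(error: unknown): string {
  return error instanceof Error
    ? `❌ Error\n\n${error.message}`
    : `❌ Unknown Error\n\n${String(error)}`;
}

// Sketch: wrap any async tool handler so errors come back as text
// the LLM can read, instead of exceptions it can never catch.
function asLLMTool<A extends unknown[]>(
  handler: (...args: A) => Promise<string>
): (...args: A) => Promise<string> {
  return async (...args: A) => {
    try {
      return await handler(...args);
    } catch (error) {
      return formatErrorForLLM(error);
    }
  };
}

// Hypothetical usage: the wrapped tool never rejects.
const clickTool = asLLMTool(async (ref: string) => {
  if (ref !== "e1") throw new Error(`Unknown element reference: ${ref}`);
  return "Clicked successfully";
});
```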

What I Learned

  1. Error messages are part of your API contract. When building for LLMs, they're as important as success responses.

  2. "Tool name" is the most important piece of information. The LLM needs to know exactly what function to call to recover. Don't say "refresh" when you mean "browser_snapshot()".

  3. Numbered lists beat prose. LLMs follow sequential steps better than paragraph instructions.

  4. Distinguish expected from unexpected failures. "This is normal during navigation" prevents the LLM from treating temporary failures as fatal.

  5. Track failures, classify at decision points. Don't decide if something is critical at the failure site—track it and decide where you have business context.

  6. Warnings enable partial success. Operations can succeed with degraded functionality if failures are transparent.

  7. Recovery plans beat descriptions. "Element not found" tells the LLM nothing. "Call get_elements(), find the element by name, retry with the correct ref" is a recovery plan.

  8. Rich error classes enable specific guidance. UnknownRefError can give different recovery steps than StaleRefError because they have different properties.

The Shift

Traditional error handling is designed for debugging. You want stack traces, error codes, and technical details because you'll investigate and fix the code.

LLM error handling is designed for autonomous recovery. You want context, diagnosis, and exact recovery steps because the LLM will fix the situation without code changes.

The error message isn't documentation—it's a recovery script the LLM executes.

When you're building tools for autonomous AI use, every error is an opportunity for the system to self-heal. Make your error messages teach the LLM how.
