TL;DR: When you build tools that LLMs call autonomously—OpenAI functions, Claude tools, MCP servers, custom APIs—traditional error messages break the agent workflow. A human can ask "what's a valid reference?" An LLM can't. I rebuilt Verdex's error handling to give LLMs structured recovery plans instead of descriptions. The result: errors that were conversation-killers became recoverable, autonomous workflows stopped failing silently, and the LLM could fix issues without human intervention.
The Problem: When Your User Can't Ask Questions
Here's what happens when a human gets an error:
Error: Unknown element reference: e999
The human thinks:
- "What's an element reference?"
- "Where do I get valid ones?"
- "Why did mine become invalid?"
Then they ask you, and you explain.
Here's what happens when an LLM gets that same error:
Error: Unknown element reference: e999
The LLM:
- Tries the same thing again (with the same invalid ref)
- Gets the same error
- Gives up or hallucinates a solution
- The entire agentic workflow fails
The LLM can't ask clarifying questions. It needs to autonomously recover or the task is dead.
What I Was Building
I'm working on Verdex, an MCP server for browser automation. It exposes tools like browser_click(ref), browser_type(ref, text), and browser_snapshot(). The LLM navigates pages, fills forms, and extracts data—all autonomously.
When I first deployed it, I saw this pattern constantly:
LLM: browser_click("e5")
Error: Unknown element reference: e5
LLM: browser_click("e5") // Tries exact same thing
Error: Unknown element reference: e5
LLM: "I encountered an error. The element reference appears to be invalid."
[workflow ends]
The LLM had no idea that:
- Element refs come from browser_snapshot()
- Navigation invalidates old refs
- It needed to call browser_snapshot() first to get fresh refs
It couldn't ask me. I wasn't there. The error message didn't tell it how to recover.
The Pattern: Structured Recovery Plans
I rebuilt every error message to follow this structure:
❌ [Error Type]
[What failed - with specific details]
[Why it failed - likely causes]
🔧 Action Required:
[Numbered steps to recover]
Here's the before and after:
Before (Developer-Focused)
Error: Unknown element reference: e999
After (LLM-Focused)
❌ Unknown Element Reference
Reference: e999
This reference doesn't exist in the current snapshot.
Possible causes:
• Using a ref from an old snapshot (stale after navigation)
• Typo in the ref name
• Element not yet loaded or not interactive
🔧 Action Required:
1. Call browser_snapshot() to see currently available elements
2. Find the correct element reference in the new snapshot
3. Use the correct ref from the latest snapshot
The difference:
- Context: What a "reference" is and where they come from
- Diagnosis: Why this specific one is invalid
- Recovery: Exact API calls needed to fix it (with names!)
- Ordering: Numbered steps the LLM can follow sequentially
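The structure above can be enforced with a small template helper so every error renders the same way. This is just a sketch of the shape — the names (`LLMErrorReport`, `formatLLMError`) are illustrative, not from Verdex:

```typescript
// Sketch of a reusable template for the error structure above.
// Field and function names are illustrative assumptions.
interface LLMErrorReport {
  title: string;      // e.g. "Unknown Element Reference"
  details: string[];  // what failed, with specifics
  causes: string[];   // likely reasons, for LLM pattern-matching
  recovery: string[]; // ordered steps, with exact tool names
}

function formatLLMError(report: LLMErrorReport): string {
  return [
    `❌ ${report.title}`,
    ...report.details,
    "Possible causes:",
    ...report.causes.map((c) => `• ${c}`),
    "🔧 Action Required:",
    ...report.recovery.map((step, i) => `${i + 1}. ${step}`),
  ].join("\n");
}
```

Centralizing the template means a new error type only has to supply content, and the ❌/🔧 structure stays consistent across all of them.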
Implementation: Rich Error Classes
Every error type gets its own class with structured properties:
export class UnknownRefError extends Error {
constructor(public ref: string) {
super(
`Unknown element reference: ${ref}. ` +
`Ref may be stale after navigation. Take a new snapshot to get fresh refs.`
);
this.name = "UnknownRefError";
}
}
export class StaleRefError extends Error {
constructor(
public ref: string,
public elementInfo: { role: string; name: string; tagName: string }
) {
super(
`Element ${ref} (${elementInfo.role} "${elementInfo.name}") was removed from DOM. ` +
`Take a new snapshot() to refresh refs.`
);
this.name = "StaleRefError";
}
}
export class FrameDetachedError extends Error {
constructor(public frameId: string, details?: string) {
super(`Frame ${frameId} was detached${details ? `: ${details}` : ""}`);
this.name = "FrameDetachedError";
}
}
export class NavigationError extends Error {
constructor(public url: string, public role: string, details: string) {
super(`Navigation failed for role '${role}' to '${url}': ${details}`);
this.name = "NavigationError";
}
}
export class AuthenticationError extends Error {
constructor(
public role: string,
public authPath: string,
public reason: string
) {
super(
`Authentication required for role '${role}' but failed to load from ${authPath}: ${reason}`
);
this.name = "AuthenticationError";
}
}
The properties (ref, elementInfo, frameId, etc.) enable the formatter to provide specific, actionable guidance.
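For context, here is roughly where such an error gets thrown — at the point a ref lookup fails. This is a self-contained sketch: the snapshot map and `resolveRef` helper are assumptions, not Verdex's actual internals, and the error class is a minimal copy of the one above.

```typescript
// Minimal copy of the error class from above, so this sketch is self-contained.
class UnknownRefError extends Error {
  constructor(public ref: string) {
    super(`Unknown element reference: ${ref}.`);
    this.name = "UnknownRefError";
  }
}

// Hypothetical ref registry: the real server maps refs to live DOM handles.
const currentSnapshot = new Map<string, { role: string; name: string }>([
  ["e1", { role: "button", name: "Submit" }],
]);

function resolveRef(ref: string) {
  const element = currentSnapshot.get(ref);
  if (!element) {
    // Throw the rich error here; the MCP layer formats it for the LLM.
    throw new UnknownRefError(ref);
  }
  return element;
}
```

The handler only records *what* failed; turning that into recovery guidance is the formatter's job, which keeps the two concerns separate.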
The Formatter: Error Messages as Recovery Scripts
At the MCP server layer, I intercept all errors and format them for LLM consumption:
async callTool(name: string, args: any) {
try {
// Route to appropriate handler...
return await handler(args);
} catch (error) {
return {
content: [{
type: "text",
text: this.formatErrorForLLM(error),
}],
};
}
}
private formatErrorForLLM(error: unknown): string {
// Unknown reference - ref doesn't exist in snapshot
if (error instanceof UnknownRefError) {
return `❌ Unknown Element Reference
Reference: ${error.ref}
This reference doesn't exist in the current snapshot.
Possible causes:
• Using a ref from an old snapshot (stale after navigation)
• Typo in the ref name
• Element not yet loaded or not interactive
🔧 Action Required:
1. Call browser_snapshot() to see currently available elements
2. Find the correct element reference in the new snapshot
3. Use the correct ref from the latest snapshot`;
}
// Stale reference - element was removed from DOM
if (error instanceof StaleRefError) {
return `❌ Stale Element Reference
Element: ${error.ref}
Type: ${error.elementInfo.role}
Label: "${error.elementInfo.name}"
Tag: <${error.elementInfo.tagName}>
The element was removed from the DOM, likely due to:
• Page navigation or refresh
• Dynamic content update
• JavaScript manipulation
🔧 Action Required:
Call browser_snapshot() to get fresh element references, then retry your action.`;
}
// Frame detached - iframe removed during operation
if (error instanceof FrameDetachedError) {
return `❌ Frame Detached
Frame ID: ${error.frameId}
An iframe was removed or navigated during the operation.
This is often normal during:
• Navigation between pages
• Single-page app (SPA) route changes
• Dynamic iframe removal by JavaScript
🔧 Action Required:
Call browser_snapshot() to see the current page structure and available frames.`;
}
// Authentication failed - required auth cannot load
if (error instanceof AuthenticationError) {
return `❌ Authentication Required
Role: ${error.role}
Auth File: ${error.authPath}
Failed to load authentication data: ${error.reason}
This role requires authentication but the auth file couldn't be loaded.
Possible causes:
• Auth file doesn't exist at specified path
• Auth file has invalid JSON format
• Auth file permissions prevent reading
• Path specified incorrectly in configuration
🔧 Action Required:
1. Verify auth file exists: ${error.authPath}
2. Check file permissions (must be readable)
3. Validate JSON format in auth file
4. If auth is optional, set authRequired: false in role config
5. Run auth capture process if credentials expired`;
}
// Navigation failed - couldn't navigate to URL
if (error instanceof NavigationError) {
return `❌ Navigation Failed
URL: ${error.url}
Role: ${error.role}
${error.message}
Possible causes:
• Invalid or unreachable URL
• Network connectivity issues
• Server error (404, 500, etc.)
• Authentication required (check warnings in snapshot)
• Timeout (page took too long to load)
• Main frame injection failed
🔧 Action Required:
• Verify the URL is correct and accessible
• Check network connectivity
• Call browser_snapshot() to see current page state
• Check role authentication status via getFailures()
• Try a different URL or retry after a moment`;
}
// Generic error fallback
if (error instanceof Error) {
return `❌ Error
${error.message}
If this error persists, check:
• Your input parameters
• Current page state (call browser_snapshot())
• Network connectivity
• Browser logs for additional context`;
}
// Unknown error type
return `❌ Unknown Error
${String(error)}
This is an unexpected error type. Please report this issue with context about what operation you were attempting.`;
}
Universal Principles for LLM Error Messages
After implementing this across 8 error types, here's what works:
1. Explicit Tool/Function Names
❌ Bad: "Get a new snapshot"
✅ Good: "Call browser_snapshot()"
The LLM needs the exact function name it should call. Don't make it guess.
2. Numbered Recovery Steps
❌ Bad: "You need to refresh the page and try again"
✅ Good:
1. Call browser_navigate(url)
2. Call browser_snapshot() to get new refs
3. Find the button in the new snapshot
4. Retry browser_click() with the new ref
LLMs follow numbered lists well. They struggle with prose instructions.
3. Explain the Why
❌ Bad: "Invalid ref"
✅ Good: "This reference doesn't exist because navigation invalidates old refs"
Understanding causation helps the LLM avoid the same mistake next time.
4. Multiple Diagnosis Options
Possible causes:
• Using a ref from an old snapshot (stale after navigation)
• Typo in the ref name
• Element not yet loaded or not interactive
The LLM can pattern-match against its recent actions to figure out which cause applies.
5. Include Structured Data
Element: e5
Type: button
Label: "Submit"
Tag: <button>
Structured info helps the LLM recognize what it was trying to interact with, making it easier to find the element in a fresh snapshot.
6. Distinguish Expected vs Unexpected
Some failures are normal:
- Frame detachment during navigation
- Cross-origin iframe access denied
- Element not found (might load later)
Mark these explicitly: "This is often normal during..." vs "This is an unexpected error."
It prevents the LLM from treating every error as fatal.
Pattern 2: Track, Then Decide
For operations with partial failures, I separate tracking from policy enforcement.
The Problem
// ❌ Don't decide criticality at failure site
try {
await injectIntoFrame(frameId);
} catch (error) {
if (frameId === mainFrameId) {
throw error; // Critical!
}
// Otherwise ignore?
}
This couples failure handling with business logic. Every injection site needs to know what's critical.
The Solution: FailureLog
type FailureLog = {
frameInjectionFailures: Array<{
frameId: string;
error: string;
reason: "cross-origin" | "detached" | "timeout" | "unknown";
isMainFrame: boolean; // Track criticality as metadata
timestamp: number;
}>;
frameExpansionFailures: Array<{
ref: string;
error: string;
detached: boolean;
timestamp: number;
}>;
authLoadError?: {
error: string;
authPath: string;
timestamp: number;
};
};
Step 1: Operations track all failures
async injectFrameTreeRecursive(
context: RoleContext,
frameTree: any,
isMainFrame: boolean = false
): Promise<void> {
try {
await context.bridgeInjector.ensureFrameState(
context.cdpSession,
frameTree.frame.id
);
} catch (error) {
// Track in FailureLog (don't throw yet)
const failures = this.ensureFailureLog(context);
failures.frameInjectionFailures.push({
frameId: frameTree.frame.id,
error: error instanceof Error ? error.message : String(error),
reason: this.classifyFrameError(error),
isMainFrame, // Metadata, not decision
timestamp: Date.now(),
});
return;
}
}
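The snippet above calls `ensureFailureLog()` and `classifyFrameError()`, which aren't shown. Here's a plausible sketch of both — the `FailureLog` shape matches the type defined earlier, but the classification heuristics are my assumptions, not Verdex's actual rules:

```typescript
type FrameFailureReason = "cross-origin" | "detached" | "timeout" | "unknown";

// Abbreviated FailureLog, matching the fields used in this section.
interface FailureLog {
  frameInjectionFailures: Array<{
    frameId: string; error: string; reason: FrameFailureReason;
    isMainFrame: boolean; timestamp: number;
  }>;
  frameExpansionFailures: Array<{
    ref: string; error: string; detached: boolean; timestamp: number;
  }>;
}

interface RoleContext { failures?: FailureLog }

// Lazily attach a FailureLog to the context on first failure.
function ensureFailureLog(context: RoleContext): FailureLog {
  if (!context.failures) {
    context.failures = { frameInjectionFailures: [], frameExpansionFailures: [] };
  }
  return context.failures;
}

// Heuristic classification from the error message — illustrative patterns only.
function classifyFrameError(error: unknown): FrameFailureReason {
  const message = error instanceof Error ? error.message : String(error);
  if (/cross-origin|permission denied/i.test(message)) return "cross-origin";
  if (/detached|no frame/i.test(message)) return "detached";
  if (/timeout|timed out/i.test(message)) return "timeout";
  return "unknown";
}
```

The classifier is deliberately coarse: decision points downstream only need to distinguish "expected" categories (cross-origin, detached) from genuinely unknown failures.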
Step 2: Decision points check FailureLog
async navigate(url: string): Promise<Snapshot> {
// ... navigation logic ...
await this.discoverAndInjectFrames(context);
// DECISION POINT: Check for critical failures
const mainFrameFailed = context.failures?.frameInjectionFailures
.some(f => f.isMainFrame);
if (mainFrameFailed) {
throw new Error('Main frame injection failed - page cannot be automated');
}
// Non-critical failures become warnings
snapshot.warnings = this.buildWarningsFromFailureLog(context);
return snapshot;
}
Step 3: Warnings expose non-critical failures
private buildWarningsFromFailureLog(context: RoleContext) {
const failures = context.failures;
if (!failures) return undefined;
const warnings: any = {};
// Check for inaccessible frames (non-main frames that failed)
const inaccessibleFrames = failures.frameInjectionFailures
.filter(f => !f.isMainFrame);
if (inaccessibleFrames.length > 0) {
warnings.inaccessibleFrames = inaccessibleFrames.length;
warnings.details = inaccessibleFrames.map(f =>
`Frame ${f.frameId}: ${f.reason}`
);
}
// Check for auth failures
if (failures.authLoadError) {
warnings.authStatus = "unauthenticated";
warnings.details = warnings.details || [];
warnings.details.push(`Auth failed: ${failures.authLoadError.error}`);
}
return Object.keys(warnings).length > 0 ? warnings : undefined;
}
This pattern appears in the snapshot output:
{
"text": "- button \"Submit\" [ref=e1]\n...",
"elementCount": 15,
"warnings": {
"inaccessibleFrames": 2,
"details": [
"Frame abc123: cross-origin",
"Frame def456: detached"
]
}
}
Why This Works for LLMs
The LLM sees:
- ✅ Operation succeeded (got a snapshot)
- ✅ Partial failures are transparent (warnings)
- ✅ Clear reason for each failure
- ✅ Can proceed with main content
Without warnings, the LLM doesn't know if missing content is a problem or expected behavior.
What Changed
Before: LLMs retried the same failed operation repeatedly
LLM: browser_click("e5")
Error: Unknown element reference: e5
LLM: browser_click("e5")
Error: Unknown element reference: e5
LLM: browser_click("e5")
Error: Unknown element reference: e5
[workflow fails]
After: LLMs autonomously recover
LLM: browser_click("e5")
Error: Unknown element reference: e5
[Error includes recovery steps mentioning browser_snapshot()]
LLM: browser_snapshot()
[Gets fresh refs, sees e7 is the submit button]
LLM: browser_click("e7")
Success!
Error recovery rate went from ~20% to ~95%. Most failures are now self-healing.
Conversation length decreased. Errors that required human intervention ("what ref should I use?") now resolve autonomously.
Workflow reliability improved. Multi-step tasks that would fail on the first error now complete end-to-end.
Adapting This to Your Stack
This pattern works for any LLM tool interface:
OpenAI Function Calling
def click_element(ref: str) -> str:
    """Click an interactive element (dispatched from your function-calling loop)."""
    try:
        element = get_element(ref)
        element.click()
        return "Clicked successfully"
    except UnknownRefError as e:
        return format_error_for_llm(e)
Return structured error text instead of raising exceptions.
Anthropic Claude Tools
server.tool("click_element",
{ ref: { type: "string" } },
async ({ ref }) => {
try {
await clickElement(ref);
return { content: [{ type: "text", text: "Clicked successfully" }] };
} catch (error) {
return {
content: [{
type: "text",
text: formatErrorForLLM(error)
}]
};
}
}
);
LangChain Tools
class ClickElementTool(BaseTool):
    name = "click_element"
    description = "Click an interactive element"

    def _run(self, ref: str) -> str:
        try:
            click_element(ref)
            return "Clicked successfully"
        except Exception as e:
            return format_error_for_llm(e)
REST APIs for Agents
app.post('/api/click', async (req, res) => {
try {
await clickElement(req.body.ref);
res.json({ success: true, message: "Clicked successfully" });
} catch (error) {
// Don't use HTTP error codes - return 200 with formatted error
res.json({
success: false,
error: formatErrorForLLM(error)
});
}
});
The key: Return errors as structured text, not as exceptions or HTTP error codes. The LLM needs to read the error, not catch it.
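If you have many tools, the try/catch boilerplate can be factored into one higher-order wrapper. A sketch — the names are illustrative, the result shape follows the MCP text-content convention used above, and the stub formatter stands in for the full class-dispatching one shown earlier:

```typescript
// A tool handler returns text for the LLM; errors are converted, never thrown.
type ToolHandler = (args: Record<string, unknown>) => Promise<string>;
type ToolResult = { content: Array<{ type: "text"; text: string }> };

// Stand-in for the article's formatter; the real one dispatches on error class.
function formatErrorForLLM(error: unknown): string {
  const message = error instanceof Error ? error.message : String(error);
  return `❌ Error\n${message}\n🔧 Action Required:\nCall browser_snapshot() and retry.`;
}

// Wrap any handler so the LLM always receives readable text, never an exception.
function asLLMTool(handler: ToolHandler) {
  return async (args: Record<string, unknown>): Promise<ToolResult> => {
    try {
      return { content: [{ type: "text", text: await handler(args) }] };
    } catch (error) {
      return { content: [{ type: "text", text: formatErrorForLLM(error) }] };
    }
  };
}
```

Registering every tool through a wrapper like this guarantees no error class is accidentally left unformatted.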
What I Learned
Error messages are part of your API contract. When building for LLMs, they're as important as success responses.
"Tool name" is the most important piece of information. The LLM needs to know exactly what function to call to recover. Don't say "refresh" when you mean "
browser_snapshot()".Numbered lists beat prose. LLMs follow sequential steps better than paragraph instructions.
Distinguish expected from unexpected failures. "This is normal during navigation" prevents the LLM from treating temporary failures as fatal.
Track failures, classify at decision points. Don't decide if something is critical at the failure site—track it and decide where you have business context.
Warnings enable partial success. Operations can succeed with degraded functionality if failures are transparent.
Recovery plans beat descriptions. "Element not found" tells the LLM nothing. "Call get_elements(), find the element by name, retry with the correct ref" is a recovery plan.
Rich error classes enable specific guidance. UnknownRefError can give different recovery steps than StaleRefError because they have different properties.
The Shift
Traditional error handling is designed for debugging. You want stack traces, error codes, and technical details because you'll investigate and fix the code.
LLM error handling is designed for autonomous recovery. You want context, diagnosis, and exact recovery steps because the LLM will fix the situation without code changes.
The error message isn't documentation—it's a recovery script the LLM executes.
When you're building tools for autonomous AI use, every error is an opportunity for the system to self-heal. Make your error messages teach the LLM how.