Making the Browser Agent Resilient: Implementing Autonomous Error Handling and Retries

#ai #javascript #promptapi #errors

In the previous installments of our series on building browser-based AI agents, we tackled performance by offloading the reasoning loop to Web Workers, achieved persistence using IndexedDB, and mastered contextual intelligence using local RAG and few-shot prompting.

Our agent is fast, stateful, and intelligent. But in the messy, real world of front-end development, intelligence is not enough. To build truly autonomous systems, we must also build for resilience.

Until now, our agent loop made a fatal architectural assumption: it assumed that once the LLM decided to use a tool, that tool would execute perfectly, every time. As front-end developers, we know this is a naive dream. APIs time out, network requests fail with 500s, and third-party scripts crash. If your agent is truly autonomous, it must have a mechanism to recover from these failures on its own, before ever reporting back to the user with a broken result.

Today, we will implement autonomous error handling and retry logic right inside our ReAct loop.

The Architectural Mandate for Autonomy

When an AI agent’s tool execution fails, we need to classify that failure into one of two categories to determine the best response:

Recoverable Errors (Systemic): These are transient failures such as a temporary network hiccup (HTTP 502/503), a tool execution that times out, or a rate limit (HTTP 429). In these scenarios, the agent should not give up. It should autonomously decide to wait and retry, because there’s a high probability the next call will succeed.
Persistent Errors (Model Hallucinations/Logic Bugs): These are errors that retrying will not fix. This includes the model hallucinating a tool that doesn’t exist, providing incorrect parameters that crash a function, or a genuine bug in our tool’s implementation. In this scenario, retrying is a waste of resources. Instead, we must capture the error and feed it back into the model’s context window as an Observation. This forces the agent to autonomously reason: "My last action failed with this specific error message. I must now regenerate my plan, correct my parameters, or try a different approach entirely."

This system of retrying systemic failures and self-correcting logic failures is essential for enterprise-grade autonomy.
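The two branches described above can be sketched as a small pure function before wiring anything into the loop. This is only an illustration: planRecovery and the shape of its argument are hypothetical names, not part of the series' code, and the sketch assumes the caller has already classified the failure as transient or not.

```javascript
// Hypothetical helper: maps a classified failure to the agent's next move.
// - transient failure with retries left -> retry the same tool call
// - anything else -> hand the error back to the model as an Observation
function planRecovery({ transient, attempt, maxRetries, toolName, message }) {
    if (transient && attempt < maxRetries) {
        return { action: "retry", attempt: attempt + 1 };
    }
    return {
        action: "observe",
        observation: `Observation: Tool '${toolName}' failed because: ${message}.`
    };
}
```

Keeping the decision separate from the side effects (waiting, calling the tool, rewriting the prompt) makes the policy easy to unit test on its own.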

The Toolbox

To make our implementation clean and maintainable, we first create a set of essential helper functions in a new utils.js file, completely separate from the core agent engine.

export function isRecoverableError(error) {
    // Tolerate non-Error throwables (strings, undefined) so the classifier itself never throws.
    const msg = String(error?.message ?? error ?? "").toLowerCase();
    return (
        msg.includes("timeout") ||
        msg.includes("time out") ||
        msg.includes("fetch") ||
        msg.includes("network") ||
        msg.includes("http error") ||
        msg.includes("status 5") ||
        msg.includes("status 429") ||
        msg.includes("rate limit") ||
        error?.name === "TimeoutError"
    );
}

export function runWithTimeout(executeFn, input, timeoutMs = 3000) {
    return new Promise((resolve, reject) => {
        // Reject with a named TimeoutError if the tool does not settle in time.
        const timer = setTimeout(() => {
            const err = new Error("Tool execution timed out.");
            err.name = "TimeoutError";
            reject(err);
        }, timeoutMs);

        // Calling the tool inside a promise executor turns a synchronous throw
        // into a rejection, so the timer is cleared on every path.
        new Promise(res => res(executeFn(input)))
            .then(resolve, reject)
            .finally(() => clearTimeout(timer));
    });
}

export function delay(ms) {
    return new Promise(resolve => setTimeout(resolve, ms));
}

This toolkit allows us to enforce time limits on tool calls, categorize failures as transient or persistent, and wait before retrying, all essential patterns for building resilient systems on top of asynchronous native APIs.
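To see how the three helpers fit together before they go into the agent loop, here is a minimal standalone sketch. The helpers are inlined in condensed form so the snippet runs on its own. executeWithRetry, makeFlakyTool, and the option names are hypothetical and exist only for this illustration.

```javascript
// Condensed copies of the helpers above, inlined so this sketch runs on its own.
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

function isRecoverableError(error) {
    const msg = String(error?.message ?? error ?? "").toLowerCase();
    return error?.name === "TimeoutError" || /timeout|network|status 5|status 429/.test(msg);
}

function runWithTimeout(executeFn, input, timeoutMs) {
    return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
            const err = new Error("Tool execution timed out.");
            err.name = "TimeoutError";
            reject(err);
        }, timeoutMs);
        new Promise(res => res(executeFn(input)))
            .then(resolve, reject)
            .finally(() => clearTimeout(timer));
    });
}

// Hypothetical wrapper: retries transient failures, hands everything else back to the caller.
async function executeWithRetry(executeFn, input, { maxRetries = 3, timeoutMs = 3000, retryDelayMs = 1000 } = {}) {
    let attempt = 0;
    while (true) {
        try {
            const result = await runWithTimeout(executeFn, input, timeoutMs);
            return { ok: true, result, attempts: attempt + 1 };
        } catch (error) {
            // Persistent error, or retries exhausted: stop and report the failure.
            if (!isRecoverableError(error) || attempt >= maxRetries) {
                return { ok: false, error, attempts: attempt + 1 };
            }
            attempt++;
            await delay(retryDelayMs);
        }
    }
}

// A made-up flaky tool: fails a fixed number of times with a network error, then succeeds.
function makeFlakyTool(failuresBeforeSuccess) {
    let calls = 0;
    return async input => {
        calls++;
        if (calls <= failuresBeforeSuccess) throw new Error("Network error: status 503");
        return `weather for ${input}: sunny`;
    };
}
```

Calling executeWithRetry(makeFlakyTool(2), "Paris", { retryDelayMs: 10 }) should resolve with ok set to true after three attempts, while a tool that throws an unrelated error comes back after a single attempt with ok set to false.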

Integrating in the Loop

Now, we integrate this logic directly into our web worker’s core ReAct agent loop runReActLoop. We must replace the simple, direct tool execution with a robust while (retryCount <= maxRetries && !success) loop structure. This new block handles the classification and response to any tool execution failure entirely inside the background thread.

async function runReActLoop(userPrompt, sessionId) {
    let isComplete = false;
    let finalResult = "";
    let loopCount = 0;

    const historyTurns = await memory.getHistory(sessionId);
    const relevantTools = await toolRetriever.getRelevantTools(userPrompt, 3);
    const toolsMap = new Map(relevantTools.map(t => [t.name, t]));

    let currentTurnLog = `User: ${userPrompt}\n`;
    let currentPrompt = promptTemplate.format(relevantTools, historyTurns, userPrompt);

    while (!isComplete && loopCount < 7) {
        loopCount++;

        const responseText = await askLLM(currentPrompt);
        let response;

        try {
            response = JSON.parse(responseText);
        } catch (e) {
            currentPrompt = `Observation: Invalid JSON format received. You must respond strictly in JSON syntax.`;
            continue;
        }

        if (response.thought) {
            logToMain(`Thought: ${response.thought}`);
            currentTurnLog += `Thought: ${response.thought}\n`;
        }

        if (response.finalAnswer && response.finalAnswer.trim() !== "") {
            finalResult = response.finalAnswer;
            currentTurnLog += `Assistant: ${response.finalAnswer}\n`;
            isComplete = true;
        }
        else if (response.toolName && response.toolName !== "none" && toolsMap.has(response.toolName)) {
            logToMain(`Action: Running ${response.toolName} with input "${response.toolInput}"`);

            const tool = toolsMap.get(response.toolName);
            let toolResult;
            let success = false;
            let retryCount = 0;
            const maxRetries = 3;

            while (retryCount <= maxRetries && !success) {
                try {
                    toolResult = await runWithTimeout(tool.executeFn, response.toolInput, 3000);
                    success = true;
                } catch (err) {
                    if (isRecoverableError(err) && retryCount < maxRetries) {
                        retryCount++;
                        logToMain(`Observation: Tool failed with a transient error (${err.message}). Retry ${retryCount}/${maxRetries}...`);
                        await delay(1000);
                    } else {
                        currentTurnLog += `Action: ${response.toolName}("${response.toolInput}")\nObservation: Tool failed with error: ${err.message}\n`;
                        logToMain(`Observation: Tool failed with error: ${err.message}`);
                        currentPrompt = `Observation: Tool '${response.toolName}' failed because: ${err.message}. Please correct the input/parameters, try a different approach, or check tool availability, and try again.`;
                        break;
                    }
                }
            }

            if (success) {
                currentTurnLog += `Action: ${response.toolName}("${response.toolInput}")\nObservation: ${toolResult}\n`;
                logToMain(`Observation: ${toolResult}`);
                currentPrompt = `Observation from ${response.toolName}: ${toolResult}\nGiven this observation, output your next step as JSON:`;
            }
        }
        else if (response.toolName === "none" || response.toolName === "") {
            currentPrompt = `Observation: You set toolName to "none" but omitted a finalAnswer. Provide your final answer text in the JSON.`;
        }
        else {
            currentPrompt = `Observation: Tool '${response.toolName}' is not loaded. Select from available tools or use 'none'.`;
        }
    }

    if (finalResult) {
        historyTurns.push(currentTurnLog.trim());
        if (historyTurns.length > 10) historyTurns.shift();
        await memory.saveHistory(sessionId, historyTurns);
    }

    return finalResult || "Error: Reached maximum iterations.";
}
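The loop above waits a fixed second between attempts. A common refinement for rate limits and overloaded servers is exponential backoff with jitter, so that repeated retries spread out instead of hammering the service at a steady pace. The sketch below is one possible variant and not part of the original implementation. backoffDelayMs and its option names are hypothetical, and the defaults are arbitrary.

```javascript
// Hypothetical helper: exponential backoff with "full jitter".
// attempt is zero-based (0 for the first retry). The ceiling doubles on each
// attempt up to capMs, and the actual wait is a random value below the ceiling.
// random is injectable so the function can be tested deterministically.
function backoffDelayMs(attempt, { baseMs = 500, capMs = 8000, random = Math.random } = {}) {
    const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
    return Math.floor(random() * ceiling);
}
```

In the retry branch, where retryCount has already been incremented, the fixed wait could then become await delay(backoffDelayMs(retryCount - 1)).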

Summary

By merging our first-turn full prompt with stateful conversational delta prompting, few-shot templates, RAG tool retrieval, and now an error handling and retry mechanism, we have built a robust architectural pattern.

We are no longer just asking Chrome’s Prompt API to perform text generation. We have built an Autonomous Agent Platform. Our browser agent is not just growing up; it’s now resilient enough to face the challenges of a production environment.

What else do you think I should add to make the implementation even better?

If you are interested in the code, you can find it on my GitHub: https://github.com/gilf/prompt-chain.