In our journey to build an enterprise-grade AI agent directly in the browser, we have laid a solid foundation. In Part 1, we tackled performance by offloading the ReAct reasoning loop into Web Workers. In Part 2, we cured our agent’s “goldfish memory” by implementing a persistent, cross-thread IndexedDB storage layer.
Our agent is fast and it remembers. But as we start adding more tools and giving it more complex tasks, two massive structural problems emerge: Context Window Exhaustion and the dreaded Infinite Tool Loop.
Today, we are going to fix both. We will evolve our architecture by introducing a Retrieval-Augmented Generation (RAG) pattern for our tools, and we will stabilize the on-device model’s reasoning using Few-Shot Prompting and Conversational Delta Prompting.
Challenge 1: The Context Window Scalability Problem
When we first built our agent, we hardcoded the tool descriptions directly into the system prompt. If you have two tools (Calculator and FetchData), this works fine.
But what happens when your application scales? What if you have 50 specialized tools for checking user permissions, parsing CSVs, querying local databases, and triggering UI modals?
If you dump 50 tool descriptions into the prompt, two things can happen:
- You exhaust the token limit. Smaller on-device models like Gemini Nano have strict context windows.
- You confuse the model. The model struggles to find the “signal in the noise,” leading to hallucinated tool calls.
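To get a feel for the first problem, here is a rough sketch that estimates how much of a context window a tool listing consumes. The 4-characters-per-token ratio is a common rule of thumb, not the real tokenizer of Gemini Nano, and the 50-tool registry and `contextBudget` value are hypothetical.

```javascript
// Rough heuristic: ~4 characters per token for English text.
// This is an assumption, not the tokenizer of any specific on-device model.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Estimate the prompt cost of listing every tool as "- name: description".
function estimateToolListTokens(tools) {
  const listing = tools.map(t => `- ${t.name}: ${t.description}`).join("\n");
  return estimateTokens(listing);
}

// Hypothetical registry: 50 tools with 120-character descriptions.
const tools = Array.from({ length: 50 }, (_, i) => ({
  name: `Tool${i}`,
  description: "x".repeat(120),
}));

const contextBudget = 4096; // hypothetical budget; check your model's real limit
const used = estimateToolListTokens(tools);
const share = Math.round((used / contextBudget) * 100);
console.log(`Tool list: ~${used} tokens (${share}% of the budget)`);
```

Even with short descriptions, the listing takes a sizeable share of the budget before the instructions, examples, history, and user prompt are added.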
The Solution: A Lightweight Tool Retriever (RAG)
Instead of giving the model every tool, we should only give it the tools relevant to the user’s current request. This is a localized version of RAG (Retrieval-Augmented Generation).
To solve this, we introduce the ToolRetriever class into our web worker.
class ToolRetriever {
  constructor(toolsArray) {
    this.tools = toolsArray;
  }
  // A lightweight retrieval mechanism to find the top K relevant tools
  async getRelevantTools(userPrompt, topK = 3) {
    if (this.tools.length <= topK) return this.tools;
    // Tokenize the prompt once (not once per tool), drop duplicates,
    // and skip tokens of 3 characters or fewer, which are mostly stop words
    const queryTokens = [...new Set(userPrompt.toLowerCase().split(/\W+/))]
      .filter(token => token.length > 3);
    // Score each tool by how many prompt tokens appear in its name or description.
    // This is plain substring overlap, a stand-in for BM25 or an embedding search.
    const scoredTools = this.tools.map(tool => {
      const targetText = `${tool.name} ${tool.description}`.toLowerCase();
      const score = queryTokens.filter(token => targetText.includes(token)).length;
      return { tool, score };
    });
    // Highest score first, then keep the top K
    return scoredTools
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map(st => st.tool);
  }
}
By filtering our tools down to the topK (e.g., the top 3), we keep the prompt lean and focused. If a developer registers 100 tools, the model only ever sees the 3 that score highest for the task at hand. A thinner context leaves more room for the conversation itself and tends to produce more accurate responses from the model.
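To see the scoring in action, here is a standalone sketch that mirrors the token-overlap idea as a plain function and runs it against a small registry. The tool names and descriptions are made up for illustration, and `rankTools` is a hypothetical helper, not part of the agent's code.

```javascript
// Standalone sketch of token-overlap ranking (illustrative, not a drop-in replacement).
function rankTools(tools, userPrompt, topK = 3) {
  // Distinct prompt tokens longer than 3 characters.
  const tokens = [...new Set(userPrompt.toLowerCase().split(/\W+/))]
    .filter(token => token.length > 3);

  return tools
    .map(tool => {
      const text = `${tool.name} ${tool.description}`.toLowerCase();
      // Substring match: "user" would also match "users".
      const score = tokens.filter(token => text.includes(token)).length;
      return { name: tool.name, score };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Hypothetical tools, for illustration only.
const registry = [
  { name: "Calculator", description: "Evaluates arithmetic expressions" },
  { name: "ParseCsv", description: "Parses uploaded spreadsheet files into rows" },
  { name: "CheckPermissions", description: "Checks user permissions for a resource" },
  { name: "OpenModal", description: "Opens a dialog in the page" },
];

const ranked = rankTools(registry, "Does this user have permissions to view the report?", 2);
console.log(ranked);
```

In this run, CheckPermissions matches on "user" and "permissions" and takes the first slot, while the second slot goes to a tool with a score of 0. Dropping zero-score tools, or swapping the overlap score for embeddings, would be natural refinements if irrelevant tools start crowding the prompt.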
Challenge 2: The Infinite Tool Loop
If you have spent any time building ReAct agents, you have likely encountered the infinite loop. It looks like this:
Thought: I need to calculate 542 * 13.
Action: Calculator("542 * 13")
Observation: 7046
Thought: I need to calculate 542 * 13.
Action: Calculator("542 * 13")
(…repeats until max iterations reached)
Why does this happen? It boils down to Context Recency Bias and Format Mismatch. Smaller models struggle with zero-shot JSON generation. When we just throw data at them, they lose track of the conversation structure, ignore the Observation, and default back to trying to solve the original user prompt over and over again.
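One way to make this failure visible at runtime is to fingerprint each requested action and flag exact repeats. The sketch below is a complementary safeguard rather than the fix this article builds; `createRepeatGuard` and its `maxRepeats` parameter are hypothetical names.

```javascript
// Hypothetical helper: flags an action once it has been requested
// more than `maxRepeats` times with the same tool name and input.
function createRepeatGuard(maxRepeats = 1) {
  const seen = new Map();

  return function isRepeating(toolName, toolInput) {
    // JSON.stringify on a pair avoids collisions between name and input text.
    const key = JSON.stringify([toolName, toolInput]);
    const count = (seen.get(key) || 0) + 1;
    seen.set(key, count);
    return count > maxRepeats;
  };
}

const isRepeating = createRepeatGuard(1);
const first = isRepeating("Calculator", "542 * 13");  // first request
const second = isRepeating("Calculator", "542 * 13"); // identical repeat
const other = isRepeating("Calculator", "7046 + 1");  // different input
console.log({ first, second, other });
```

When the guard trips, the loop could send back an observation telling the model it already has this result, instead of executing the tool again.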
The Solution: Prompt Templates & Few-Shot Examples
To fix this, we must rigidly structure how the model perceives its own actions. We do this by creating a PromptTemplate class.
Instead of trusting the model to figure out the JSON format, we provide Few-Shot Examples. We explicitly show it what a successful tool interaction looks like. Notice the --- Execution Trace --- block below. It is critical for helping the model differentiate between user input and its own past actions.
class PromptTemplate {
  constructor() {
    // Template literal content stays flush left so no stray indentation reaches the model
    this.systemInstruction = `You are an autonomous AI agent with long-term memory. Think step-by-step.
You must STRICTLY output valid JSON matching the schema.
Rules:
1. If you need data, set "toolName" to a tool and "toolInput" to the query. Leave "finalAnswer" as "".
2. If you know the answer, set "toolName" to "none" and put the answer in "finalAnswer".`;
    this.fewShotExamples = `
--- Example 1: Using a Tool ---
User: What is the current stock price of Apple?
--- Execution Trace ---
{"thought": "I need to look up the real-time stock price for Apple (AAPL).", "toolName": "FetchStockPrice", "toolInput": "AAPL", "finalAnswer": ""}
Observation from FetchStockPrice: 175.50
{"thought": "I have the observation. I can now provide the final answer.", "toolName": "none", "toolInput": "", "finalAnswer": "The current stock price of Apple is $175.50."}
--- Example 2: Answering Directly ---
User: What is the capital of France?
{"thought": "I know the capital of France is Paris. No tool is needed.", "toolName": "none", "toolInput": "", "finalAnswer": "The capital of France is Paris."}
`;
  }
  format(relevantTools, historyTurns, userPrompt) {
    // The "none" option is always appended in the template below,
    // so an empty tool list only needs a short note (not a second "none" entry)
    const toolDescriptions = relevantTools.length > 0
      ? relevantTools.map(t => `- ${t.name}: ${t.description}`).join('\n')
      : "(No external tools are available for this query.)";
    return `${this.systemInstruction}
Available tools for this request:
${toolDescriptions}
- none: Use this if you do not need a tool.
${this.fewShotExamples}
--- Current Conversation ---
Prior History:
${historyTurns.length > 0 ? historyTurns.join('\n') : "No prior history."}
User: ${userPrompt}
Output your next step as JSON:`;
  }
}
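Since the template promises a fixed JSON shape, it can help to validate each reply against that shape before acting on it. The following is a minimal sketch; `parseAgentStep` is a hypothetical helper, and the field list is taken from the few-shot examples in the template.

```javascript
// The four fields every step is expected to carry, per the few-shot examples.
const STEP_FIELDS = ["thought", "toolName", "toolInput", "finalAnswer"];

// Hypothetical helper: returns { ok: true, step } or { ok: false, error }.
function parseAgentStep(responseText) {
  let parsed;
  try {
    parsed = JSON.parse(responseText);
  } catch (e) {
    return { ok: false, error: "Response is not valid JSON." };
  }
  // JSON.parse also accepts null, numbers, strings, and arrays.
  if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
    return { ok: false, error: "Response must be a JSON object." };
  }
  for (const field of STEP_FIELDS) {
    if (typeof parsed[field] !== "string") {
      return { ok: false, error: `Field "${field}" must be a string.` };
    }
  }
  return { ok: true, step: parsed };
}

const good = parseAgentStep(
  '{"thought": "Need the price.", "toolName": "FetchStockPrice", "toolInput": "AAPL", "finalAnswer": ""}'
);
const bad = parseAgentStep("Sure! Here is the JSON you asked for...");
console.log(good.ok, bad.ok, bad.error);
```

The `error` string can be fed back to the model as an observation, which gives it a specific correction to act on rather than a generic failure.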
The Master Stroke: Conversational Delta Prompting
Even with the template, there is one final hurdle. The Chrome Prompt API is inherently stateful and conversational: a session keeps everything sent to it so far. If we re-send the full PromptTemplate on every loop iteration, the instruction set piles up in the session's context, which eats into the token budget and confuses the model when it sees the same instructions repeated mid-conversation.
We must switch to Conversational Delta Prompting.
- Turn 1: We send the massive PromptTemplate (Instructions, Tools, Examples, User Prompt).
- Turn 2 (and beyond): We only send the observation back to the model.
Because Chrome’s API maintains the session state, the model remembers the instructions from Turn 1. Here is how we update the core runReActLoop to support this:
async function runReActLoop(userPrompt, sessionId) {
  const MAX_ITERATIONS = 7;
  let isComplete = false;
  let finalResult = "";
  let loopCount = 0;
  const historyTurns = await memory.getHistory(sessionId);
  const relevantTools = await toolRetriever.getRelevantTools(userPrompt, 3);
  const toolsMap = new Map(relevantTools.map(t => [t.name, t]));
  let currentTurnLog = `User: ${userPrompt}\n`;
  // Turn 1 sends the full template. Every later turn overwrites currentPrompt with only the delta.
  let currentPrompt = promptTemplate.format(relevantTools, historyTurns, userPrompt);
  while (!isComplete && loopCount < MAX_ITERATIONS) {
    loopCount++;
    const responseText = await askLLM(currentPrompt);
    let response;
    try {
      response = JSON.parse(responseText);
      // JSON.parse also accepts null, numbers, and arrays; the loop needs an object
      if (response === null || typeof response !== "object" || Array.isArray(response)) {
        throw new Error("Response is not a JSON object");
      }
    } catch (e) {
      currentPrompt = `Observation: Invalid JSON format received. You must respond strictly in JSON syntax.`;
      continue;
    }
    if (response.thought) {
      logToMain(`Thought: ${response.thought}`);
      currentTurnLog += `Thought: ${response.thought}\n`;
    }
    // Guard against a non-string finalAnswer before calling trim()
    if (typeof response.finalAnswer === "string" && response.finalAnswer.trim() !== "") {
      finalResult = response.finalAnswer;
      currentTurnLog += `Assistant: ${response.finalAnswer}\n`;
      isComplete = true;
    }
    else if (response.toolName && response.toolName !== "none" && toolsMap.has(response.toolName)) {
      logToMain(`Action: Running ${response.toolName} with input "${response.toolInput}"`);
      try {
        const tool = toolsMap.get(response.toolName);
        const toolResult = await tool.executeFn(response.toolInput);
        currentTurnLog += `Action: ${response.toolName}("${response.toolInput}")\nObservation: ${toolResult}\n`;
        logToMain(`Observation: ${toolResult}`);
        currentPrompt = `Observation from ${response.toolName}: ${toolResult}\nGiven this observation, output your next step as JSON:`;
      } catch (err) {
        currentTurnLog += `Observation: Tool failed with error: ${err.message}\n`;
        currentPrompt = `Observation: Tool failed with error: ${err.message}\nGiven this observation, output your next step as JSON:`;
      }
    }
    // Covers "none", an empty string, and a missing toolName field
    else if (!response.toolName || response.toolName === "none") {
      currentPrompt = `Observation: You set toolName to "none" but omitted a finalAnswer. Provide your final answer text in the JSON.`;
    }
    else {
      currentPrompt = `Observation: Tool '${response.toolName}' is not loaded. Select from available tools or use 'none'.`;
    }
  }
  // Persist only completed turns, keeping the 10 most recent
  if (finalResult) {
    historyTurns.push(currentTurnLog.trim());
    if (historyTurns.length > 10) historyTurns.shift();
    await memory.saveHistory(sessionId, historyTurns);
  }
  return finalResult || "Error: Reached maximum iterations.";
}
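To make the delta idea concrete, the sketch below compares how many characters reach the model when the full template is re-sent every turn versus when only observations are sent after Turn 1. The `nextPrompt` helper and the sample strings are hypothetical, and the real savings depend on your template and tool output.

```javascript
// Hypothetical helper: Turn 1 gets the full template, later turns only the delta.
function nextPrompt(turnIndex, fullTemplate, observation) {
  if (turnIndex === 0) return fullTemplate;
  return `Observation: ${observation}\nGiven this observation, output your next step as JSON:`;
}

// Stand-in for a long formatted template (instructions, tools, examples, user prompt).
const fullTemplate =
  "SYSTEM RULES + TOOLS + FEW-SHOT EXAMPLES ".repeat(40) + "User: What is 542 * 13?";

// Index 0 has no observation yet; later turns carry one tool result each.
const observations = [null, "7046", "Rounded result is 7000"];

let deltaChars = 0;
let resendChars = 0;
observations.forEach((observation, turnIndex) => {
  deltaChars += nextPrompt(turnIndex, fullTemplate, observation).length;
  // Naive alternative: re-send the template plus the observation on every turn.
  resendChars += fullTemplate.length + (observation ? observation.length : 0);
});

console.log({ deltaChars, resendChars });
```

The naive total grows by a full template per iteration, while the delta total grows only by the size of each observation, which is why the loop above overwrites `currentPrompt` instead of rebuilding it.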
Summary
By combining a localized RAG Tool Retriever with strict Few-Shot Templates and Conversational Delta Prompting, we have transformed a brittle proof-of-concept into a more reliable agent engine.
It can now handle hundreds of registered tools without blowing the token limit, and it gracefully processes multi-step tool interactions without falling into infinite, amnesiac loops.
Our browser-based agent is growing up.
If you are interested in the code, you can find it on my GitHub: https://github.com/gilf/prompt-chain.
