Escaping the Token Limit: Context Compression and Infinite Memory for Browser AI Agents

#agents #tokeneconomy #ai #promptapi

In our ongoing series on building enterprise-grade AI agents directly in the browser, we’ve tackled some serious architectural hurdles. We offloaded the heavy ReAct loop to Web Workers to save the main thread, added IndexedDB for persistence, implemented RAG for dynamic tool retrieval, and built autonomous retry logic to handle inevitable systemic failures.

Our agent is now fast, stateful, and highly resilient. But as users engage with it in longer, more complex sessions, we hit a hard limit of on-device AI: the context window.

Local models like Chrome’s built-in Gemini Nano are incredibly powerful, but they have strict token limits. Up until now, our AgentMemory implementation used a simple sliding window, keeping only the last 10 conversational turns and blindly dropping the rest.

The problem? If a user tells the agent their dietary preferences on Turn 1, and asks for a recipe on Turn 12, the agent has completely forgotten the constraints. Truncation causes amnesia.
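
To make the failure concrete, here is a minimal sketch of that sliding window. The slidingWindow function and the sample turns are illustrative only and are not taken from the series code.

```javascript
// Sliding-window memory as described above: keep only the newest N turns.
function slidingWindow(turns, maxTurns = 10) {
    return turns.slice(-maxTurns);
}

// Twelve illustrative turns; only the first one carries the dietary constraint.
const turns = ["User: I'm vegetarian and allergic to peanuts."];
for (let i = 2; i <= 11; i++) {
    turns.push(`User: (small talk, turn ${i})`);
}
turns.push("User: Suggest a dinner recipe.");

const windowed = slidingWindow(turns);
console.log(windowed.length);                                // 10
console.log(windowed.some(t => t.includes("vegetarian")));   // false
```

By the time the recipe request arrives, the turn that stated the constraint is no longer part of what the model sees.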

Today, we are going to solve this by introducing context compression via summarization. We will recursively summarize the conversation in the background, so the agent can carry key facts across an arbitrarily long session without overflowing the Prompt API’s token limits. The memory is lossy, since each summarization pass can drop details, but it no longer has a hard cutoff.

The Architecture of Context Compression

The concept is straightforward but requires careful orchestration within our Web Worker. Instead of simply pushing raw text into an array until it overflows, we introduce a background compression cycle.

Here is the strategy:

The threshold: We let the raw conversation history grow up to a fixed limit (5 turns in this implementation).
The summarization prompt: Once the threshold is hit, we construct a separate prompt in the background. It asks the LLM to read the existing summary plus the new raw turns and generate a combined, updated summary.
The recency buffer: We keep the updated summary as our “long-term background context” and retain only the most recent raw turns (2 in this implementation) for immediate, short-term conversational flow.
Persistence: We save both the summary and the truncated history back to IndexedDB.
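
The bookkeeping behind this cycle can be traced without calling a model at all. The sketch below is a simulation only: simulateHistoryLength is a hypothetical helper, and it assumes the same numbers used later in this post (a threshold of 5 and a recency buffer of 2).

```javascript
// Simulates only the turn counting of the strategy above; no model is called.
function simulateHistoryLength(totalTurns, threshold = 5, keep = 2) {
    const lengths = [];
    let rawTurns = 0;
    let summarizations = 0;

    for (let turn = 1; turn <= totalTurns; turn++) {
        rawTurns++;                    // a new raw turn is appended
        if (rawTurns >= threshold) {   // threshold hit: summarize, keep the recency buffer
            summarizations++;
            rawTurns = keep;
        }
        lengths.push(rawTurns);        // raw turns stored after this turn
    }
    return { lengths, summarizations };
}

const result = simulateHistoryLength(11);
console.log(result.lengths.join(" "));   // 1 2 3 4 2 3 4 2 3 4 2
console.log(result.summarizations);      // 3
```

With these settings the stored raw history never exceeds four turns, and after the first compression a new summarization pass runs every three turns (threshold minus recency buffer).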

Let’s dive into the implementation.

Upgrading IndexedDB for Summaries

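First, we need to update our AgentMemory class to support storing and retrieving a summary string alongside the raw history array. Because IndexedDB object stores are schemaless, adding a field to the stored records requires no database version bump or migration; records saved before this change simply read back with an empty summary.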

export class AgentMemory {
    constructor(dbName = "AgentMemoryDB", storeName = "conversations") {
        this.dbName = dbName;
        this.storeName = storeName;
        this.db = null;
    }

    init() {
        return new Promise((resolve, reject) => {
            const request = indexedDB.open(this.dbName, 1);

            request.onupgradeneeded = (e) => {
                const db = e.target.result;
                if (!db.objectStoreNames.contains(this.storeName)) {
                    db.createObjectStore(this.storeName, { keyPath: "sessionId" });
                }
            };

            request.onsuccess = (e) => {
                this.db = e.target.result;
                resolve();
            };

            request.onerror = (e) => reject(e.target.error);
        });
    }

    getHistory(sessionId) {
        return new Promise((resolve) => {
            const tx = this.db.transaction(this.storeName, "readonly");
            const store = tx.objectStore(this.storeName);
            const request = store.get(sessionId);

            request.onsuccess = () => {
                if (request.result) {
                    resolve({
                        history: request.result.history || [],
                        summary: request.result.summary || ""
                    });
                } else {
                    resolve({ history: [], summary: "" });
                }
            };
            request.onerror = () => resolve({ history: [], summary: "" });
        });
    }

    saveHistory(sessionId, history, summary = "") {
        return new Promise((resolve, reject) => {
            const tx = this.db.transaction(this.storeName, "readwrite");
            const store = tx.objectStore(this.storeName);
            const request = store.put({ sessionId, history, summary });

            request.onsuccess = () => resolve();
            request.onerror = (e) => reject(e.target.error);
        });
    }
}
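
The reason no migration is needed is the defaulting inside getHistory. The same logic can be sketched as a pure function; normalizeRecord is a hypothetical helper name used here for illustration and is not part of the class above.

```javascript
// Hypothetical helper mirroring the defaulting inside getHistory().
function normalizeRecord(record) {
    if (!record) {
        return { history: [], summary: "" };   // unknown session
    }
    return {
        history: record.history || [],
        summary: record.summary || ""          // legacy records have no summary yet
    };
}

// A record written before the summary field existed.
const legacy = { sessionId: "abc", history: ["User: hi\nAssistant: hello"] };
console.log(normalizeRecord(legacy).summary === "");      // true
console.log(normalizeRecord(undefined).history.length);   // 0
```

A version bump in indexedDB.open would only be required if we added a new object store or index, which this change does not do.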

The Compression Engine

Next, we build the core logic for the compression. This function needs access to the askLLM helper from our worker so it can trigger a background prompt.

Notice a critical architectural detail here: when we call askLLM(summaryPrompt, null), we pass null for the JSON schema. While our main ReAct loop relies on strict JSON to execute tools, summarization is a pure text-generation task. We drop the schema constraint to let the model write freely.

export async function compressHistory(historyTurns, conversationSummary, askLLM, logToMain) {
    const SUMMARIZATION_THRESHOLD = 5;  // compress once this many raw turns have accumulated
    const RECENCY_TURNS_TO_KEEP = 2;    // raw turns kept verbatim after compression

    if (historyTurns.length < SUMMARIZATION_THRESHOLD) {
        return { historyTurns, updatedSummary: conversationSummary };
    }

    logToMain("Summarizing conversation context to compress memory...");
    let summaryPrompt = "";
    if (conversationSummary) {
        summaryPrompt = `Based on the following existing summary and the new conversation history, write an updated, concise summary that retains all key facts, decisions, and user preferences.
            Existing Summary:
            ${conversationSummary}

            New Conversation History:
            ${historyTurns.join('\n')}

            Output only the updated summary text. Do not output JSON.`;
    } else {
        summaryPrompt = `Based on the following conversation history, write a concise summary that retains all key facts, decisions, and user preferences.
            Conversation History:
            ${historyTurns.join('\n')}

            Output only the summary text. Do not output JSON.`;
    }

    let updatedSummary = conversationSummary;
    let updatedHistory = historyTurns;
    try {
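        const rawSummary = await askLLM(summaryPrompt, null);
        const trimmedSummary = (rawSummary || "").trim();
        // Guard: never swap a good summary (and drop raw turns) for an empty response.
        if (!trimmedSummary) {
            throw new Error("LLM returned an empty summary");
        }
        updatedSummary = trimmedSummary;
        logToMain(`New Conversation Summary: ${updatedSummary}`);
        updatedHistory = historyTurns.slice(-RECENCY_TURNS_TO_KEEP);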
    } catch (err) {
        logToMain(`Failed to summarize conversation: ${err.message}. Saving history without summarization.`);
    }

    return { historyTurns: updatedHistory, updatedSummary };
}

This function lives in utils.js.
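
The worker’s askLLM helper is not shown in this post. As a rough sketch of how a null schema could translate into an unconstrained call, assume the helper wraps a Prompt API session and requests structured output through the responseConstraint option. The names buildPromptOptions and askWithOptionalSchema are hypothetical, and the session is passed in explicitly only to keep the example self-contained.

```javascript
// Hypothetical: build the options object handed to session.prompt().
function buildPromptOptions(schema) {
    // A null or undefined schema means plain text generation with no JSON constraint.
    return schema ? { responseConstraint: schema } : {};
}

// Hypothetical stand-in for the worker's askLLM. `session` is assumed to
// expose prompt(text, options) the way a Prompt API session does.
async function askWithOptionalSchema(session, promptText, schema = null) {
    return session.prompt(promptText, buildPromptOptions(schema));
}

// Fake session that records its calls, for illustration only.
const fakeSession = {
    calls: [],
    async prompt(text, options) {
        this.calls.push({ text, options });
        return "  stub summary  ";
    }
};

askWithOptionalSchema(fakeSession, "Summarize the conversation.", null).then((text) => {
    console.log(text.trim());                      // stub summary
    console.log(fakeSession.calls[0].options);     // {}
});
```

The ReAct loop would pass its JSON schema through the same path, so the only difference between a tool-calling step and a summarization step is whether the constraint is present.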

Injecting the Summary into the Prompt

With the summary generated, we must ensure the agent is actually aware of it during its reasoning phase. We update our PromptTemplate to inject the summary right above the immediate conversational history.

export class PromptTemplate {
    constructor() {
        this.systemInstruction = `You are an autonomous AI agent with long-term memory. Think step-by-step.
            You must STRICTLY output valid JSON matching the schema.

            Rules:
            1. If you need data, set "toolName" to a tool and "toolInput" to the query. Leave "finalAnswer" as "".
            2. If you know the answer, set "toolName" to "none" and put the answer in "finalAnswer".`;

        this.fewShotExamples = `
            --- Example 1: Using a Tool ---
            User: What is the current stock price of Apple?
            {"thought": "I need to look up the real-time stock price for Apple (AAPL).", "toolName": "FetchStockPrice", "toolInput": "AAPL", "finalAnswer": ""}
            Observation from FetchStockPrice: 175.50
            {"thought": "I have the observation. I can now provide the final answer.", "toolName": "none", "toolInput": "", "finalAnswer": "The current stock price of Apple is $175.50."}

            --- Example 2: Answering Directly ---
            User: What is the capital of France?
            {"thought": "I know the capital of France is Paris. No tool is needed.", "toolName": "none", "toolInput": "", "finalAnswer": "The capital of France is Paris."}
            `;
    }

    format(relevantTools, historyTurns, userPrompt, summary = "") {
        // The "none" option is always listed below, so the empty case must not repeat it.
        const toolDescriptions = relevantTools.length > 0
            ? relevantTools.map(t => `- ${t.name}: ${t.description}`).join('\n')
            : "(No external tools were retrieved for this query.)";

        const summaryPart = summary
            ? `Conversation Summary (Background Context):\n${summary}\n\n`
            : "";

        return `${this.systemInstruction}
            Available tools for this request:
            ${toolDescriptions}
            - none: Use this if you do not need a tool.

            ${this.fewShotExamples}

            --- Current Conversation ---
            ${summaryPart}Prior History:
            ${historyTurns.length > 0 ? historyTurns.join('\n') : "No prior history."}

            User: ${userPrompt}
            Output your next step as JSON:`;
    }
}
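
To see what the model actually receives, the sketch below reproduces only the “Current Conversation” tail of the prompt. renderConversation is a hypothetical, trimmed-down stand-in for PromptTemplate.format (without the indentation of the original template literal), and the sample summary and turns are invented for illustration.

```javascript
// Mirrors only the "Current Conversation" tail of PromptTemplate.format().
function renderConversation(historyTurns, userPrompt, summary = "") {
    const summaryPart = summary
        ? `Conversation Summary (Background Context):\n${summary}\n\n`
        : "";
    const history = historyTurns.length > 0
        ? historyTurns.join("\n")
        : "No prior history.";
    return `--- Current Conversation ---\n${summaryPart}Prior History:\n${history}\n\nUser: ${userPrompt}`;
}

console.log(renderConversation(
    ["User: Any lunch ideas?\nAssistant: How about a lentil salad?"],
    "Suggest a dinner recipe.",
    "The user is vegetarian and allergic to peanuts."
));
// --- Current Conversation ---
// Conversation Summary (Background Context):
// The user is vegetarian and allergic to peanuts.
//
// Prior History:
// User: Any lunch ideas?
// Assistant: How about a lentil salad?
//
// User: Suggest a dinner recipe.
```

The summary sits above the raw turns, so the constraint from early in the session is still visible even though the turn that stated it was compressed away.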

Closing the Loop in the Worker

Finally, we tie it all together in our runReActLoop. Because we've already abstracted the heavy lifting, the integration is small: we load the summary before the loop starts, pass it to the template, and compress the history right before saving.

async function runReActLoop(userPrompt, sessionId) {
    let isComplete = false;
    let finalResult = "";
    let loopCount = 0;

    let { history: historyTurns, summary: conversationSummary } = await memory.getHistory(sessionId);
    const relevantTools = await toolRetriever.getRelevantTools(userPrompt, 3);
    const toolsMap = new Map(relevantTools.map(t => [t.name, t]));

    let currentTurnLog = `User: ${userPrompt}\n`;
    let currentPrompt = promptTemplate.format(relevantTools, historyTurns, userPrompt, conversationSummary);

    while (!isComplete && loopCount < 7) {
        loopCount++;

        const responseText = await askLLM(currentPrompt);
        let response;

        try {
            response = JSON.parse(responseText);
        } catch (e) {
            currentPrompt = `Observation: Invalid JSON format received. You must respond strictly in JSON syntax.`;
            continue;
        }

        if (response.thought) {
            logToMain(`Thought: ${response.thought}`);
            currentTurnLog += `Thought: ${response.thought}\n`;
        }

        if (response.finalAnswer && response.finalAnswer.trim() !== "") {
            finalResult = response.finalAnswer;
            currentTurnLog += `Assistant: ${response.finalAnswer}\n`;
            isComplete = true;
        }
        else if (response.toolName && response.toolName !== "none" && toolsMap.has(response.toolName)) {
            logToMain(`Action: Running ${response.toolName} with input "${response.toolInput}"`);

            const tool = toolsMap.get(response.toolName);
            let toolResult;
            let success = false;
            let retryCount = 0;
            const maxRetries = 3;

            while (retryCount <= maxRetries && !success) {
                try {
                    toolResult = await runWithTimeout(tool.executeFn, response.toolInput, 3000);
                    success = true;
                } catch (err) {
                    if (isRecoverableError(err) && retryCount < maxRetries) {
                        retryCount++;
                        logToMain(`Observation: Tool failed with a recoverable error (${err.message}). Retrying (${retryCount}/${maxRetries})...`);
                        await delay(1000);
                    } else {
                        currentTurnLog += `Action: ${response.toolName}("${response.toolInput}")\nObservation: Tool failed with error: ${err.message}\n`;
                        logToMain(`Observation: Tool failed with error: ${err.message}`);
                        currentPrompt = `Observation: Tool '${response.toolName}' failed because: ${err.message}. Please correct the input/parameters, try a different approach, or check tool availability, and try again.`;
                        break;
                    }
                }
            }

            if (success) {
                currentTurnLog += `Action: ${response.toolName}("${response.toolInput}")\nObservation: ${toolResult}\n`;
                logToMain(`Observation: ${toolResult}`);
                currentPrompt = `Observation from ${response.toolName}: ${toolResult}\nGiven this observation, output your next step as JSON:`;
            }
        }
        else if (response.toolName === "none" || response.toolName === "") {
            currentPrompt = `Observation: You set toolName to "none" but omitted a finalAnswer. Provide your final answer text in the JSON.`;
        }
        else {
            currentPrompt = `Observation: Tool '${response.toolName}' is not loaded. Select from available tools or use 'none'.`;
        }
    }

    if (finalResult) {
        historyTurns.push(currentTurnLog.trim());
        const compressionResult = await compressHistory(historyTurns, conversationSummary, askLLM, logToMain);
        await memory.saveHistory(sessionId, compressionResult.historyTurns, compressionResult.updatedSummary);
    }

    return finalResult || "Error: Reached maximum iterations.";
}
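
One edge case in this loop is worth noting: if summarization fails, compressHistory returns the full history, so repeated failures would let the stored history grow past the window again. A possible safety net is to trim before saving. The sketch below assumes a hypothetical cap of 10 raw turns, matching the sliding window this series used before compression; enforceHistoryCap and MAX_RAW_TURNS are illustrative names, not part of the original code.

```javascript
// Hypothetical cap; 10 matches the sliding window used before compression was added.
const MAX_RAW_TURNS = 10;

// Trim to the newest turns, but only when the cap is exceeded.
function enforceHistoryCap(historyTurns, maxTurns = MAX_RAW_TURNS) {
    return historyTurns.length > maxTurns
        ? historyTurns.slice(-maxTurns)
        : historyTurns;
}

// Simulate a history that kept growing because every summarization attempt failed.
const grown = Array.from({ length: 14 }, (_, i) => `User: turn ${i + 1}`);
const capped = enforceHistoryCap(grown);
console.log(capped.length);   // 10
console.log(capped[0]);       // User: turn 5
```

In runReActLoop this would wrap compressionResult.historyTurns in the call to memory.saveHistory, so the worst case degrades to the old sliding window rather than to an oversized prompt.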

Summary

By implementing this background compression cycle, we have decoupled our agent’s memory capacity from the size of the model’s context window.

The main thread remains completely unblocked. The agent retains long-term user preferences in a highly compressed format, while keeping the last few turns completely raw to ensure the conversational interaction feels natural and responsive.

Building robust front-end architectures isn’t just about managing the DOM anymore. It’s about managing state, asynchronous orchestration, and now, LLM token economics.

If you are interested in the code, you can find it on my GitHub: https://github.com/gilf/prompt-chain.