DEV Community

Webby Wisp

Why Your AI Agent Failed to Remember: Context Window Management in Claude

You built a shiny new AI agent. It runs for a few turns, answers questions beautifully — and then suddenly hits a wall. The agent stops understanding earlier context, gives nonsensical answers, or costs you a fortune in token spend. What happened?

Context window management.

Most developers overlook this until it breaks. But if you're building production AI agents, managing context is non-negotiable. I've shipped agents that died silently from poor context handling, and learned the hard way. Let me save you that pain.

The Problem: Context Isn't Free

Claude's context window is large (200K tokens for current models), but it's not infinite. Every message, system prompt, and tool call eats into it. When you hit the limit:

  1. You lose older messages — the agent forgets conversation history
  2. Token costs spike — each new message reprocesses remaining context
  3. Quality degrades — without history, decisions get worse
  4. Users get frustrated — "But we literally just talked about this!"

Most agent frameworks punt on this problem. They either crash when context overflows or let you waste money reprocessing the same data.
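How fast does the bill arrive? Because every call resends the full history, input-token spend grows quadratically with turn count. Here's a back-of-the-envelope sketch in plain JavaScript (the 500-tokens-per-message figure is an illustrative assumption, not a measured average):

```javascript
// Rough cost model: every turn sends the whole history back to the API,
// so input tokens per call grow linearly and cumulative spend quadratically.
function cumulativeInputTokens(turns, tokensPerMessage = 500) {
  let history = 0; // tokens currently in the conversation
  let total = 0;   // input tokens billed across all calls so far
  for (let i = 0; i < turns; i++) {
    history += tokensPerMessage; // the new user message
    total += history;            // the entire history is sent as input
    history += tokensPerMessage; // the assistant reply joins the history
  }
  return total;
}

console.log(cumulativeInputTokens(10));  // 50000
console.log(cumulativeInputTokens(100)); // 5000000 -- 10x the turns, 100x the tokens
```

Under this model, total spend works out to 500 × turns², which is exactly the "token costs spike" failure mode above: a conversation ten times longer costs a hundred times more.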

The Solution: Intentional Summarization

The fix is simple but requires discipline: summarize and prune context before it becomes a problem.

Here's a practical approach I use in production:

class ContextAwareAgent {
  constructor(maxContextTokens = 50000) {
    this.messages = [];
    this.maxContextTokens = maxContextTokens;
    this.summaries = []; // Keep summaries of pruned sections
  }

  estimateTokens(text) {
    // Rough estimate: 1 token ≈ 4 characters
    // For production, use actual tokenizer (js-tiktoken)
    return Math.ceil(text.length / 4);
  }

  calculateCurrentContext() {
    let total = this.messages.reduce((sum, msg) => {
      return sum + this.estimateTokens(msg.content);
    }, 0);

    // Add summaries too
    total += this.summaries.reduce((sum, summary) => {
      return sum + this.estimateTokens(summary);
    }, 0);

    return total;
  }

  async pruneOldMessages() {
    if (this.calculateCurrentContext() < this.maxContextTokens) {
      return; // Not yet necessary
    }

    // Keep the most recent N messages
    const keepCount = 5;
    const toPrune = this.messages.slice(0, -keepCount);
    const toKeep = this.messages.slice(-keepCount);

    if (toPrune.length === 0) return;

    // Summarize what we're removing
    const prunedText = toPrune
      .map(m => `${m.role}: ${m.content}`)
      .join('\n\n');

    // Use a raw API call here rather than callClaude(), which would
    // recurse into pruning and record the summary as a conversation turn.
    const summary = await this.rawRequest(
      'You are a summarization assistant.',
      [{
        role: 'user',
        content: `Summarize this conversation section in 2-3 sentences, preserving key decisions and facts:\n\n${prunedText}`,
      }]
    );

    this.summaries.push(summary);
    this.messages = toKeep;
  }

  async addMessage(role, content) {
    this.messages.push({ role, content });
    await this.pruneOldMessages();
  }

  async rawRequest(system, messages) {
    const response = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'x-api-key': process.env.ANTHROPIC_API_KEY,
        'anthropic-version': '2023-06-01',
        'content-type': 'application/json',
      },
      body: JSON.stringify({
        model: 'claude-3-5-sonnet-20241022',
        max_tokens: 1024,
        system,
        messages,
      }),
    });

    if (!response.ok) {
      throw new Error(`Anthropic API error: ${response.status}`);
    }

    const data = await response.json();
    return data.content[0].text;
  }

  async callClaude(userInput) {
    // Store the user turn (which also prunes if we're over budget),
    // then send the remaining history plus any accumulated summaries.
    await this.addMessage('user', userInput);

    const system = [
      'You are a helpful AI agent.',
      this.summaries.length > 0
        ? `Previous context summaries:\n\n${this.summaries.join('\n\n')}`
        : '',
    ].filter(Boolean).join('\n\n');

    const assistantResponse = await this.rawRequest(system, this.messages);
    await this.addMessage('assistant', assistantResponse);
    return assistantResponse;
  }
}

// Usage:
const agent = new ContextAwareAgent(50000);
await agent.callClaude("What's your name?");
await agent.callClaude("Remember that. Now tell me a joke.");
// Agent can still reference earlier messages, even if they're pruned

Why This Works

  1. Lazy pruning — only summarize when necessary, saving API calls
  2. Preserved knowledge — summaries keep important facts and decisions
  3. Cost control — pruning prevents token explosion
  4. Transparent — you control what gets kept vs. summarized
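The lazy-pruning decision (point 1) is worth isolating into a pure function, which makes it unit-testable without touching the API. A minimal sketch reusing the 4-chars-per-token estimate from the class above (`planPrune` and `keepCount` are illustrative names, not part of any SDK):

```javascript
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Decide whether pruning is needed, and if so, which messages to summarize.
// Returns null while the budget still has room -- no API call is spent.
function planPrune(messages, summaries, maxTokens, keepCount = 5) {
  const used =
    messages.reduce((sum, m) => sum + estimateTokens(m.content), 0) +
    summaries.reduce((sum, s) => sum + estimateTokens(s), 0);
  if (used < maxTokens) return null; // lazy: nothing to do yet
  return {
    toPrune: messages.slice(0, -keepCount), // oldest messages, to summarize
    toKeep: messages.slice(-keepCount),     // recent messages, kept verbatim
  };
}

const msgs = Array.from({ length: 8 }, (_, i) => ({
  role: i % 2 ? 'assistant' : 'user',
  content: 'x'.repeat(400), // ~100 tokens each, 800 total
}));
console.log(planPrune(msgs, [], 1000));               // null: under budget
console.log(planPrune(msgs, [], 500).toPrune.length); // 3
```

Keeping this logic pure means you can tune `keepCount` and the budget against recorded conversations before any summarization call is ever made.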

Production Tips

  • Use real token counts. Character estimates drift; prefer Anthropic's token-counting API (or a tokenizer library such as js-tiktoken as an approximation) over length / 4.
  • Watch your summaries. If summaries get too long, you're not being aggressive enough with pruning.
  • Version your summaries. If context quality degrades, you can inspect what got summarized.
  • Test edge cases. A 100-turn conversation behaves very differently from a 10-turn one.
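The "watch your summaries" tip can itself be automated: when the summary list outgrows its own budget, fold it into a single meta-summary. A sketch of that idea (the `resummarize` callback is a stand-in for a real Claude call; names and thresholds are illustrative, not part of the class above):

```javascript
const estimateTokens = (text) => Math.ceil(text.length / 4);

// If accumulated summaries exceed their own token budget, merge them into
// one meta-summary via the caller-supplied `resummarize` function.
async function compactSummaries(summaries, budget, resummarize) {
  const used = summaries.reduce((sum, s) => sum + estimateTokens(s), 0);
  if (used <= budget) return summaries; // still within budget, untouched
  const merged = await resummarize(summaries.join('\n\n'));
  return [merged];
}

// Example with a stand-in resummarize (in production, call Claude instead):
const fakeSummarizer = async (text) => text.slice(0, 40);
compactSummaries(['a'.repeat(400), 'b'.repeat(400)], 100, fakeSummarizer)
  .then((out) => console.log(out.length)); // 1
```

Run this check right after each `pruneOldMessages` pass so summary bloat never silently eats the budget you just freed up.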

Tools That Help

If you're building this at scale, consider:

  • LangChain's buffer/summary chains — built-in context management
  • Claude's system prompt — use it to define context constraints upfront
  • MCP servers — offload context-heavy operations to external tools
  • Claude Code — use it to prototype and test context strategies quickly

The Bottom Line

Context management isn't glamorous, but it's the difference between a toy agent and a production one. Implement this early, monitor it, and iterate. Your future self (and your users) will thank you.


Ready to build smarter agents? The AI Agent Workspace Kit includes production-ready templates with context management baked in. Or start lightweight with npx @webbywisp/create-ai-agent and add these patterns as you scale.
