brian austin

How to build a Claude AI context manager that never hits token limits

If you've ever built a Claude chatbot, you've hit this wall: conversation history grows until you exceed the context window and the API throws an error.

Most tutorials ignore this problem. Production apps can't.

Here's a complete context manager that keeps conversations within token limits — automatically trimming old messages while preserving the system prompt and recent context.

The problem

Claude Haiku has a 200K token context window. That sounds massive, but a long conversation with detailed responses can fill it faster than you'd expect. When it fills:

AnthropicError: prompt is too long: 201847 tokens > 200000 maximum

Your app crashes. The user loses their conversation. Bad.

The solution: sliding window context management

const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

// Rough token estimator (Claude uses ~4 chars per token on average)
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function estimateMessagesTokens(messages) {
  return messages.reduce((total, msg) => {
    return total + estimateTokens(msg.content) + 4; // 4 tokens overhead per message
  }, 0);
}

class ContextManager {
  constructor(options = {}) {
    this.maxTokens = options.maxTokens || 180000; // Leave 20K buffer for response
    this.systemPrompt = options.systemPrompt || 'You are a helpful assistant.';
    this.messages = [];
    this.trimCount = 0;
  }

  addMessage(role, content) {
    this.messages.push({ role, content });
    this.trim();
  }

  trim() {
    const systemTokens = estimateTokens(this.systemPrompt);
    let available = this.maxTokens - systemTokens;

    // Count from the end (keep most recent messages)
    let keepFrom = this.messages.length;
    let totalTokens = 0;

    for (let i = this.messages.length - 1; i >= 0; i--) {
      const msgTokens = estimateTokens(this.messages[i].content) + 4;
      // Always keep the newest message, even if it alone exceeds the budget —
      // sending an empty messages array to the API would fail anyway.
      if (i < this.messages.length - 1 && totalTokens + msgTokens > available) {
        keepFrom = i + 1;
        break;
      }
      totalTokens += msgTokens;
      keepFrom = i;
    }

    if (keepFrom > 0) {
      const trimmed = keepFrom;
      this.messages = this.messages.slice(keepFrom);
      this.trimCount += trimmed;
      console.log(`Trimmed ${trimmed} messages (${this.trimCount} total trimmed)`);
    }
  }

  async chat(userMessage) {
    this.addMessage('user', userMessage);

    const response = await client.messages.create({
      model: 'claude-haiku-4-5',
      max_tokens: 1024,
      system: this.systemPrompt,
      messages: this.messages
    });

    const assistantMessage = response.content[0].text;
    this.addMessage('assistant', assistantMessage);

    return assistantMessage;
  }

  getStats() {
    return {
      messageCount: this.messages.length,
      estimatedTokens: estimateMessagesTokens(this.messages),
      totalTrimmed: this.trimCount
    };
  }
}

// Usage
async function main() {
  const ctx = new ContextManager({
    maxTokens: 180000,
    systemPrompt: 'You are a helpful coding assistant. Be concise.'
  });

  // Simulate a long conversation
  const questions = [
    'What is a closure in JavaScript?',
    'Can you show me an example with a counter?',
    'How does this relate to the module pattern?',
    'What about ES6 classes vs closures?'
  ];

  for (const q of questions) {
    console.log(`\nUser: ${q}`);
    const answer = await ctx.chat(q);
    console.log(`Claude: ${answer.substring(0, 100)}...`);
    console.log('Stats:', ctx.getStats());
  }
}

main().catch(console.error);

Run it

npm install @anthropic-ai/sdk
export ANTHROPIC_API_KEY=your_key_here
node context-manager.js
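The ~4-characters-per-token heuristic is easy to sanity-check locally before trusting it. A rough sketch — real tokenization varies with content (code and non-English text can differ a lot), so treat it as an estimate, not a guarantee:

```javascript
// Same heuristic the ContextManager uses: ~4 characters per token.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

const question = 'What is a closure in JavaScript?'; // 32 chars
const pastedDoc = 'x'.repeat(4000);                  // 4000 chars

console.log(estimateTokens(question));  // 8
console.log(estimateTokens(pastedDoc)); // 1000
```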

Upgrading to accurate token counting

The length / 4 estimate works for most cases, but if you need precision, use Claude's token counting API:

async function countTokensAccurately(messages, systemPrompt) {
  const response = await client.messages.countTokens({
    model: 'claude-haiku-4-5',
    system: systemPrompt,
    messages: messages
  });
  return response.input_tokens;
}

// Use in trim() for exact counts — but this costs an API call per trim
// Only worth it for high-stakes production apps

The gotcha: trim after adding the user message, before calling the API

A common mistake is trimming only after the assistant replies. If the user message itself is huge (pasted code, a long document), it can push you over the limit before you ever call the API — so the trim has to account for it first.

The implementation above trims inside addMessage(), so the check runs automatically whenever any message is added, including user messages.
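You can exercise this offline, without an API key, by driving addMessage() directly. Below is a standalone sketch mirroring the estimator and trim bookkeeping from above; the tiny 410-token budget is chosen purely so the trim fires in a four-line demo:

```javascript
// Rough token estimator, same heuristic as above.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Minimal stand-in that mirrors ContextManager's bookkeeping — no API calls.
class TrimOnlyContext {
  constructor({ maxTokens, systemPrompt }) {
    this.maxTokens = maxTokens;
    this.systemPrompt = systemPrompt;
    this.messages = [];
  }

  addMessage(role, content) {
    this.messages.push({ role, content });
    this.trim();
  }

  trim() {
    const available = this.maxTokens - estimateTokens(this.systemPrompt);
    let keepFrom = this.messages.length;
    let totalTokens = 0;
    for (let i = this.messages.length - 1; i >= 0; i--) {
      const msgTokens = estimateTokens(this.messages[i].content) + 4;
      // Never trim the newest message, even if it is oversized.
      if (i < this.messages.length - 1 && totalTokens + msgTokens > available) {
        keepFrom = i + 1;
        break;
      }
      totalTokens += msgTokens;
      keepFrom = i;
    }
    this.messages = this.messages.slice(keepFrom);
  }
}

const ctx = new TrimOnlyContext({ maxTokens: 410, systemPrompt: 'Be concise.' });
ctx.addMessage('user', 'short question');
ctx.addMessage('assistant', 'short answer');

// A huge "pasted document": 1600 chars ≈ 404 tokens, nearly the whole budget.
ctx.addMessage('user', 'x'.repeat(1600));

// Both earlier messages were evicted the moment the big message arrived.
console.log(ctx.messages.length); // 1
```

The oversized user message is kept (it is the one being sent next); everything older is dropped immediately, before any API call could overflow.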

Production pattern: summarize instead of truncate

For long-running conversations where history matters, consider summarizing old messages instead of discarding them:

async function summarizeOldMessages(messages) {
  const summary = await client.messages.create({
    model: 'claude-haiku-4-5',
    max_tokens: 256,
    messages: [{
      role: 'user',
      content: `Summarize this conversation history in 2-3 sentences, preserving key facts:\n\n${
        messages.map(m => `${m.role}: ${m.content}`).join('\n')
      }`
    }]
  });

  return [{
    role: 'user',
    content: `[Earlier conversation summary: ${summary.content[0].text}]`
  }, {
    role: 'assistant',
    content: 'Understood. I have context from our earlier conversation.'
  }];
}

Replace the discarded messages with the summary pair. Users get continuity; your token budget stays clean.
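The splice itself is pure bookkeeping, so it can be sketched (and tested) without an API call. Here summaryText stands in for whatever the summarization call returns; spliceInSummary is just this post's helper name, not an SDK API:

```javascript
// Replace messages[0..keepFrom) with the two-message summary pair,
// keeping everything from keepFrom onward intact.
function spliceInSummary(messages, keepFrom, summaryText) {
  const summaryPair = [
    {
      role: 'user',
      content: `[Earlier conversation summary: ${summaryText}]`
    },
    {
      role: 'assistant',
      content: 'Understood. I have context from our earlier conversation.'
    }
  ];
  return [...summaryPair, ...messages.slice(keepFrom)];
}

const history = [
  { role: 'user', content: 'old question' },
  { role: 'assistant', content: 'old answer' },
  { role: 'user', content: 'recent question' },
  { role: 'assistant', content: 'recent answer' }
];

const compacted = spliceInSummary(history, 2, 'User asked about closures; examples were given.');
console.log(compacted.length); // 4: summary pair + 2 recent messages
```

Starting the pair with a user message keeps the alternating user/assistant shape the Messages API expects.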

What does this cost?

If you're building this on top of the raw Anthropic API, even a busy app with 100 conversations/day stays well under $5/month with Haiku pricing.

If you want to skip the infrastructure and just use a managed Claude API endpoint, SimplyLouie offers a flat $2/month developer tier — no per-token billing, just a fixed monthly cost. Good for prototypes and low-to-medium traffic apps.

Discussion

How do you handle context limits in your production Claude apps? Sliding window, summarization, or something else? I'm curious what patterns people have found work best at scale.

#claude #ai #nodejs #tutorial