How to build a Claude AI context manager that never hits token limits
If you've ever built a Claude chatbot, you've hit this wall: conversation history grows until you exceed the context window and the API throws an error.
Most tutorials ignore this problem. Production apps can't.
Here's a complete context manager that keeps conversations within token limits — automatically trimming old messages while preserving the system prompt and recent context.
The problem
Claude Haiku has a 200K token context window. That sounds massive, but a long conversation with detailed responses can fill it faster than you'd expect. When it fills:
AnthropicError: prompt is too long: 201847 tokens > 200000 maximum
Your app crashes. The user loses their conversation. Bad.
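To put a number on "faster than you'd expect": assume ~100-token questions and ~5,000-token answers (illustrative averages for a code-heavy chat, not measured values) and do the back-of-envelope math:

```javascript
// Back-of-envelope: how many exchanges fill a 200K window?
// The per-message averages below are illustrative assumptions.
const windowTokens = 200000;
const avgQuestionTokens = 100;
const avgAnswerTokens = 5000; // answers with pasted code blocks add up fast
const tokensPerExchange = avgQuestionTokens + avgAnswerTokens;
const exchanges = Math.floor(windowTokens / tokensPerExchange);
console.log(exchanges); // 39 — a single long debugging session can get there
```

Thirty-nine exchanges sounds like a lot until your users start pasting stack traces and whole files.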
The solution: sliding window context management
const Anthropic = require('@anthropic-ai/sdk');

const client = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

// Rough token estimator (Claude averages ~4 characters per token for English text)
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

function estimateMessagesTokens(messages) {
  return messages.reduce((total, msg) => {
    return total + estimateTokens(msg.content) + 4; // ~4 tokens of overhead per message
  }, 0);
}

class ContextManager {
  constructor(options = {}) {
    this.maxTokens = options.maxTokens || 180000; // leave a 20K buffer for the response
    this.systemPrompt = options.systemPrompt || 'You are a helpful assistant.';
    this.messages = [];
    this.trimCount = 0;
  }

  addMessage(role, content) {
    this.messages.push({ role, content });
    this.trim();
  }

  trim() {
    const systemTokens = estimateTokens(this.systemPrompt);
    const available = this.maxTokens - systemTokens;

    // Walk backwards from the newest message, keeping as many as fit
    let keepFrom = this.messages.length;
    let totalTokens = 0;
    for (let i = this.messages.length - 1; i >= 0; i--) {
      const msgTokens = estimateTokens(this.messages[i].content) + 4;
      if (totalTokens + msgTokens > available) {
        keepFrom = i + 1;
        break;
      }
      totalTokens += msgTokens;
      keepFrom = i;
    }

    // Edge case: if even the newest message alone exceeds the budget, keep it
    // anyway — sending an empty messages array is also an API error, and this
    // way the failure is an explicit "prompt is too long"
    if (keepFrom >= this.messages.length && this.messages.length > 0) {
      keepFrom = this.messages.length - 1;
    }

    if (keepFrom > 0) {
      this.messages = this.messages.slice(keepFrom);
      this.trimCount += keepFrom;
      console.log(`Trimmed ${keepFrom} messages (${this.trimCount} total trimmed)`);
    }
  }

  async chat(userMessage) {
    this.addMessage('user', userMessage);
    const response = await client.messages.create({
      model: 'claude-haiku-4-5',
      max_tokens: 1024,
      system: this.systemPrompt,
      messages: this.messages
    });
    const assistantMessage = response.content[0].text;
    this.addMessage('assistant', assistantMessage);
    return assistantMessage;
  }

  getStats() {
    return {
      messageCount: this.messages.length,
      estimatedTokens: estimateMessagesTokens(this.messages),
      totalTrimmed: this.trimCount
    };
  }
}

// Usage
async function main() {
  const ctx = new ContextManager({
    maxTokens: 180000,
    systemPrompt: 'You are a helpful coding assistant. Be concise.'
  });

  // Simulate a long conversation
  const questions = [
    'What is a closure in JavaScript?',
    'Can you show me an example with a counter?',
    'How does this relate to the module pattern?',
    'What about ES6 classes vs closures?'
  ];

  for (const q of questions) {
    console.log(`\nUser: ${q}`);
    const answer = await ctx.chat(q);
    console.log(`Claude: ${answer.substring(0, 100)}...`);
    console.log('Stats:', ctx.getStats());
  }
}

main().catch(console.error);
Run it
npm install @anthropic-ai/sdk
export ANTHROPIC_API_KEY=your_key_here
node context-manager.js
Upgrading to accurate token counting
The length / 4 estimate works for most cases, but if you need precision, use Claude's token counting API:
async function countTokensAccurately(messages, systemPrompt) {
  const response = await client.messages.countTokens({
    model: 'claude-haiku-4-5',
    system: systemPrompt,
    messages: messages
  });
  return response.input_tokens;
}
// Use in trim() for exact counts — but this costs an API call per trim
// Only worth it for high-stakes production apps
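There's a middle ground worth sketching: trust the cheap local estimate while you're far from the limit, and only pay for a countTokens call once the estimate gets close. The 90% threshold and the function names `needsAccurateCount` and `countTokensHybrid` are my own choices, not anything from the SDK:

```javascript
// Hybrid counting sketch: local estimate first, API count only near the limit.
// Same chars/4 estimator as the article; 0.9 threshold is an arbitrary choice.
function estimateMessagesTokens(messages) {
  return messages.reduce((t, m) => t + Math.ceil(m.content.length / 4) + 4, 0);
}

function needsAccurateCount(estimatedTokens, maxTokens, threshold = 0.9) {
  return estimatedTokens >= maxTokens * threshold;
}

async function countTokensHybrid(client, messages, systemPrompt, maxTokens) {
  const estimated = estimateMessagesTokens(messages);
  if (!needsAccurateCount(estimated, maxTokens)) {
    return estimated; // far from the limit: the rough estimate is good enough
  }
  // Close to the limit: spend one API call on an exact count
  const response = await client.messages.countTokens({
    model: 'claude-haiku-4-5',
    system: systemPrompt,
    messages
  });
  return response.input_tokens;
}
```

Most turns in a short conversation never cross the threshold, so the extra API traffic stays near zero.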
The gotcha: trim after adding the user message, before sending it
A common mistake is trimming only the existing history and then appending the new user message on top. If that message itself is huge (pasted code, a long document), it pushes you over the limit after your check has already passed.
The implementation above trims in addMessage() — so the check happens automatically whenever a message is added, including user messages.
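A related guard worth adding (my own addition, not part of the class above): reject a single message that could never fit on its own, instead of letting trim() throw away the entire history around it.

```javascript
// Guard sketch: refuse messages that exceed the whole budget by themselves.
// Uses the same chars/4 estimator as the rest of the article.
function fitsInBudget(content, maxTokens, systemPrompt = '') {
  const messageTokens = Math.ceil(content.length / 4) + 4;
  const available = maxTokens - Math.ceil(systemPrompt.length / 4);
  return messageTokens <= available;
}

// Before ctx.addMessage('user', input), check fitsInBudget(input, 180000, systemPrompt)
// and ask the user to shorten the message (or upload it another way) if it fails.
```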
Production pattern: summarize instead of truncate
For long-running conversations where history matters, consider summarizing old messages instead of discarding them:
async function summarizeOldMessages(messages) {
  const summary = await client.messages.create({
    model: 'claude-haiku-4-5',
    max_tokens: 256,
    messages: [{
      role: 'user',
      content: `Summarize this conversation history in 2-3 sentences, preserving key facts:\n\n${
        messages.map(m => `${m.role}: ${m.content}`).join('\n')
      }`
    }]
  });

  return [{
    role: 'user',
    content: `[Earlier conversation summary: ${summary.content[0].text}]`
  }, {
    role: 'assistant',
    content: 'Understood. I have context from our earlier conversation.'
  }];
}
Replace the discarded messages with the summary pair. Users get continuity; your token budget stays clean.
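Concretely, the wiring might look like this. It's a sketch: `splitForSummary` and `trimWithSummary` are names I made up, and the summarizer is passed in as a parameter so you can plug in the `summarizeOldMessages` function above.

```javascript
// Sketch of swapping truncation for summarization. splitForSummary is pure;
// the summarizer parameter is expected to return a [user, assistant] pair.
function splitForSummary(messages, keepFrom) {
  return {
    toSummarize: messages.slice(0, keepFrom), // oldest messages, to be condensed
    toKeep: messages.slice(keepFrom)          // recent messages, kept verbatim
  };
}

async function trimWithSummary(messages, keepFrom, summarize) {
  const { toSummarize, toKeep } = splitForSummary(messages, keepFrom);
  if (toSummarize.length === 0) return toKeep; // nothing to discard
  const summaryPair = await summarize(toSummarize);
  return [...summaryPair, ...toKeep]; // summary pair stands in for the old history
}
```

Inside the ContextManager, you'd call this where trim() currently slices, using the same keepFrom index it already computes.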
What does this cost?
If you're building this on top of the raw Anthropic API, Haiku's low per-token rates mean an app handling 100 conversations a day can often stay in the single-digit dollars per month — though the exact figure depends heavily on how long those conversations run.
If you want to skip the infrastructure and just use a managed Claude API endpoint, SimplyLouie offers a flat $2/month developer tier — no per-token billing, just a fixed monthly cost. Good for prototypes and low-to-medium traffic apps.
Discussion
How do you handle context limits in your production Claude apps? Sliding window, summarization, or something else? I'm curious what patterns people have found work best at scale.