How to implement Claude conversation history without storing everything (token-efficient pattern)
One of the most common mistakes developers make when building Claude-powered apps is storing the full conversation history and sending all of it with every request. That works fine for a few messages; then your token costs explode.
Here's the pattern I use to keep conversation history functional while keeping token spend under control.
The problem
Claude's API is stateless. Every request must include the full conversation so Claude has context. Naive implementation:
```javascript
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const messages = [];

async function chat(userMessage) {
  messages.push({ role: 'user', content: userMessage });

  const response = await anthropic.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages: messages // sending EVERYTHING every time
  });

  messages.push({ role: 'assistant', content: response.content[0].text });
  return response.content[0].text;
}
```
After 20 exchanges, every request carries roughly 40 messages' worth of input tokens, and because the whole history is resent each turn, the cumulative input-token total grows quadratically with conversation length. A 30-minute conversation can cost far more than you'd expect.
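A back-of-envelope sketch makes the growth concrete. The ~50 tokens per message figure below is a hypothetical average, not a measurement:

```javascript
// Sketch: cumulative input tokens when the full history is resent every
// turn, assuming (hypothetically) ~50 tokens per message on average.
const TOKENS_PER_MESSAGE = 50;

function cumulativeInputTokens(turns) {
  let total = 0;
  for (let i = 1; i <= turns; i++) {
    // Turn i sends every message so far: with one user + one assistant
    // message per completed turn, that's 2 * i - 1 messages.
    total += (2 * i - 1) * TOKENS_PER_MESSAGE;
  }
  return total;
}

console.log(cumulativeInputTokens(5));  // 1250
console.log(cumulativeInputTokens(20)); // 20000
```

Four times the turns costs sixteen times the input tokens — that's the quadratic curve the patterns below are designed to flatten.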
The sliding window pattern
Keep only the last N exchanges in the active window:
```javascript
const WINDOW_SIZE = 10; // keep last 10 messages (5 exchanges)
const allMessages = [];

async function chat(userMessage) {
  allMessages.push({ role: 'user', content: userMessage });

  // Only send the recent window
  const window = allMessages.slice(-WINDOW_SIZE);
  // The Messages API requires the first message to use the user role,
  // so drop a leading assistant turn if the slice landed on one
  if (window[0]?.role === 'assistant') window.shift();

  const response = await anthropic.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    system: 'You are a helpful assistant. You may not have full conversation history.',
    messages: window
  });

  const assistantMessage = response.content[0].text;
  allMessages.push({ role: 'assistant', content: assistantMessage });
  return assistantMessage;
}
```
This caps your per-request token cost regardless of conversation length.
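A variant worth considering: trim by estimated tokens rather than message count, so one very long message can't blow the budget. This is a sketch — the chars/4 heuristic is a rough approximation for English text, not Claude's actual tokenizer, and `trimToBudget` is a name of my own:

```javascript
// Sketch: trim history to a token budget instead of a fixed message count.
// chars / 4 is a rough English-text heuristic, NOT the real tokenizer.
const TOKEN_BUDGET = 2000;

const estimateTokens = (msg) => Math.ceil(msg.content.length / 4);

function trimToBudget(messages, budget = TOKEN_BUDGET) {
  const window = [];
  let used = 0;
  // Walk backwards from the newest message, keeping whatever fits
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i]);
    if (used + cost > budget) break;
    window.unshift(messages[i]);
    used += cost;
  }
  // The Messages API expects the conversation to start with a user turn
  while (window.length && window[0].role !== 'user') window.shift();
  return window;
}
```

Use it in place of the `slice(-WINDOW_SIZE)` call above; the rest of the loop stays the same.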
The summarization pattern (better for long sessions)
Instead of dropping history, periodically summarize it:
```javascript
const SUMMARIZE_THRESHOLD = 20; // summarize after 20 messages
let conversationSummary = '';
let recentMessages = [];

async function summarizeHistory(messages, previousSummary = '') {
  const historyText = messages
    .map(m => `${m.role}: ${m.content}`)
    .join('\n');

  // Fold any earlier summary in, so context isn't lost on the second
  // and later summarization passes
  const prompt = previousSummary
    ? `Earlier conversation summary: ${previousSummary}\n\nUpdate that summary to cover the conversation below, in 2-3 sentences, preserving key facts and context:\n\n${historyText}`
    : `Summarize this conversation in 2-3 sentences, preserving key facts and context:\n\n${historyText}`;

  const response = await anthropic.messages.create({
    model: 'claude-haiku-4-5', // use a cheaper model for summarization
    max_tokens: 500,
    messages: [{ role: 'user', content: prompt }]
  });

  return response.content[0].text;
}
```
```javascript
async function chat(userMessage) {
  recentMessages.push({ role: 'user', content: userMessage });

  // Summarize when the threshold is hit
  if (recentMessages.length >= SUMMARIZE_THRESHOLD) {
    conversationSummary = await summarizeHistory(recentMessages, conversationSummary);
    recentMessages = recentMessages.slice(-4); // keep last 2 exchanges
    // The Messages API requires the first message to use the user role
    if (recentMessages[0]?.role === 'assistant') recentMessages.shift();
  }

  // Build messages with the summary as context
  const systemPrompt = conversationSummary
    ? `You are a helpful assistant. Earlier conversation summary: ${conversationSummary}`
    : 'You are a helpful assistant.';

  const response = await anthropic.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    system: systemPrompt,
    messages: recentMessages
  });

  const assistantMessage = response.content[0].text;
  recentMessages.push({ role: 'assistant', content: assistantMessage });
  return assistantMessage;
}
```
Tracking token usage
Claude's API returns a usage object with exact input and output token counts in every response (there's also a count-tokens endpoint if you need a number before sending). Track it:
```javascript
async function chatWithTracking(userMessage) {
  // ... build messages ...
  const response = await anthropic.messages.create({
    model: 'claude-opus-4-5',
    max_tokens: 1024,
    messages: messages
  });

  console.log(`Input tokens: ${response.usage.input_tokens}`);
  console.log(`Output tokens: ${response.usage.output_tokens}`);
  console.log(`Total: ${response.usage.input_tokens + response.usage.output_tokens}`);

  return response.content[0].text;
}
```
Add this during development and you'll immediately see which conversations are getting expensive.
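To turn those counts into dollars, a small accumulator helps. The per-million-token rates below are placeholders, not Anthropic's actual pricing — check the current pricing page before relying on the numbers:

```javascript
// Sketch: accumulate token usage into an estimated dollar cost.
// PRICES are illustrative placeholders, NOT current Anthropic pricing.
const PRICES = {
  inputPerMillion: 5.0,   // $ per 1M input tokens (placeholder)
  outputPerMillion: 25.0, // $ per 1M output tokens (placeholder)
};

class CostTracker {
  constructor() {
    this.inputTokens = 0;
    this.outputTokens = 0;
  }

  // Call with the `usage` object from each API response
  record(usage) {
    this.inputTokens += usage.input_tokens;
    this.outputTokens += usage.output_tokens;
  }

  estimatedCost() {
    return (
      (this.inputTokens / 1e6) * PRICES.inputPerMillion +
      (this.outputTokens / 1e6) * PRICES.outputPerMillion
    );
  }
}

const tracker = new CostTracker();
tracker.record({ input_tokens: 200000, output_tokens: 40000 });
console.log(tracker.estimatedCost().toFixed(2)); // "2.00" with the placeholder rates
```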
The flat-rate escape hatch
All of this complexity exists because per-token billing creates anxiety. You implement rate limiters, sliding windows, summarization pipelines — all to avoid a surprise bill.
I got tired of the engineering overhead and switched to SimplyLouie — a flat-rate Claude wrapper at $2/month. No token counting, no sliding windows, no summarization logic. I just send messages.
For production apps where you need fine-grained control, build the patterns above. For personal projects and prototyping, flat-rate removes the entire problem.
Which pattern should you use?
| Use case | Pattern |
|---|---|
| Short sessions (<10 messages) | Full history (naive approach is fine) |
| Medium sessions (10-30 messages) | Sliding window |
| Long sessions (30+ messages) | Summarization |
| Prototyping / personal use | Flat-rate wrapper |
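The table above boils down to a simple selector. The thresholds are the ones from the table; the function name is mine:

```javascript
// Sketch: choose a history pattern from expected conversation length.
// Thresholds follow the table above; the names are illustrative.
function pickPattern(expectedMessages, prototyping = false) {
  if (prototyping) return 'flat-rate';
  if (expectedMessages < 10) return 'full-history';
  if (expectedMessages <= 30) return 'sliding-window';
  return 'summarization';
}

console.log(pickPattern(5));  // "full-history"
console.log(pickPattern(50)); // "summarization"
```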
The summarization pattern is the most sophisticated but adds latency and cost for the summarization calls themselves. For most apps, a sliding window of 10-20 messages hits the sweet spot.
What pattern are you using in production? Drop it in the comments — especially curious if anyone's doing something clever with embeddings for semantic memory.
Building Claude apps without the per-token anxiety? SimplyLouie is $2/month flat. 7-day free trial, no commitment.