MCP Session Context: Why I Added Full Conversation History to My MCP Knowledge Server and What It Changed
Let me be honest with you — I've been building MCP servers for about 3 months now. I've learned a ton, written 78 articles about every aspect of MCP development, and I thought I had the basic architecture figured out.
So here's the thing: I always followed the standard pattern. My MCP knowledge server exposes some tools, the AI client calls the tools when it needs context from my notes, gets the relevant snippets, and that's it. Session context? That's the client's job, right? I'm just the tool provider.
Turns out I was dead wrong. And the mistake cost me about a week of debugging weird prompts where the AI kept forgetting what it already knew about my notes. Let me walk you through what happened, what I changed, and why you might want to do the same.
The Problem: AI Kept Re-reading the Same Notes
I was using my MCP knowledge base with Claude Desktop to help me work through a complex refactoring. I'd already asked it to pull three different articles from my notes about MCP error handling, authentication, and deployment. It had all the context.
Then I asked a follow-up question: "Given what we just read, can you help me design a better health check endpoint for my server?"
And what did it do? It called the search_knowledge tool again for "health check endpoint" and re-read the same notes it already had. Worse — it only pulled what matched that specific query, not the bigger picture we'd already built in the conversation.
Honestly, I get why. The client doesn't know what the tool already returned. Every tool call is stateless by default. But from my perspective as a user, it felt stupid. We already had this conversation. Why are we starting over?
The bigger issue was context window waste. I pay for context tokens. Reading the same snippet twice is just throwing money away. And sometimes the second search didn't even get the same context — it'd get different fragments, and the AI would lose the thread.
The "Duh" Moment: Why Not Include the Session Context in the Server?
I was sitting there drinking coffee and staring at this problem when it hit me. The MCP server knows which tools it has been asked to call in this session, so it can keep a simple session memory. The client doesn't have to track everything; the server can cache the history of its own tool outputs.
Wait, but shouldn't the client handle that? Let me tell you what I've changed my mind about:
- Clients don't know what's important — The server knows which tool outputs are knowledge that should persist for the whole session. The client just sees a blob of text.
- Different sessions, different contexts — If I'm working on two different conversations with the same server, they need two different context caches.
- Server can deduplicate automatically — If the AI asks the same question twice, the server can just return the cached result instantly, no need to hit the database again.
- It's not that hard — We're not talking about building ChatGPT here. We're just talking about storing an array of the last N tool responses in memory.
So I decided to build it. Let me show you the code.
The Implementation: Simple Session Cache in Java
Here's what I ended up with. It's really simple — like 80 lines of code total. That's the beauty of it.
First, the session record to hold the context:
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public record McpSession(
        String sessionId,
        Instant createdAt,
        List<SessionContextEntry> contextEntries
) {
    public McpSession(String sessionId) {
        this(sessionId, Instant.now(), new ArrayList<>());
    }

    public void addEntry(String toolName, String output, int maxTokens) {
        // Add first, then trim, so the total stays within maxTokens
        // after the new entry has been counted
        contextEntries.add(new SessionContextEntry(toolName, output, Instant.now()));
        // Trim if we get too big - drop the oldest entries, always keep the newest one
        int currentTokens = estimateTokens(contextEntries);
        while (currentTokens > maxTokens && contextEntries.size() > 1) {
            // remove(0) returns a single entry, so wrap it for estimateTokens
            currentTokens -= estimateTokens(List.of(contextEntries.remove(0)));
        }
    }

    public String getFullContext() {
        StringBuilder sb = new StringBuilder();
        sb.append("# Previous conversation context from knowledge base:\n\n");
        for (int i = 0; i < contextEntries.size(); i++) {
            SessionContextEntry entry = contextEntries.get(i);
            sb.append("## [")
                    .append(i + 1)
                    .append("] Tool: ")
                    .append(entry.toolName())
                    .append("\n\n")
                    .append(entry.output())
                    .append("\n\n---\n\n");
        }
        return sb.toString();
    }

    private int estimateTokens(List<SessionContextEntry> entries) {
        return entries.stream()
                .mapToInt(e -> e.output().length() / 4) // rough estimate: 4 chars ~ 1 token
                .sum();
    }
}

record SessionContextEntry(String toolName, String output, Instant timestamp) {}
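To see the trimming behavior in isolation, here's a minimal standalone sketch. It is not the server's actual class: `TrimDemo`, `Entry`, and the tool names `first`/`second`/`third` are made up for illustration, and it assumes the same rough 4-characters-per-token estimate. It appends the new entry, then drops the oldest ones until the total fits the budget.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class TrimDemo {
    // Hypothetical entry type for this sketch only
    record Entry(String toolName, String output) {
        int tokens() { return output.length() / 4; } // rough estimate: 4 chars ~ 1 token
    }

    private final Deque<Entry> entries = new ArrayDeque<>();
    private final int maxTokens;
    private int totalTokens = 0;

    TrimDemo(int maxTokens) { this.maxTokens = maxTokens; }

    void add(String toolName, String output) {
        Entry entry = new Entry(toolName, output);
        entries.addLast(entry);
        totalTokens += entry.tokens();
        // FIFO trim: drop the oldest entries, but never the one just added
        while (totalTokens > maxTokens && entries.size() > 1) {
            totalTokens -= entries.removeFirst().tokens();
        }
    }

    int size() { return entries.size(); }
    int totalTokens() { return totalTokens; }
    String oldestTool() { return entries.peekFirst().toolName(); }

    public static void main(String[] args) {
        TrimDemo demo = new TrimDemo(100);
        demo.add("first", "a".repeat(200));  // ~50 tokens, total 50
        demo.add("second", "b".repeat(200)); // ~50 tokens, total 100
        demo.add("third", "c".repeat(120));  // ~30 tokens, total 130 -> "first" is dropped
        System.out.println(demo.size() + " entries, ~" + demo.totalTokens() + " tokens");
        // prints "2 entries, ~80 tokens"
    }
}
```

Keeping a running total avoids re-summing the whole list on every call, which is a small design choice you can skip for a handful of entries.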
Then a simple service to hold active sessions:
import org.springframework.stereotype.Service;
import java.time.Instant;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

@Service
public class McpSessionManager {
    private final Map<String, McpSession> sessions = new ConcurrentHashMap<>();
    private final int maxTokensPerSession = 4000; // adjust based on your context budget

    public McpSession getOrCreateSession(String sessionId) {
        // Some clients send session id in the meta field
        if (sessionId == null || sessionId.isEmpty()) {
            sessionId = UUID.randomUUID().toString();
        }
        return sessions.computeIfAbsent(sessionId, McpSession::new);
    }

    public void addContext(String sessionId, String toolName, String output) {
        McpSession session = getOrCreateSession(sessionId);
        session.addEntry(toolName, output, maxTokensPerSession);
    }

    public String getFullContextForPrompt(String sessionId) {
        return getOrCreateSession(sessionId).getFullContext();
    }

    public void clearSession(String sessionId) {
        sessions.remove(sessionId);
    }

    // Cleanup old sessions older than 24 hours - run this on a schedule
    public void cleanupOldSessions() {
        Instant oneDayAgo = Instant.now().minusSeconds(86400);
        sessions.entrySet().removeIf(entry ->
                entry.getValue().createdAt().isBefore(oneDayAgo));
    }
}
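The cleanup method above only helps if something actually calls it. Here's one way that could look, as a sketch using only the JDK's `ScheduledExecutorService`. `SessionJanitor` and its methods are hypothetical names, and the map of creation times is a stand-in for the real session map. Like the original, it evicts by creation time, so a session that is still in use after 24 hours gets dropped too. In a Spring app, annotating the cleanup method with `@Scheduled` (with scheduling enabled) would be the more usual route.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class SessionJanitor {
    // Hypothetical stand-in for the session map: session id -> creation time
    private final Map<String, Instant> createdAt = new ConcurrentHashMap<>();
    private final Duration maxAge;

    SessionJanitor(Duration maxAge) { this.maxAge = maxAge; }

    void register(String sessionId, Instant created) { createdAt.put(sessionId, created); }

    boolean isActive(String sessionId) { return createdAt.containsKey(sessionId); }

    // Remove every session created before (now - maxAge).
    // The returned count is approximate if other threads add sessions meanwhile.
    int evictExpired(Instant now) {
        Instant cutoff = now.minus(maxAge);
        int before = createdAt.size();
        createdAt.values().removeIf(created -> created.isBefore(cutoff));
        return before - createdAt.size();
    }

    // Run the eviction once an hour on a daemon thread so it never blocks shutdown
    ScheduledExecutorService startHourly() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "session-janitor");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> evictExpired(Instant.now()), 1, 1, TimeUnit.HOURS);
        return scheduler;
    }

    public static void main(String[] args) {
        SessionJanitor janitor = new SessionJanitor(Duration.ofHours(24));
        Instant now = Instant.now();
        janitor.register("old", now.minus(Duration.ofHours(25)));
        janitor.register("fresh", now.minus(Duration.ofHours(1)));
        System.out.println("evicted: " + janitor.evictExpired(now)); // prints "evicted: 1"
    }
}
```

Passing `now` into `evictExpired` instead of reading the clock inside it makes the eviction rule easy to test without waiting a day.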
Then we need to modify our tool call endpoint to include the cached context. Here's the relevant part of the controller:
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import com.fasterxml.jackson.databind.JsonNode;

@RestController
public class McpController {
    private final McpSessionManager sessionManager;
    private final KnowledgeSearchService searchService;

    public McpController(McpSessionManager sessionManager, KnowledgeSearchService searchService) {
        this.sessionManager = sessionManager;
        this.searchService = searchService;
    }

    @PostMapping("/mcp/call")
    public McpResponse callTool(@RequestBody McpCallRequest request) {
        // Extract session id from request meta - clients send this differently.
        // Resolve the session once: with a missing id, each manager call would
        // otherwise generate its own random session and the context would come back empty.
        String sessionId = sessionManager
                .getOrCreateSession(extractSessionId(request.getMeta()))
                .sessionId();

        // Get the search results normally
        SearchResult result = searchService.search(request.getParams());
        String resultText = result.toMarkdown();

        // Add this result to the session context cache
        sessionManager.addContext(sessionId, request.getName(), resultText);

        // Get the full accumulated context, which now includes the result just added
        String fullContext = sessionManager.getFullContextForPrompt(sessionId);

        // Return the combined context to the AI
        return McpResponse.success(fullContext);
    }

    private String extractSessionId(JsonNode meta) {
        if (meta == null || !meta.has("session_id")) {
            return null;
        }
        return meta.get("session_id").asText(null);
    }
}
Wait, that's really it? Yeah. That's basically all the code you need.
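One detail worth sketching is where the session id comes from, since clients send it differently. The example below is an assumption-heavy illustration, not part of the code above: `SessionIds`, `Resolved`, and `resolve` are hypothetical names. It assumes the id may arrive either as an HTTP header value (MCP's Streamable HTTP transport defines an `Mcp-Session-Id` header) or in the `session_id` meta field, and it reports whether the client supplied the id at all.

```java
import java.util.Optional;
import java.util.UUID;

class SessionIds {
    // The id to use, plus whether the client supplied it (hypothetical type)
    record Resolved(String id, boolean fromClient) {}

    // Prefer the header value, then the meta field, then fall back to a random id
    static Resolved resolve(String headerValue, String metaValue) {
        return firstNonBlank(headerValue, metaValue)
                .map(id -> new Resolved(id, true))
                .orElseGet(() -> new Resolved(UUID.randomUUID().toString(), false));
    }

    private static Optional<String> firstNonBlank(String... candidates) {
        for (String candidate : candidates) {
            if (candidate != null && !candidate.isBlank()) {
                return Optional.of(candidate.trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(resolve("abc-123", null).id());    // prints "abc-123"
        System.out.println(resolve(null, " notes-1 ").id());  // prints "notes-1"
        System.out.println(resolve(null, "").fromClient());   // prints "false"
    }
}
```

The `fromClient` flag is the useful part: a randomly generated id can never be sent back by the client, so a server could skip caching for those calls instead of filling the map with sessions nobody will read again.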
The Result: It Worked Better Than I Expected
I was expecting it to solve the "reading the same thing twice" problem. It did that — but it also solved a bunch of other problems I didn't even know I had.
What Got Better
No more duplicate token usage — Before, I'd burn 2-3k tokens re-reading the same notes in a long conversation. Now it's cached once, and that's it. I've cut my token usage for knowledge-heavy conversations by about 30%. That adds up.
AI keeps the thread much better — Because every response includes all the previous knowledge from the session, the AI can connect dots between different searches automatically. It remembers that we talked about authentication earlier when you ask it about deployment now.
Faster responses — When the AI asks the same question twice, we just return the cached version instantly. No need to run the search against the database again. Big win for cold starts.
It's optional — If the client doesn't send a session id, we just create a temporary one. If the client wants to handle context itself, it just sends a new session id every time. Doesn't break anything.
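The controller excerpt shown earlier runs the search on every call, so returning a cached answer for a repeated question needs a lookup in front of the search. Here's a minimal sketch of what that could look like. `QueryCache`, `getOrSearch`, and the `::` key separator are hypothetical, the search is passed in as a plain function, and it assumes that queries differing only in case or whitespace should count as the same question. Cached results will go stale if the notes change mid-session, so treat it as a starting point.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class QueryCache {
    private static final String SEP = "::"; // assumes ids and tool names don't contain "::"
    private final Map<String, String> results = new ConcurrentHashMap<>();
    private final AtomicInteger searchesRun = new AtomicInteger();

    // Returns the cached output for (session, tool, query), running the search only on a miss
    String getOrSearch(String sessionId, String toolName, String query,
                       Function<String, String> search) {
        String key = sessionId + SEP + toolName + SEP + normalize(query);
        return results.computeIfAbsent(key, k -> {
            searchesRun.incrementAndGet();
            return search.apply(query);
        });
    }

    // Drop every cached result that belongs to one session
    void clearSession(String sessionId) {
        results.keySet().removeIf(k -> k.startsWith(sessionId + SEP));
    }

    int searchesRun() { return searchesRun.get(); }

    // Lowercase, trim, and collapse runs of whitespace
    static String normalize(String query) {
        return query.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        QueryCache cache = new QueryCache();
        Function<String, String> fakeSearch = q -> "results for " + q; // stand-in for the real search
        cache.getOrSearch("s1", "search_knowledge", "health check endpoint", fakeSearch);
        cache.getOrSearch("s1", "search_knowledge", "  Health  Check endpoint ", fakeSearch); // cache hit
        cache.getOrSearch("s2", "search_knowledge", "health check endpoint", fakeSearch);     // other session
        System.out.println(cache.searchesRun()); // prints 2
    }
}
```

Keying by session keeps two conversations from sharing results, which matches the per-session isolation the rest of the design relies on.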
The Unexpected Downsides (Be Honest, Pros & Cons Right?)
Okay, it's not all perfect. Let me tell you what didn't work as well:
Pros:
- ✓ 30% token reduction in long conversations
- ✓ AI maintains context better across multiple tool calls
- ✓ Automatic deduplication of repeated queries
- ✓ Faster responses for cached queries
- ✓ Backward compatible - old clients still work
- ✓ Dead simple implementation (80 lines of code)
- ✓ Automatic cleanup of old sessions keeps memory usage in check
Cons:
- ✗ Uses some server memory for active sessions (but with 4k token limit per session and 24h cleanup, this is negligible for personal use — I'm talking a few MBs)
- ✗ If the client doesn't send consistent session id, you get a new cache every time (this is a client issue, not server issue — most MCP clients do send session info now)
- ✗ You have to think about your token budget: if you let it grow forever, you'll blow the context window (hence the trimming in addEntry)
- ✗ Doesn't help across different sessions (which is what you want: different conversations should have different contexts)
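For a rough sense of the memory cost, here's a back-of-the-envelope calculation. It rests on assumptions: the 4,000-token budget is actually enforced, 4 characters per token, up to 2 bytes per character for a Java string, object overhead ignored, and 100 live sessions picked as an arbitrary example. `MemoryEstimate` is a hypothetical helper.

```java
class MemoryEstimate {
    // Upper bound on cached text per session, ignoring object overhead
    static long bytesPerSession(int maxTokens, int charsPerToken, int bytesPerChar) {
        return (long) maxTokens * charsPerToken * bytesPerChar;
    }

    public static void main(String[] args) {
        long perSession = bytesPerSession(4000, 4, 2); // 4000 * 4 * 2 = 32,000 bytes
        long hundredSessions = perSession * 100;       // 100 sessions is an arbitrary example
        System.out.println(perSession + " bytes per session, "
                + hundredSessions / 1_000_000.0 + " MB for 100 sessions");
        // prints "32000 bytes per session, 3.2 MB for 100 sessions"
    }
}
```

Under those assumptions, even a hundred sessions stay in the low single-digit megabytes, which lines up with the "few MBs" figure.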
Honestly, for a personal knowledge server, the pros completely outweigh the cons. I've been running it for two weeks now, and I haven't looked back.
What I Learned the Hard Way
I made a couple of mistakes along the way that you can avoid:
Mistake 1: I tried to be too smart about what to keep
First I tried to do "smart" summarization of old context to fit more in. I had the AI summarize what we already talked about and keep the summary instead of the full text.
That was stupid. It lost details, and it cost tokens to do the summarization. Just keep the full most recent entries and trim from the front when you hit your token limit. Simpler is better. The AI doesn't need the oldest context that badly anyway.
Mistake 2: I forgot about trimming
I didn't add the while loop that removes old entries at first. After a really long conversation with 20+ tool calls, I hit 10k tokens in the context and the client started complaining about context overflow. Oops. Always have a max limit and always trim.
Mistake 3: I didn't expose a clear tool to clear context
Sometimes you want to start over in the same conversation. Like "okay, we're done with that topic, let's talk about something completely different." I added a simple clear_session tool that the AI can call when it wants a fresh cache. Super simple, super useful.
Here's that tool if you want it:
// Return type must be String, since the tool reports a confirmation message
public String clearSession(String sessionId) {
    sessionManager.clearSession(sessionId);
    return "Session context cleared successfully. Starting fresh with next tool calls.";
}
When Should You Do This?
This pattern isn't for every MCP server. Let me be clear:
Do this if:
- You're building a knowledge server / search server that gets called multiple times in the same conversation
- You want to reduce token usage
- You want the AI to have better accumulated context
- It's a personal / low-traffic server
Don't bother if:
- Your server is a single-purpose tool that gets called once per task (like "convert this image" or "send this email")
- You're building a public high-traffic server where memory usage matters at scale
- Your tool outputs are huge and don't need to be kept around for the whole conversation
For my use case — personal knowledge base that I'm querying all the time in long conversations with AI — this was a game-changer. I can't believe I didn't think of it earlier.
Try It Yourself
The code I showed you is basically what's running in production on my Papers knowledge base right now. It's simple, it works, and it didn't require changing any of the rest of my code.
If you've built an MCP server that gets called multiple times in conversations, give it a shot. Add a little session context cache. I bet you'll notice the difference immediately.
The full project is open source if you want to see the complete implementation: https://github.com/kevinten10/Papers
What's Your Take?
I'm still experimenting with this. Some questions I'm still thinking about:
- Should the server ever summarize older context to fit more in, or is simple FIFO trimming better?
- Do your MCP clients consistently send session IDs, or is that still hit or miss?
- Have you tried something similar? What worked for you and what didn't?
I'd love to hear your experience in the comments. Are you building MCP servers? What's one simple change that made a big difference for you?