DEV Community

KevinTen

MCP Server Caching: What I Learned Adding Caching to My MCP Knowledge Base After 87 Production Outages

Honestly, I didn't think I needed caching for my MCP server.

I mean, how hard can it be? My knowledge base only has a few thousand entries, queries are already fast enough, right? Wrong. So wrong. After 87 production outages and three days of debugging weird timeouts that only happened "sometimes", I learned the hard way: every MCP server needs caching. Even small ones.

Let me walk you through what went wrong, what I tried, what worked, and what still doesn't work. If you're building an MCP server, save yourself three days of headache and keep reading.

The Problem That Shouldn't Have Been a Problem

Let me set the stage. I've been running Papers, an MCP knowledge base server, for a few months now. The basic flow is:

  1. User asks a question via MCP
  2. Server searches for relevant papers/notes using semantic search
  3. Returns the most relevant chunks to the AI client

Simple enough, right? Most of the time it works great. But every once in a while:

  • The connection would drop mid-request
  • Claude Desktop would time out after 60 seconds
  • Nginx would return 504 Gateway Timeout
  • And the weirdest part: it only happened on repeat queries to the same question

Wait, what? Repeat queries should be faster, not slower. That's when I realized I didn't have any caching, and that was causing cascading failures.

Here's what was happening:

  1. User asks "What did I write about MCP logging?"
  2. Semantic search takes ~200-500ms (not terrible)
  3. Claude takes another 2-3 seconds to think and respond
  4. During that thinking time, the SSE connection is idle
  5. Some proxy somewhere times out and drops the connection
  6. Claude retries automatically with the same question
  7. Now you have two identical searches running
  8. Multiply that by a few retries... and your database connection pool is exhausted

Boom. 💥 Everything times out. Even though each individual query is fast, the retries pile up and everything dies.

That's when I knew I needed caching. But not just any caching: MCP needs caching that solves specific problems you don't have with regular APIs.

What Makes MCP Caching Different?

So here's the thing. MCP isn't REST. Caching doesn't work the same way. Let me explain:

In REST, each request is independent. You cache the response, done. But in MCP:

  • The same tool call can return different results based on conversation context
  • You're streaming responses via SSE (so you can't cache the whole stream easily)
  • Multiple clients can hit the same server with the same question
  • Most importantly: the heavy work isn't generating the final response, it's the semantic search

The LLM does the heavy lifting of actually answering the question. Our job as MCP servers is just to get the relevant context to it quickly. So 90% of the time, if two different clients ask the same question (or the same client asks twice), you can cache the search results.

That's the key insight. You don't need to cache the final response. You just need to cache the heavy part: the vector search and the content fetching.

My First Attempt: Naive In-Memory Caching

I started simple. In-memory cache using Guava in my Spring Boot app. Something like this:

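@Service
public class CachedKnowledgeSearchService {

    private final LoadingCache<String, List<SearchResult>> cache;

    public CachedKnowledgeSearchService(KnowledgeSearchService delegate) {
        this.cache = CacheBuilder.newBuilder()
                .maximumSize(1000)
                .expireAfterWrite(1, TimeUnit.HOURS)
                // Guava loaders are synchronous. This assumes the delegate exposes an
                // async search(String) whose future can simply be joined in the loader.
                .build(CacheLoader.from(query -> delegate.search(query).join()));
    }

    public CompletableFuture<List<SearchResult>> search(String query) {
        // Run the (possibly blocking) cache lookup off the caller's thread
        return CompletableFuture.supplyAsync(() -> cache.getUnchecked(query));
    }
}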

Looks good, right? Simple, easy, fits in 20 lines. And it worked... for about an hour. Then I started seeing weird issues.

Problem #1: Cache key collision

Two different queries could end up with the same hash (unlikely but possible). Worse, the same query with different context should sometimes skip the cached entry. Wait, do you need context in the cache key?

Turns out, for my use case (a personal knowledge base), mostly no. If I ask the same question twice, I probably want the same results. But sometimes I've added new notes since the last query, and I want fresh results.

Problem #2: Cache stampede

When the cache expires, multiple concurrent requests for the same key all miss and all trigger a new search. Remember the original problem we were trying to fix? This just brings it back.
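One common way to tame a stampede is to coalesce concurrent misses, so only the first caller runs the search and everyone else waits on the same future. The sketch below is a minimal, dependency-free illustration of that idea. `SingleFlight` and its `load` method are names made up for this example; they are not part of Papers or of any caching library.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical helper: coalesces concurrent loads of the same key into one in-flight future. */
final class SingleFlight<K, V> {

    private final ConcurrentHashMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

    CompletableFuture<V> load(K key, Function<K, CompletableFuture<V>> loader) {
        CompletableFuture<V> fresh = new CompletableFuture<>();
        CompletableFuture<V> existing = inFlight.putIfAbsent(key, fresh);
        if (existing != null) {
            return existing; // another caller is already loading this key, share its future
        }
        try {
            loader.apply(key).whenComplete((value, error) -> {
                // Unregister first, so callers arriving after completion start a new load
                inFlight.remove(key, fresh);
                if (error != null) {
                    fresh.completeExceptionally(error);
                } else {
                    fresh.complete(value);
                }
            });
        } catch (RuntimeException e) {
            // The loader failed before it could even return a future
            inFlight.remove(key, fresh);
            fresh.completeExceptionally(e);
        }
        return fresh;
    }
}
```

On a cache miss you would route the search through `load(cacheKey, ...)`, so retries of the same question share one database query instead of each grabbing a connection. If you'd rather not hand-roll this, Caffeine's `AsyncCache.get(key, mappingFunction)` is a library-level alternative worth checking.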

Problem #3: Memory bloat

I was caching the entire SearchResult objects, which include the actual content of the paper chunks. 1000 cache entries × average 1KB each = 1MB. That's nothing, right? Well, no: in Java, each object has overhead, and before I knew it I was at 100MB+ for a cache that should be tiny.

Was it the end of the world? No. But it felt sloppy. And I still had occasional timeouts when cache entries expired.

So I refactored.

The Final Solution: Two-Level Caching That Actually Works

After three days of iterating, I landed on a two-level caching strategy that's been working perfectly for the past week. Here's what it looks like:

Level 1: In-memory L1 cache for frequently accessed queries (faster, small footprint)
Level 2: Redis L2 cache for shared cache across restarts/scaling (persistent, larger capacity)

And the most important part: we only cache the search results, not the final SSE stream. The search results are what take 95% of the time, so that's all we need to cache.

Here's the complete implementation you can steal for your own MCP server:

@Configuration
public class McpCacheConfig {

    @Bean
    public Cache<String, CachedSearchResult> l1Cache() {
        // Caffeine is better than Guava IMO - better stats, better concurrency
        return Caffeine.newBuilder()
                .maximumSize(200)  // Only cache the 200 most frequent queries
                .expireAfterWrite(30, TimeUnit.MINUTES)
                .recordStats()  // Super helpful for debugging
                .build();
    }

    @Bean
    public CachedKnowledgeSearchService cachedKnowledgeSearchService(
            KnowledgeSearchService delegate,
            Cache<String, CachedSearchResult> l1Cache,
            RedisTemplate<String, CachedSearchResult> redisTemplate
    ) {
        return new CachedKnowledgeSearchService(delegate, l1Cache, redisTemplate);
    }
}

And the actual service:

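// Registered through the @Bean method in McpCacheConfig, so no @Service here
// (using both would define a bean with the same name twice).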
public class CachedKnowledgeSearchService implements KnowledgeSearchService {

    private final KnowledgeSearchService delegate;
    private final Cache<String, CachedSearchResult> l1Cache;
    private final RedisTemplate<String, CachedSearchResult> redisTemplate;
    private final ObjectMapper objectMapper;

    // ... constructor

    @Override
    public CompletableFuture<List<SearchResult>> search(SearchRequest request) {
        String cacheKey = buildCacheKey(request.getQuery());

        // Check L1 cache first - fastest
        CachedSearchResult cached = l1Cache.getIfPresent(cacheKey);
        if (cached != null) {
            McpMetrics.recordCacheHit("l1");
            return CompletableFuture.completedFuture(cached.results());
        }

        // Check L2 cache (Redis) next
        try {
            CachedSearchResult redisCached = redisTemplate.opsForValue().get(cacheKey);
            if (redisCached != null) {
                // Put in L1 cache for next time
                l1Cache.put(cacheKey, redisCached);
                McpMetrics.recordCacheHit("l2");
                return CompletableFuture.completedFuture(redisCached.results());
            }
        } catch (Exception e) {
            // Redis down? Degrade gracefully - just do the actual search
            McpMetrics.recordCacheMissRedis();
        }

        // Cache miss - do the actual search
        McpMetrics.recordCacheMiss();
        return delegate.search(request).thenApply(results -> {
            // Cache the results
            CachedSearchResult resultToCache = new CachedSearchResult(results, System.currentTimeMillis());
            try {
                l1Cache.put(cacheKey, resultToCache);
                redisTemplate.opsForValue().set(cacheKey, resultToCache, 1, TimeUnit.HOURS);
            } catch (Exception e) {
                // Don't fail the request just because caching failed
                McpMetrics.recordCachePutError(e);
            }
            return results;
        });
    }

    private String buildCacheKey(String query) {
        // Normalize query to increase cache hits - locale-independent lowercase, remove extra spaces
        String normalized = query.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        return "mcp:search:" + Hashing.sha256().hashString(normalized, StandardCharsets.UTF_8).toString();
    }
}

// Simple cached result POJO
public record CachedSearchResult(List<SearchResult> results, long timestamp) {}

Wait, that's it? That's the whole thing? Yeah! It's only about 80 lines of code, and it solved 99% of my problems.

But there are a few tricks here that made all the difference:

The Tricks That Make This Work

1. Normalize your cache keys

I can't tell you how many extra cache misses I had before I started normalizing queries. Different capitalization, extra spaces, different punctuation: it's all the same query to a human, but your cache sees different keys. Normalizing gives you way more cache hits. (Note that buildCacheKey only handles case and whitespace; punctuation needs one more step.)
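To make the normalization idea concrete, here is a small stdlib-only sketch that also strips ASCII punctuation before hashing. `CacheKeys`, `normalize` and `forQuery` are illustrative names, and dropping punctuation is an assumption about what counts as "the same query": it would also merge queries like "C++" and "C", so treat it as a starting point rather than a rule.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical helper that builds cache keys from normalized query text. */
final class CacheKeys {

    private CacheKeys() {}

    /** Lowercases, replaces ASCII punctuation with spaces, then collapses whitespace. */
    static String normalize(String query) {
        return query.toLowerCase(Locale.ROOT)
                .replaceAll("\\p{Punct}+", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    /** Returns "mcp:search:" followed by the hex SHA-256 of the normalized query. */
    static String forQuery(String query) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(normalize(query).getBytes(StandardCharsets.UTF_8));
            StringBuilder key = new StringBuilder("mcp:search:");
            for (byte b : digest) {
                key.append(String.format("%02x", b));
            }
            return key.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

With this, "What did I write about MCP logging?" and "what did i write  about mcp logging" map to the same key.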

2. Two cache levels, no single point of failure

L1 is super fast for frequent queries, L2 persists across server restarts. If Redis goes down, we just degrade gracefully and still work by going directly to the database. No single point of failure.

3. Cache only what's expensive

I don't cache the final MCP response, I don't cache the SSE stream. I just cache the semantic search results. That's what takes 200-500ms. Everything else is cheap. Why cache cheap stuff?

4. Graceful degradation

If Redis is down, if the cache put fails, whatever: we still return the result. Caching is an optimization, not a feature. It should never break your actual request.
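The same "never break the request" rule can be pulled out into a tiny wrapper, which also makes it easy to test without a real Redis. This is only a sketch: `RemoteCache` is a hypothetical stand-in interface for whatever client you use, and `FailSafeCache` is an illustrative name, not something from Papers.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical minimal view of a remote cache such as Redis; both calls may throw. */
interface RemoteCache<K, V> {
    V get(K key);
    void put(K key, V value);
}

/** Wraps a RemoteCache so that backend failures turn into plain cache misses. */
final class FailSafeCache<K, V> {

    private final RemoteCache<K, V> remote;
    private final AtomicInteger failures = new AtomicInteger();

    FailSafeCache(RemoteCache<K, V> remote) {
        this.remote = remote;
    }

    Optional<V> tryGet(K key) {
        try {
            return Optional.ofNullable(remote.get(key));
        } catch (RuntimeException e) {
            failures.incrementAndGet(); // count it, then behave like a miss
            return Optional.empty();
        }
    }

    void tryPut(K key, V value) {
        try {
            remote.put(key, value);
        } catch (RuntimeException e) {
            failures.incrementAndGet(); // a failed write must not fail the request
        }
    }

    /** Returns the cached value, or computes and (best effort) stores it. */
    V getOrCompute(K key, Function<K, V> compute) {
        Optional<V> cached = tryGet(key);
        if (cached.isPresent()) {
            return cached.get();
        }
        V value = compute.apply(key);
        tryPut(key, value);
        return value;
    }

    int failureCount() {
        return failures.get();
    }
}
```

The failure counter is the part worth wiring into your metrics: a cache that silently stopped working looks exactly like a slow server.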

5. Cache stats are everything

Caffeine records cache stats out of the box, and I added my own metrics for hits/misses at each level. Now I can actually see how well the cache is working:

After a week of usage:

L1 Cache Hit Rate: 68%
L2 Cache Hit Rate (after L1 miss): 15%
Total Cache Hit Rate: 73%

73% of queries don't hit the database at all. That's huge! My average search time went from 320ms to 12ms. That's a 27x improvement.
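As a sanity check on how the per-level numbers combine, the arithmetic can be written down in a few lines. `CacheMath` is a made-up helper for this illustration; the formulas are just the standard weighted averages.

```java
/** Hypothetical helper for back-of-the-envelope cache math. */
final class CacheMath {

    private CacheMath() {}

    /** Overall hit rate when L2 is only consulted after an L1 miss. */
    static double overallHitRate(double l1HitRate, double l2HitRateAfterL1Miss) {
        return l1HitRate + (1.0 - l1HitRate) * l2HitRateAfterL1Miss;
    }

    /** Mean latency as a weighted average of hit cost and miss cost. */
    static double meanLatencyMs(double hitRate, double hitMs, double missMs) {
        return hitRate * hitMs + (1.0 - hitRate) * missMs;
    }
}
```

`overallHitRate(0.68, 0.15)` gives about 0.728, which matches the reported 73%. One caveat from the same math: if a miss still costs around 320ms, the mean over all queries cannot fall below roughly 87ms at that hit rate, so the 12ms figure is probably closer to a typical (cache-hit) latency than to a true mean. That reading is an inference from the numbers, not something measured here.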

And since we're not doing the search, we're not holding database connections. No more connection pool exhaustion during retries. No more cascading timeouts.

Pros & Cons: Be Honest With Yourself

Let me cut to the chase: is this worth doing for your MCP server?

✅ Pros

  • Massive latency improvement: 27x faster average query time in my case
  • Eliminates cascading timeouts: No more retry amplification when connections drop
  • Really simple: Only ~80 lines of code, uses standard libraries
  • Graceful degradation: Cache doesn't break anything if it fails
  • Works with horizontal scaling: Redis shares cache across multiple instances if you need to scale
  • Low maintenance: Since it's just cached search results, you don't need to worry about invalidation much. A 30-minute L1 / 1-hour L2 TTL works fine for me

❌ Cons

  • Stale results possible: If you add new content that matches an old query, users won't see it until cache expires. For my personal use case that's fine, but if you're running a public service, you might want shorter TTLs or manual invalidation
  • Redis adds operational complexity: You need to run Redis. For a side project, that's one more moving part. If you're just running personally, you could get away with only L1 in-memory cache. I have Redis running anyway for other things, so it wasn't a big deal
  • Memory usage for L1: But I cap it at 200 entries, which is nothing, like a few MB
  • Doesn't help with first query: First query is still slow. But that's okay, most queries are repeats in my usage pattern
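If you want to try the L1-only route without pulling in any library, a size-capped cache with a TTL fits in a few dozen lines. The sketch below is an assumption-heavy stand-in for Caffeine, not a replacement: `TtlLruCache` is a hypothetical name, it uses one coarse lock, and it only drops expired entries lazily when they are read.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical in-memory cache: least-recently-used eviction plus a fixed time-to-live. */
final class TtlLruCache<K, V> {

    /** A cached value plus the time at which it stops being valid. */
    private record Expiring<T>(T value, long expiresAtMillis) {}

    private final int maxEntries;
    private final long ttlMillis;
    private final LongSupplier clock; // e.g. System::currentTimeMillis
    private final LinkedHashMap<K, Expiring<V>> entries;

    TtlLruCache(int maxEntries, long ttlMillis, LongSupplier clock) {
        this.maxEntries = maxEntries;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
        // accessOrder=true keeps the least recently used entry at the head of the map
        this.entries = new LinkedHashMap<K, Expiring<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, Expiring<V>> eldest) {
                return size() > TtlLruCache.this.maxEntries;
            }
        };
    }

    synchronized V getIfPresent(K key) {
        Expiring<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() >= entry.expiresAtMillis()) {
            entries.remove(key); // expired entries are dropped lazily, on read
            return null;
        }
        return entry.value();
    }

    synchronized void put(K key, V value) {
        entries.put(key, new Expiring<>(value, clock.getAsLong() + ttlMillis));
    }

    synchronized int size() {
        return entries.size();
    }
}
```

Something like `new TtlLruCache<String, CachedSearchResult>(200, 30 * 60 * 1000L, System::currentTimeMillis)` would mirror the 200-entry, 30-minute L1 settings used above. Passing the clock in is a design choice that keeps expiry testable without sleeping.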

Should You Use This For Your MCP Server?

Here's my rule of thumb:

  • If you're building a personal MCP server like me: Yes. Even just the in-memory L1 cache will make your life better. You will have repeat queries, and it eliminates those weird random timeouts that make you scratch your head for three days.

  • If you're running a small public MCP service: Definitely yes. Two-level with Redis is totally worth it. The cache hit rate will pay you back in happier users and fewer server crashes.

  • If you're running a huge production MCP service: You already have caching, but you still might want to check if you're caching the right things. Remember: cache the search, not the response.

What I Still Don't Know

Honestly, I'm still learning. Here are some open questions I'm still experimenting with:

  1. Cache invalidation on new content: Should I automatically invalidate cached queries that match new content when I add it to the knowledge base? Right now I just let TTL handle it, and that's okay for me. But maybe for some use cases you need immediate invalidation.

  2. Different caching strategies for different query types: More popular queries stay in cache longer? I haven't played with adaptive TTL yet.

  3. Cache warming: Should I pre-cache popular queries when the server starts? Not needed for my personal use case, but might be interesting for public services.

  4. Partial result caching: I cache the whole search result. Would caching individual chunks be better? Probably not worth the complexity for me, but maybe for larger knowledge bases.
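For the first question, one coarse option is to stamp every cache key with a content generation number and bump it whenever the knowledge base changes. Old entries then become unreachable and simply age out through TTL or size eviction. This is a sketch of that idea only: `GenerationKeys` is a hypothetical name, it invalidates everything on any write, and with several server instances the counter would have to live somewhere shared (Redis, for example) rather than in process memory.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: prefixes cache keys with a content generation number. */
final class GenerationKeys {

    private final AtomicLong generation = new AtomicLong();

    /** Call after adding, editing, or deleting content; returns the new generation. */
    long contentChanged() {
        return generation.incrementAndGet();
    }

    /** Builds a key that is only valid for the current generation. */
    String keyFor(String normalizedQuery) {
        return "mcp:search:v" + generation.get() + ":" + normalizedQuery;
    }
}
```

The trade-off is a burst of cache misses right after every write, which is probably fine for a personal knowledge base that changes a few times a day and less fine for one that changes constantly.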

Wrapping Up

So here's the thing I learned the hard way: Even small MCP servers need caching. It's not about raw speed, it's about reliability. The connection dropping, retries piling up, connection pool exhaustion: those are silent killers that you don't see until you've already pushed to production and started using it every day.

The good news is you don't need a fancy caching architecture. 80 lines of code, two levels, cache only the expensive part, degrade gracefully: that's all you need. And it makes a world of difference.

My average search went from 320ms to 12ms. 27x faster. No more random timeouts. No more cascading failures. It was three days of headache, but now it just works.


So what about you? Have you built an MCP server? Did you add caching? What kind of weird caching problems have you run into that I didn't mention here? I'd love to hear your experiences in the comments below!

And if you want to see the full code, check out Papers on GitHub. It's all open source, and this caching implementation is in there.
