KevinTen

Posted on Jun 25

MCP Rate Limiting: What I Learned Building a Production MCP Server After Getting Blocked 3 Times

#ai #opensource #mcp

Honestly, I didn't think rate limiting would be the thing that bit me when building my MCP knowledge server. I'd handled auth, error handling, CORS, and deployment: all the "big" problems. But after getting blocked three times in one week by different AI clients, I learned the hard way that MCP isn't just about the protocol, it's about how you handle traffic when things get busy.

Let me walk you through what went wrong, what I fixed, and the code I'm now using that hasn't gotten blocked since.

The Backstory: Why Rate Limiting Even Matters for MCP

If you've been living under a rock like I was a month ago, let me catch you up. The Model Context Protocol (MCP) lets AI clients call tools on your server. For my knowledge base project Papers, that means Claude Desktop, Cursor, and other MCP-compatible clients can search my personal notes and pull relevant context into conversations.

Sounds great, right? It is. But here's what no tutorial tells you:

AI clients will call your MCP endpoints multiple times per conversation
Some clients even make parallel calls for different tools at once
If your server is slow to respond (and on Fly.io's free tier, it can be), clients might retry automatically
Before you know it, you're looking at 50+ requests per minute from a single user

I was running on the cheapest Fly.io instance (shared CPU, 256MB RAM). On three consecutive days, my server was rate limited by Fly, then by Cloudflare, and then blocked by my own reverse proxy. Three different layers, three different mistakes. All preventable.

So I sat down and built proper rate limiting into my MCP server. Here's what I came up with.

The Strategy: Multi-Layer Rate Limiting for MCP

After some thinking, I realized MCP needs a different kind of rate limiting than a regular API, for a few reasons:

Per-client vs per-IP: Most MCP servers use API key auth, so you should rate limit by API key, not just IP
Different limits for different endpoints: /health shouldn't count against your limit, but tools/call definitely should
Graceful degradation: Return proper 429 responses instead of just hanging or crashing
Informative error messages: Tell the client when they'll be unblocked, not just "you're blocked"

I ended up with a simple sliding window implementation in Java Spring Boot. It's not perfect for massive scale, but for 99% of personal MCP servers, it's more than enough.

Here's the core implementation:

import org.springframework.stereotype.Component;
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

@Component
public class SimpleRateLimiter {

    // Per-key log of request times. The lists are plain ArrayLists (not thread-safe),
    // so they are only touched inside compute()/computeIfPresent(), which run atomically per key.
    private final ConcurrentHashMap<String, List<Instant>> requestTimestamps = new ConcurrentHashMap<>();
    private final int requestsPerMinute;
    private final int burstLimit;

    public SimpleRateLimiter() {
        // Default: 30 requests per minute, max burst 40
        this(30, 40);
    }

    public SimpleRateLimiter(int requestsPerMinute, int burstLimit) {
        this.requestsPerMinute = requestsPerMinute;
        this.burstLimit = burstLimit;
    }

    public record RateLimitResult(boolean allowed, long retryAfterSeconds) {}

    public RateLimitResult checkRateLimit(String key) {
        Instant now = Instant.now();
        Instant oneMinuteAgo = now.minus(1, ChronoUnit.MINUTES);
        AtomicReference<RateLimitResult> result = new AtomicReference<>();

        requestTimestamps.compute(key, (k, existing) -> {
            // Get or create the list for this key
            List<Instant> timestamps = (existing != null) ? existing : new ArrayList<>();

            // Remove timestamps older than one minute
            timestamps.removeIf(t -> t.isBefore(oneMinuteAgo));

            // Hard cap: reject once burstLimit requests fall inside the window
            if (timestamps.size() >= burstLimit) {
                // The oldest request leaves the window one minute after it arrived
                Instant oldestExpires = timestamps.get(0).plus(1, ChronoUnit.MINUTES);
                long retryAfter = Duration.between(now, oldestExpires).getSeconds();
                result.set(new RateLimitResult(false, Math.max(1, retryAfter)));
                return timestamps;
            }

            if (timestamps.size() >= requestsPerMinute) {
                // Soft threshold only: requests between requestsPerMinute and burstLimit
                // are still allowed. Warning/logging logic could go here.
            }

            // Add current timestamp
            timestamps.add(now);
            result.set(new RateLimitResult(true, 0));
            return timestamps;
        });
        return result.get();
    }

    // Cleanup old entries periodically to prevent memory leak (for long-running servers)
    public void cleanup() {
        Instant oneMinuteAgo = Instant.now().minus(1, ChronoUnit.MINUTES);
        for (String key : requestTimestamps.keySet()) {
            // Returning null from computeIfPresent removes the entry atomically
            requestTimestamps.computeIfPresent(key, (k, timestamps) -> {
                timestamps.removeIf(t -> t.isBefore(oneMinuteAgo));
                return timestamps.isEmpty() ? null : timestamps;
            });
        }
    }
}

That's the whole rate limiter, fewer than 80 lines. It's simple and effective. One thing to watch: ConcurrentHashMap alone doesn't make this thread-safe, because the per-key ArrayList isn't. That's why every list access happens inside compute() or computeIfPresent(), which run atomically for a given key. Also note that only burstLimit is actually enforced; requestsPerMinute is a soft threshold you can hook a warning onto.
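To see what a sliding window does over time, here's a minimal stand-alone sketch. It is not the Spring bean above: the class name SlidingWindowSketch, the tryAcquire method, and the idea of passing the clock in as a parameter are all assumptions made for illustration, so the behaviour can be checked without waiting a real minute.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone sketch: a sliding-window log whose clock is passed in,
// so the behaviour can be checked without sleeping. Assumes limit >= 1.
class SlidingWindowSketch {
    private static final Duration WINDOW = Duration.ofMinutes(1);
    private final Map<String, Deque<Instant>> log = new HashMap<>();
    private final int limit;

    SlidingWindowSketch(int limit) {
        this.limit = limit;
    }

    // Returns 0 if the request is allowed, otherwise the seconds to wait (at least 1).
    synchronized long tryAcquire(String key, Instant now) {
        Deque<Instant> timestamps = log.computeIfAbsent(key, k -> new ArrayDeque<>());
        Instant windowStart = now.minus(WINDOW);
        // Drop entries that have slid out of the window
        while (!timestamps.isEmpty() && timestamps.peekFirst().isBefore(windowStart)) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= limit) {
            // The oldest entry frees a slot one window after it was recorded
            Duration wait = Duration.between(now, timestamps.peekFirst().plus(WINDOW));
            return Math.max(1, wait.getSeconds());
        }
        timestamps.addLast(now);
        return 0;
    }

    public static void main(String[] args) {
        SlidingWindowSketch limiter = new SlidingWindowSketch(2);
        Instant t0 = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(limiter.tryAcquire("key-a", t0));                 // 0 (allowed)
        System.out.println(limiter.tryAcquire("key-a", t0.plusSeconds(10))); // 0 (allowed)
        System.out.println(limiter.tryAcquire("key-a", t0.plusSeconds(20))); // 40 (blocked until t0 + 60s)
        System.out.println(limiter.tryAcquire("key-a", t0.plusSeconds(61))); // 0 (first entry expired)
    }
}
```

The third call is rejected with a 40 second wait because the oldest request (at t0) only leaves the window at t0 + 60s. That wait is the value that ends up in the Retry-After header.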

Adding It to Your MCP Controller

Next, you need to actually use this in your controller. I added it as a filter that checks before the request hits your endpoint logic. That way, it's out of the way of your business logic.

Here's the filter I added to my Spring Boot app:

import org.springframework.web.filter.OncePerRequestFilter;
import org.springframework.http.HttpStatus;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

// Spring Boot 2.x uses javax.servlet; on Spring Boot 3+ these packages are jakarta.servlet.*
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import java.io.IOException;

@Component
public class RateLimitFilter extends OncePerRequestFilter {

    private final SimpleRateLimiter rateLimiter;

    @Autowired
    public RateLimitFilter(SimpleRateLimiter rateLimiter) {
        this.rateLimiter = rateLimiter;
    }

    @Override
    protected void doFilterInternal(HttpServletRequest request, 
                                    HttpServletResponse response, 
                                    FilterChain filterChain) 
            throws ServletException, IOException {

        // Only rate limit MCP endpoints, not health or static
        String path = request.getRequestURI();
        if (!path.startsWith("/mcp/")) {
            filterChain.doFilter(request, response);
            return;
        }

        // Get API key from any of the common locations (remember my authentication article?)
        String apiKey = getApiKeyFromRequest(request);
        if (apiKey == null) {
            filterChain.doFilter(request, response);
            return; // Let the auth filter handle it
        }

        // Check rate limit
        SimpleRateLimiter.RateLimitResult result = rateLimiter.checkRateLimit(apiKey);
        if (!result.allowed()) {
            response.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
            response.setHeader("Retry-After", String.valueOf(result.retryAfterSeconds()));
            // Declare the body as JSON so clients don't have to guess
            response.setContentType("application/json");
            response.getWriter().write("{\"error\": \"rate_limit_exceeded\", \"retry_after\": " + result.retryAfterSeconds() + "}");
            return;
        }

        // All good, proceed
        filterChain.doFilter(request, response);
    }

    private String getApiKeyFromRequest(HttpServletRequest request) {
        // Check X-API-Key header first
        String apiKey = request.getHeader("X-API-Key");
        if (apiKey != null && !apiKey.isEmpty()) {
            return apiKey;
        }

        // Check Authorization header
        String authHeader = request.getHeader("Authorization");
        if (authHeader != null && authHeader.startsWith("Bearer ")) {
            return authHeader.substring(7);
        }

        // Check query parameter (some clients use this)
        apiKey = request.getParameter("api_key");
        if (apiKey != null && !apiKey.isEmpty()) {
            return apiKey;
        }
        apiKey = request.getParameter("apiKey");
        if (apiKey != null && !apiKey.isEmpty()) {
            return apiKey;
        }

        return null;
    }
}

What I love about this approach:

It's orthogonal to your actual MCP implementation—add it, remove it, no impact on your business logic
It respects all the different places clients put API keys (after my last article on auth, you know why this matters)
It returns the standard Retry-After header so clients know when to retry
It returns a JSON error response that MCP clients can parse
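For the other side of that Retry-After header, here's a rough sketch of how a client could turn it into a wait time. This is an illustration only: RetryAfterSketch and parseRetryAfter are hypothetical names, and I'm not claiming any particular MCP client works this way. The header can carry either a number of seconds (what the filter above sends) or an HTTP date, so the sketch handles both and falls back to a default when the value is missing or unparseable.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical client-side helper, not part of any MCP SDK.
class RetryAfterSketch {

    // Converts a Retry-After header value into how long to wait before retrying.
    static Duration parseRetryAfter(String headerValue, Instant now, Duration fallback) {
        if (headerValue == null || headerValue.isBlank()) {
            return fallback;
        }
        String value = headerValue.trim();
        try {
            // Form 1: delay in seconds, e.g. "12"
            return Duration.ofSeconds(Math.max(0, Long.parseLong(value)));
        } catch (NumberFormatException notSeconds) {
            // Not a number, try the HTTP-date form below
        }
        try {
            // Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
            Instant retryAt = ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
            Duration wait = Duration.between(now, retryAt);
            return wait.isNegative() ? Duration.ZERO : wait;
        } catch (DateTimeParseException unparseable) {
            return fallback;
        }
    }
}
```

A client that sleeps for this duration before retrying stops hammering the server during the blocked window, which is the whole point of sending the header.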

The Things They Don't Tell You (My Hard Lessons)

Okay, so I got the rate limiter working. But along the way, I learned a few things that surprised me. Let me save you the pain:

1. MCP Clients Really Do Retry Aggressively

I found that some popular MCP clients retry automatically if your response takes more than 10 seconds. On cheap hosting, cold starts can take 10-15 seconds. So if you have a cold start, the client retries, you get two requests processing, then both finish, then the client might retry again... you get the picture.

Rate limiting prevents that cascade from taking down your server. Before I added rate limiting, I once had 17 parallel requests from one conversation on a cold start. Not fun.

2. Clean Up Your Old Entries or You'll Leak Memory

The sliding window approach I used stores timestamps per client. If you have thousands of clients, those maps can grow. I added a simple scheduled cleanup task that runs once an hour:

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class RateLimiterCleanupJob {

    private final SimpleRateLimiter rateLimiter;

    public RateLimiterCleanupJob(SimpleRateLimiter rateLimiter) {
        this.rateLimiter = rateLimiter;
    }

    // Needs @EnableScheduling on a configuration class, otherwise this method never runs
    @Scheduled(fixedRate = 60 * 60 * 1000) // Run once per hour
    public void cleanupOldEntries() {
        rateLimiter.cleanup();
    }
}

It just removes entries that have no active requests in the last minute. For personal projects, this is more than enough. No memory leaks after three months of running.

3. Different Endpoints Need Different Limits

Not all MCP requests are equal. tools/list is cheap—it just returns your tool definitions. tools/call can actually do heavy lifting (searching through thousands of notes, in my case). So I ended up modifying the rate limiter to allow different limits per endpoint:

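// In the filter, key on API key + endpoint so each endpoint gets its own window:
String rateKey = apiKey + ":" + path;
SimpleRateLimiter.RateLimitResult result;

// Call the limiter exactly once per request; a second call would count the request twice.
// This assumes you add an overload checkRateLimit(key, requestsPerMinute, burstLimit) to SimpleRateLimiter.
if (path.contains("/tools/call")) {
    // Lower limit for expensive calls
    result = rateLimiter.checkRateLimit(rateKey, 15, 20);
} else if (path.contains("/tools/list")) {
    // Higher limit for cheap discovery calls
    result = rateLimiter.checkRateLimit(rateKey, 60, 80);
} else {
    // Default limits (30/40) for all other MCP endpoints
    result = rateLimiter.checkRateLimit(rateKey);
}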

Super simple change, but it makes a big difference. You don't want someone calling /tools/list every second eating up all your rate limit for actual tool calls.
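The snippet above relies on a limiter method that takes the limit per call, which the original class doesn't show. Here's one way that could look, as a stand-alone sketch under assumptions: PerEndpointLimiterSketch and its tryAcquire signature are hypothetical, the clock is passed in to keep it testable, and only the burst limit is enforced (matching how the original class treats its hard cap).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical variant of the limiter where the burst limit is supplied per call.
class PerEndpointLimiterSketch {
    private static final Duration WINDOW = Duration.ofMinutes(1);
    private final ConcurrentHashMap<String, Deque<Instant>> log = new ConcurrentHashMap<>();

    // Returns true if this request fits under burstLimit for the (client, endpoint) pair.
    boolean tryAcquire(String apiKey, String path, int burstLimit, Instant now) {
        String rateKey = apiKey + ":" + path; // one window per client and endpoint
        boolean[] allowed = new boolean[1];
        // compute() is atomic per key, so the deque is never mutated by two threads at once
        log.compute(rateKey, (k, existing) -> {
            Deque<Instant> timestamps = (existing != null) ? existing : new ArrayDeque<>();
            Instant windowStart = now.minus(WINDOW);
            while (!timestamps.isEmpty() && timestamps.peekFirst().isBefore(windowStart)) {
                timestamps.pollFirst();
            }
            if (timestamps.size() < burstLimit) {
                timestamps.addLast(now);
                allowed[0] = true;
            }
            return timestamps;
        });
        return allowed[0];
    }
}
```

Because the endpoint is part of the key, exhausting the tools/list window leaves the tools/call window untouched, and the other way around.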

4. What Limit Should You Actually Use?

I tested a bunch of limits, and for personal MCP servers, this sweet spot works:

Endpoint	Requests per Minute	Burst Limit
`tools/call`	15	20
`tools/list`	60	80
All other MCP	30	40

Why? Because in a typical conversation with your AI assistant, you're not making more than 10-15 tool calls per minute anyway. Even if you are, 15 is enough for normal usage, and the burst limit handles those moments where the client makes parallel calls.

I started with 60/minute, but got hit again when a client did a bunch of retries. 15/minute for expensive calls is more than enough for personal use, and it keeps your hosting provider happy.
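If you'd rather keep those numbers in one place than scatter them through if/else branches, the table can be expressed as a small lookup. This is just a sketch: EndpointLimitsSketch, Limits, and limitsFor are made-up names, and matching on path.contains(...) simply mirrors the path checks used earlier in this article.

```java
// Hypothetical lookup for the limits table; the values are the ones from the table above.
class EndpointLimitsSketch {

    record Limits(int requestsPerMinute, int burstLimit) {}

    static final Limits TOOLS_CALL = new Limits(15, 20); // expensive: runs the actual search
    static final Limits TOOLS_LIST = new Limits(60, 80); // cheap: returns tool definitions
    static final Limits DEFAULT = new Limits(30, 40);    // all other MCP endpoints

    static Limits limitsFor(String path) {
        if (path.contains("/tools/call")) {
            return TOOLS_CALL;
        }
        if (path.contains("/tools/list")) {
            return TOOLS_LIST;
        }
        return DEFAULT;
    }

    public static void main(String[] args) {
        System.out.println(limitsFor("/mcp/tools/call").burstLimit()); // 20
        System.out.println(limitsFor("/mcp/resources/read"));          // Limits[requestsPerMinute=30, burstLimit=40]
    }
}
```

Changing a limit then means editing one constant instead of hunting through the filter.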

Pros and Cons of This Approach

Let me be honest—this isn't the fanciest rate limiter in the world. But does it work for most people building MCP servers? Absolutely.

Pros

Simple: Less than 100 lines of code total, easy to understand, easy to modify
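Thread-safe if you're careful: ConcurrentHashMap protects the map itself, but the per-key ArrayList is not thread-safe on its own, so it must only be touched inside an atomic per-key operation such as compute() (or a synchronized block)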
Standards-compliant: Returns 429 Too Many Requests with Retry-After header, proper JSON error
Per-client (API key) limiting: Works correctly when multiple users use your MCP server
No extra dependencies: Doesn't need Redis, doesn't need any third-party rate limiting libraries (works with plain Spring Boot)
Memory efficient: Cleans up old entries automatically, doesn't leak memory

Cons

In-memory only: Doesn't work if you're running multiple instances behind a load balancer (each instance has its own counters)
One timestamp per request: This is a sliding window log, so counting is accurate, but memory grows with the number of requests in the window rather than staying constant per client
Not designed for thousands of clients: If you're running a public MCP service with thousands of users, you'll want something more scalable like Redis

Like I said—if you're building a personal MCP server like I am, this is perfect. If you're running a production service for thousands of users, you probably already know you need Redis anyway.

What I'd Do Differently Next Time

Looking back, I wish I'd added rate limiting before I got blocked three times. It's such a simple thing, but it makes your server feel so much more production-ready.

If I were starting over, I'd also:

Add different limits for different users (admin gets higher limits, guests get lower)
Add logging when rate limits are hit, so I can see if it's working properly
Expose a metrics endpoint to see how many requests are being rate limited
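The logging and metrics items could start as small as a counter that the filter bumps whenever it returns a 429. A sketch of that idea, with all names (RateLimitMetricsSketch, recordRejection, snapshot) invented for illustration. It counts by endpoint path rather than by API key, on the assumption that you don't want secrets showing up in logs or a metrics endpoint.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical rejection counter; a filter would call recordRejection(path) before sending a 429.
class RateLimitMetricsSketch {
    private final ConcurrentHashMap<String, LongAdder> rejectedByEndpoint = new ConcurrentHashMap<>();

    void recordRejection(String path) {
        // LongAdder keeps increments cheap when many threads hit the same endpoint
        rejectedByEndpoint.computeIfAbsent(path, p -> new LongAdder()).increment();
    }

    // Point-in-time copy, sorted by path, suitable for logging or a metrics endpoint
    Map<String, Long> snapshot() {
        Map<String, Long> copy = new TreeMap<>();
        rejectedByEndpoint.forEach((path, count) -> copy.put(path, count.sum()));
        return copy;
    }

    public static void main(String[] args) {
        RateLimitMetricsSketch metrics = new RateLimitMetricsSketch();
        metrics.recordRejection("/mcp/tools/call");
        metrics.recordRejection("/mcp/tools/call");
        metrics.recordRejection("/mcp/tools/list");
        System.out.println(metrics.snapshot()); // {/mcp/tools/call=2, /mcp/tools/list=1}
    }
}
```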

But honestly, for my personal knowledge base server, what I have now works perfectly. No more unexpected blocks, no more cascading retries killing my instance.

Have You Hit Rate Limiting Issues With MCP?

Honestly, building MCP servers has been a lot of learning by doing. The protocol is simple on the surface, but all the operational details like error handling, auth, CORS, and rate limiting are things you have to figure out the hard way when something breaks.

I've been writing up every lesson I learn building my MCP knowledge server—this is actually my 82nd article on Dev.to about the project, can you believe it? If you're curious, the whole thing is open source on GitHub: kevinten10/Papers.

What about you—building an MCP server? Hit any unexpected rate limiting issues? Have a better approach than the simple in-memory sliding window I used here? Drop a comment below and let me know—I'd love to hear what's working for you.
