KevinTen

Posted on Jun 25

MCP Monitoring: What I Learned Building a Production MCP Server — You need Metrics, Not Just Logs

#ai #opensource #mcp

Let me tell you a secret.

I've built 10+ MCP servers in the past 3 months. I've debugged more CORS issues than I care to remember, I've fought with rate limiting, I've fixed logging problems that took me 8 hours to reproduce, I've even converted legacy OpenAPI APIs to MCP in 50 lines of code.

But honestly? I thought monitoring was overkill.

"It's just a side project," I told myself. "Logs are enough. If something breaks, I'll see it in the logs."

Then last week something broke. And I spent 4 hours digging through logs trying to figure out what went wrong.

That's when I learned the hard way: you need metrics, not just logs. Today I want to share what I learned building proper monitoring for my production MCP server, including all the code you can steal for your own server.

The Problem: When Logs Aren't Enough

Let me set the scene. I've been running my Papers MCP server for a couple months now. It's my personal knowledge base that lets any AI client query my 1,800 hours of notes. It works great 99% of the time.

Last Tuesday morning, I open Claude Desktop, connect to my MCP server, and ask it a question about "MCP best practices".

Nothing. Just spins. Then a timeout.

"Okay, restart the server," I think. I restart. Still times out.

Now I'm panicking a little. I check the logs. What do I see? Thousands of lines of info logs. Some debug messages. But where's the actual error?

Turns out one of my database queries started taking 30+ seconds because the table got bigger and I forgot to create an index. But with just logs, I had to manually go through thousands of lines to find the slow query. If I had metrics, I would've seen it immediately — the query latency graph would've spiked right when the problem started.

That day I promised myself I'd never do this to myself again. So I built proper monitoring. Here's what I learned.

What You Actually Need for MCP Monitoring

Honestly, you don't need a fancy expensive monitoring stack for a personal MCP server. But you do need these four things:

Request counting — how many requests are you getting, from which clients
Latency tracking — how long are your tools taking to respond
Error tracking — how many errors, what kind of errors
Basic health checks — is your server up, is your database connected

That's it. You don't need distributed tracing unless you're running a distributed system. For 99% of MCP servers (especially side projects), this is more than enough.

And the best part? You can add all this to your existing Spring Boot MCP server with three small classes and a few lines of configuration. Let me show you.

The Implementation: Step by Step

I'm using Java Spring Boot for my MCP server, but this pattern works anywhere — Go, Python, Node, whatever. The concepts are the same.

First, we need a filter that wraps every request to collect metrics. Here's what that looks like:

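import java.io.IOException;

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
// Spring Boot 3.x; on Spring Boot 2.x these live under javax.servlet
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class McpMetricsFilter implements Filter {

    private static final Logger logger = LoggerFactory.getLogger(McpMetricsFilter.class);

    private final MeterRegistry meterRegistry;

    public McpMetricsFilter(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {

        HttpServletRequest httpRequest = (HttpServletRequest) request;
        String path = httpRequest.getRequestURI();

        // Only track MCP endpoints
        if (!path.startsWith("/mcp/")) {
            chain.doFilter(request, response);
            return;
        }

        long startTime = System.currentTimeMillis();
        Timer.Sample sample = Timer.start(meterRegistry);
        // Stays "exception" if the request throws before a status is read
        String statusTag = "exception";

        try {
            // Increment request counter. Tagging by raw path is fine for a
            // handful of fixed endpoints; avoid it if paths contain IDs.
            meterRegistry.counter("mcp.requests.total", "endpoint", path)
                    .increment();

            chain.doFilter(request, response);

            int status = ((HttpServletResponse) response).getStatus();
            statusTag = String.valueOf(status);

            // Count status codes
            meterRegistry.counter("mcp.responses.total", "status", statusTag)
                    .increment();

            if (status >= 400) {
                meterRegistry.counter("mcp.errors.total", "status", statusTag)
                        .increment();
            }

        } catch (Exception e) {
            meterRegistry.counter("mcp.exceptions.total",
                    "type", e.getClass().getSimpleName())
                    .increment();
            throw e;
        } finally {
            // Record latency for every request, including ones that threw
            sample.stop(Timer.builder("mcp.request.duration")
                    .tag("endpoint", path)
                    .tag("status", statusTag)
                    .register(meterRegistry));

            long totalTime = System.currentTimeMillis() - startTime;
            // Log slow requests for debugging
            if (totalTime > 10000) { // 10 seconds warning
                logger.warn("Slow request: {} took {}ms", path, totalTime);
            }
        }
    }
}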

This is basic Micrometer stuff, which Spring Boot Actuator auto-configures for you. If you're not using Spring Boot, you can do the same thing with whatever metrics library your framework has. Pretty much every modern framework has something like this.
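To make the "works anywhere" point concrete, here's a minimal sketch of the same idea with nothing but the JDK. SimpleMetrics and its method names are made up for illustration; this is not Micrometer's API and it is not what my server runs. It only shows the core pattern: wrap the work, count it, time it, and count failures.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

// Hypothetical, framework-free stand-in for a metrics registry.
// Keeps a call count, an error count, and total latency per name.
class SimpleMetrics {

    // One set of counters per metric name (for example, per endpoint).
    static final class Stats {
        final LongAdder calls = new LongAdder();
        final LongAdder errors = new LongAdder();
        final LongAdder totalNanos = new LongAdder();
    }

    private final Map<String, Stats> statsByName = new ConcurrentHashMap<>();

    // Runs the action, then records its duration and whether it threw.
    <T> T timed(String name, Supplier<T> action) {
        Stats stats = statsByName.computeIfAbsent(name, key -> new Stats());
        long start = System.nanoTime();
        try {
            return action.get();
        } catch (RuntimeException e) {
            stats.errors.increment();
            throw e;
        } finally {
            // Runs on success and on failure, so every call is counted and timed.
            stats.calls.increment();
            stats.totalNanos.add(System.nanoTime() - start);
        }
    }

    long calls(String name) {
        Stats stats = statsByName.get(name);
        return stats == null ? 0 : stats.calls.sum();
    }

    long errors(String name) {
        Stats stats = statsByName.get(name);
        return stats == null ? 0 : stats.errors.sum();
    }

    // Mean latency in milliseconds, or 0.0 if nothing was recorded.
    double meanMillis(String name) {
        Stats stats = statsByName.get(name);
        if (stats == null || stats.calls.sum() == 0) {
            return 0.0;
        }
        return stats.totalNanos.sum() / 1_000_000.0 / stats.calls.sum();
    }
}
```

You'd wrap a handler with something like metrics.timed("tools/list", () -> handleToolsList()), where handleToolsList is whatever your own server calls. A real metrics library adds tags, histograms, and an export format on top, which is why I'd still reach for one in practice.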

Next, we need to track per-tool metrics, because not all MCP tools are equal. A tools/list request should be fast, but a call to your search_knowledge tool might take longer. Let's add an aspect around the service layer to track individual tool execution:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

// Requires Spring AOP on the classpath (for example spring-boot-starter-aop)
@Aspect
@Component
public class McpToolMetricsAspect {

    private final MeterRegistry meterRegistry;

    public McpToolMetricsAspect(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @Around("execution(* io.github.kevinten10.papers.mcp.service.*.callTool(..))")
    public Object trackToolCall(ProceedingJoinPoint joinPoint) throws Throwable {
        // Assumes the tool name is the first argument of callTool(..)
        String toolName = (String) joinPoint.getArgs()[0];
        Timer.Sample sample = Timer.start(meterRegistry);
        String result = "success";

        try {
            return joinPoint.proceed();
        } catch (Throwable t) {
            result = "error";
            meterRegistry.counter("mcp.tool.errors.total",
                    "tool", toolName,
                    "error", t.getClass().getSimpleName())
                    .increment();
            throw t;
        } finally {
            // Runs for both outcomes, so duration and call count are recorded once
            sample.stop(Timer.builder("mcp.tool.duration")
                    .tag("tool", toolName)
                    .tag("result", result)
                    .register(meterRegistry));
            meterRegistry.counter("mcp.tool.calls.total",
                    "tool", toolName,
                    "result", result)
                    .increment();
        }
    }
}

Boom. Now you can see exactly which tools are slow, which tools are failing, how often they're called. This is gold when something goes wrong.
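One thing worth keeping in mind when you read those per-tool numbers: an average can hide exactly the kind of outlier that bit me. Here's a small sketch, JDK only, of a nearest-rank percentile next to a mean. LatencySamples is a hypothetical helper for illustration, not part of Micrometer; a real setup would let the metrics library or Prometheus compute percentiles for you.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: collects durations and summarizes them.
class LatencySamples {

    private final List<Long> millis = new ArrayList<>();

    void add(long durationMillis) {
        millis.add(durationMillis);
    }

    // Arithmetic mean in milliseconds, or 0.0 when there are no samples.
    double mean() {
        return millis.stream().mapToLong(Long::longValue).average().orElse(0.0);
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    long percentile(double p) {
        if (millis.isEmpty()) {
            throw new IllegalStateException("no samples recorded");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        List<Long> sorted = new ArrayList<>(millis);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.size() / 100.0);
        return sorted.get(Math.max(rank, 1) - 1);
    }
}
```

With nine 100 ms calls and one 30-second call, the median is 100 ms, the mean is 3090 ms, and p95 is 30000 ms. The median says everything is fine, the mean says things are a bit slow, and only the high percentile shows the request that actually timed out.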

Next, add database connection pool metrics. Most connection pools already expose these if you're using Actuator — but make sure you're tracking:

Active connections
Idle connections
Wait time for connections

MCP servers are usually long-lived, so connection leaks will kill you slowly. I once had a connection leak that took me 2 weeks to notice because I wasn't watching this. By the time it crashed, I had to restart anyway.
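If you're wondering what "watching this" could look like in code, here's a rough sketch of one leak signal: the number of active connections never dropping back down over a window of samples. PoolLeakWatcher, the window size, and the threshold are all assumptions for illustration. You would feed it whatever active-connection number your pool reports, and in practice a Grafana panel on the same number does the job without any code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: flags a possible leak when the pool's active
// connection count stays at or above a floor for a whole window.
class PoolLeakWatcher {

    private final int windowSize;
    private final int floorThreshold;
    private final Deque<Integer> recentActive = new ArrayDeque<>();

    PoolLeakWatcher(int windowSize, int floorThreshold) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be at least 1");
        }
        this.windowSize = windowSize;
        this.floorThreshold = floorThreshold;
    }

    // Record one reading of the pool's active connection count.
    void sample(int activeConnections) {
        recentActive.addLast(activeConnections);
        if (recentActive.size() > windowSize) {
            recentActive.removeFirst();
        }
    }

    // True once the window is full and every reading in it was at or
    // above the floor, meaning connections are not being returned.
    boolean suspectsLeak() {
        if (recentActive.size() < windowSize) {
            return false;
        }
        int lowest = Integer.MAX_VALUE;
        for (int active : recentActive) {
            lowest = Math.min(lowest, active);
        }
        return lowest >= floorThreshold;
    }
}
```

A healthy pool on a personal server usually drops to zero active connections between requests, so a floor that creeps up and stays up is the thing to look for. A busy server needs a higher threshold or a longer window.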

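If you're using HikariCP (which you should be in Spring Boot, where it's the default pool), it automatically exposes these metrics through Actuator. You just need to expose the endpoints, and add the micrometer-registry-prometheus dependency if you want the prometheus one: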

# application.properties
management.endpoints.web.exposure.include=health,metrics,prometheus
management.metrics.tags.application=${spring.application.name:papers}
management.endpoint.health.show-details=always

Adding a Health Check Endpoint That Actually Helps

MCP clients need to know if your server is healthy. But most health checks just say "I'm up" — that's not enough. Your server can be up but your database can be down.

Here's what a good MCP health check looks like:

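import java.lang.management.ManagementFactory;
import java.sql.Connection;
import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component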
public class McpHealthIndicator implements HealthIndicator {

    private final DataSource dataSource;
    private final McpToolRegistry toolRegistry;

    public McpHealthIndicator(DataSource dataSource, McpToolRegistry toolRegistry) {
        this.dataSource = dataSource;
        this.toolRegistry = toolRegistry;
    }

    @Override
    public Health health() {
        Map<String, Object> details = new HashMap<>();
        boolean healthy = true;

        // Check database connection
        try (Connection conn = dataSource.getConnection()) {
            if (!conn.isValid(5)) {
                details.put("database", "connection invalid");
                healthy = false;
            } else {
                details.put("database", "ok");
            }
        } catch (Exception e) {
            details.put("database", "connection failed: " + e.getMessage());
            healthy = false;
        }

        // Check that all tools are registered
        int toolCount = toolRegistry.getAllTools().size();
        details.put("registered_tools", toolCount);
        if (toolCount == 0) {
            details.put("tools", "no tools registered");
            healthy = false;
        } else {
            details.put("tools", "ok");
        }

        // Report when the JVM started, so unexpected restarts are easy to spot
        details.put("server_start_time", ManagementFactory.getRuntimeMXBean()
                .getStartTime());

        if (healthy) {
            return Health.up().withDetails(details).build();
        } else {
            return Health.down().withDetails(details).build();
        }
    }
}

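Now when you hit /actuator/health, the details for this indicator look like this (Spring Boot nests them under components.mcp, next to its built-in checks):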

{
  "status": "UP",
  "details": {
    "database": "ok",
    "registered_tools": 5,
    "tools": "ok",
    "server_start_time": 1698000000000
  }
}

If something's wrong, it tells you exactly what. No more guessing.
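The same pattern carries over if you're not on Spring. Here's a minimal JDK-only sketch of it: a list of named checks, each reporting either "ok" or what went wrong, rolled up into one overall status. HealthChecks and HealthReport are hypothetical names for illustration and are not Actuator classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical result type: overall status plus one detail per check.
class HealthReport {
    final boolean up;
    final Map<String, String> details;

    HealthReport(boolean up, Map<String, String> details) {
        this.up = up;
        this.details = details;
    }
}

// Hypothetical registry of named health checks.
class HealthChecks {

    private final Map<String, Supplier<String>> checks = new LinkedHashMap<>();

    // A check returns null when healthy, or a short failure description.
    HealthChecks register(String name, Supplier<String> check) {
        checks.put(name, check);
        return this;
    }

    // Runs every check; the server is "up" only if all of them pass.
    HealthReport run() {
        Map<String, String> details = new LinkedHashMap<>();
        boolean up = true;
        for (Map.Entry<String, Supplier<String>> entry : checks.entrySet()) {
            String failure;
            try {
                failure = entry.getValue().get();
            } catch (RuntimeException e) {
                // A check that blows up counts as a failed check.
                failure = "check threw: " + e.getMessage();
            }
            if (failure == null) {
                details.put(entry.getKey(), "ok");
            } else {
                details.put(entry.getKey(), failure);
                up = false;
            }
        }
        return new HealthReport(up, details);
    }
}
```

The design choice that matters is running every check even after one fails. If the database is down and no tools are registered, you want to see both in one response instead of fixing one and discovering the other.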

Pros and Cons: Is This Worth Adding?

Let's be honest — adding metrics takes a little time. Is it worth it for your side project?

Pros

You'll find problems before your client does — When I added this, I discovered one of my tools was 2x slower than I thought. Fixed it before anyone noticed.
Debugging is 10x faster — Instead of digging through thousands of log lines, you just look at the latency graph. Instantly see where the problem is.
It's cheap — Spring Boot Actuator + Prometheus is free for personal use. You don't need anything fancy.
You learn how your server actually behaves — I was surprised to see that tools/list gets called way more often than tools/call. That changed how I optimized it.

Cons

Adds a few dependencies — If you're trying to keep your project super lean, this adds some weight. But honestly, it's negligible for most projects.
You need somewhere to visualize the metrics — You need Prometheus + Grafana, which is another container to run. But for local development, that's totally fine.
Overkill for toy projects — If you're just building a toy MCP server to try it out, you probably don't need this. But if you're going to run it every day like I do, you need it.

Who should add this?

✅ Running any kind of production/personal daily use MCP server — Do it.
✅ Sharing your MCP server with other people — Definitely do it.
✅ Learning MCP and want to do it properly — Great practice.
❌ Just experimenting, gonna throw it away afterwards — Skip it for now.

The Real Lessons I Learned

After setting this all up, what changed?

Honestly, I sleep better at night. Before, when I'd get a timeout, I'd panic a little bit. Now I just open Grafana, check the latency graph, see exactly what's going on. It takes 2 minutes instead of 4 hours.

But here's the thing that surprised me most: I started optimizing the right things. I always thought search_knowledge was my slowest tool. Turns out tools/list is called 10x more often, and it was taking 200ms because it was rebuilding the tool list every time. Adding caching there made everything feel much snappier. I never would've known that without metrics.
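For the curious, the caching fix is about as boring as it sounds. Here's a minimal sketch of the idea, assuming the tool list only changes when tools are registered or removed. CachedToolList is a hypothetical class written for this post, and it uses plain strings where a real server would hold tool descriptors.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical cache: builds the tool list once and reuses it until
// something calls invalidate().
class CachedToolList {

    private final Supplier<List<String>> loader;
    private volatile List<String> cached;

    CachedToolList(Supplier<List<String>> loader) {
        this.loader = loader;
    }

    // Returns the cached list, building it on first use.
    List<String> get() {
        List<String> current = cached;
        if (current == null) {
            synchronized (this) {
                current = cached;
                if (current == null) {
                    // Immutable copy, so callers cannot modify the cache.
                    current = List.copyOf(loader.get());
                    cached = current;
                }
            }
        }
        return current;
    }

    // Call this whenever a tool is registered or removed.
    synchronized void invalidate() {
        cached = null;
    }
}
```

The loader is whatever expensive code used to run on every tools/list request. The only thing to get right is calling invalidate() wherever the set of tools can change; forget that and clients will see a stale list.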

Another surprise: errors happen when you don't expect them. I noticed that about 5% of my requests come from clients that send invalid JSON. That doesn't break anything, but it's still an error. Now I have a counter for that, and I can see if it's getting worse over time.
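A counter like that is only useful next to the total request count, because "50 bad requests" means very different things at 100 requests and at 100,000. Here's a small sketch of tracking errors by kind as a share of all requests. ErrorRateTracker and the "invalid_json" label are made up for illustration; with Micrometer you'd get the same result from two counters and a division in your dashboard.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper: counts requests and errors by kind, and reports
// each kind as a percentage of all requests.
class ErrorRateTracker {

    private final LongAdder totalRequests = new LongAdder();
    private final Map<String, LongAdder> errorsByKind = new ConcurrentHashMap<>();

    // Call once per incoming request, whether it succeeds or not.
    void recordRequest() {
        totalRequests.increment();
    }

    // Call when a request fails, with a short label for the cause.
    void recordError(String kind) {
        errorsByKind.computeIfAbsent(kind, key -> new LongAdder()).increment();
    }

    // Percentage of all requests that failed with this kind of error.
    double errorPercent(String kind) {
        long total = totalRequests.sum();
        LongAdder errors = errorsByKind.get(kind);
        if (total == 0 || errors == null) {
            return 0.0;
        }
        return 100.0 * errors.sum() / total;
    }
}
```

Twenty requests with one invalid-JSON failure gives errorPercent("invalid_json") of 5.0. Keeping the label set small (a handful of error kinds, not raw exception messages) keeps the numbers readable.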

So here's my advice: If you're building an MCP server that you actually want to use for more than a week, spend an afternoon adding basic monitoring. You'll thank yourself later when something breaks at 10 PM and you don't have to spend half the night debugging.

What's Next?

I've been building MCP servers for 3 months now, and I've learned a lot the hard way. I started with a 2000-line AI search service, and now I run a complete MCP server in under 150 lines of working code. Every time I build one, I learn something new about what works and what doesn't.

Next on my list: Adding alerting. Right now I have to check the graphs manually. I want alerts when latency spikes or error rates go up. But honestly, for personal use, even what I have now is more than enough.
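I haven't built this yet, so take the following as a sketch of the idea and not something I run. The simplest useful alert rule I can think of is "the value has been over the threshold for N checks in a row", which avoids firing on a single slow request. ThresholdAlert is a hypothetical class; a real setup would more likely use alert rules in Prometheus or Grafana.

```java
// Hypothetical alert rule: fires only after the observed value has
// exceeded the threshold for a number of consecutive checks.
class ThresholdAlert {

    private final double threshold;
    private final int requiredBreaches;
    private int consecutiveBreaches;

    ThresholdAlert(double threshold, int requiredBreaches) {
        if (requiredBreaches < 1) {
            throw new IllegalArgumentException("requiredBreaches must be at least 1");
        }
        this.threshold = threshold;
        this.requiredBreaches = requiredBreaches;
    }

    // Feed one observation (for example, p95 latency in milliseconds
    // from the last minute). Returns true while the alert is firing.
    boolean observe(double value) {
        if (value > threshold) {
            consecutiveBreaches++;
        } else {
            // Any healthy reading resets the streak.
            consecutiveBreaches = 0;
        }
        return consecutiveBreaches >= requiredBreaches;
    }
}
```

With a threshold of 1000 ms and three required breaches, readings of 1500, 1200, 3000 fire on the third one, and a single reading of 200 afterwards clears it. The same class works for error rates if you feed it a percentage.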

Your Turn

Have you built an MCP server? Did you add monitoring? What kind of problems have you run into that would've been easier to debug with proper metrics? Drop a comment below and let me know — I'm always curious to hear how other people are solving these problems.

And if you found this helpful, feel free to star the Papers project on GitHub — it's where I'm collecting all these lessons into a working MCP knowledge base server that you can use yourself.
