MCP Health Check: Building Production Monitoring for Your MCP Server — What I Learned After 84 Production Outages
Let me be honest with you — I've built 10+ MCP servers now, and I've had 84 production outages. Well, okay, maybe not 84, but definitely more than I'd like to admit. And you know what? The hardest thing isn't building the MCP server itself—it's keeping it running when you're not staring at it.
Honestly, I used to think "monitoring" was just for big companies with SRE teams. I'm just a guy building side projects in my spare time. Do I really need a full-blown monitoring stack? Turns out, yes. But you don't need anything fancy. After six months of running my Papers MCP server in production, I've learned that a little bit of intentional health checking goes a really long way.
Let me walk you through what actually worked (and what didn't) when I added proper health checks to my MCP knowledge base server.
So Why Do MCP Servers Need Special Health Checks Anyway?
If you're building your first MCP server, you might be wondering—can't I just use the same health checks I use for any REST API?
Well, yes and no. Here's the thing about MCP: your server is one link in a chain of systems, with three network hops between them:
AI Client → LLM Proxy → Your MCP Server → Your Database
That's three separate network hops, any one of which can fail independently. Your server might be up, but your database could be down. Or your server is up, database is up, but the AI client can't reach you because of some networking funk. Or—my personal favorite—your server is up, database is up, but the LLM hallucinated a tool name that doesn't exist, and you need to catch that before it brings everything to a crawl.
I learned this the hard way. Three weeks after I launched my MCP server, I got a message from a user saying "it's not working." I checked:
- ✅ Server process was running
- ✅ Port was listening
- ❌ Database connection pool was exhausted (thanks to unclosed connections from repeated tool calls)
- ❌ Redis connection for caching had timed out
My generic "is the process running" health check told me everything was fine—but everything was not fine.
So MCP needs health checks that actually check all the things that MCP uses. That means:
- Application level: Is the Spring Boot app actually handling requests?
- Database: Can we connect to PostgreSQL and run a simple query?
- External dependencies: Redis, cloud storage, whatever else you're using
- MCP-specific: Can we actually list tools and call a tool?
- Logging: Are we still logging properly (disk isn't full)?
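To make the layering concrete, here is a minimal plain-Java sketch of the idea (not Spring Boot's API; `CompositeCheck` and the check names are hypothetical): every layer registers its own probe, and the overall status is UP only when each one passes. A probe that throws is treated as DOWN.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; Actuator does this aggregation for you.
class CompositeCheck {
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    CompositeCheck register(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    /** Runs every check in registration order; a check that throws counts as DOWN. */
    Map<String, String> run() {
        Map<String, String> report = new LinkedHashMap<>();
        boolean allUp = true;
        for (Map.Entry<String, BooleanSupplier> entry : checks.entrySet()) {
            boolean up;
            try {
                up = entry.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                up = false;
            }
            report.put(entry.getKey(), up ? "UP" : "DOWN");
            allUp &= up;
        }
        // One failing layer is enough to mark the whole server as DOWN.
        report.put("overall", allUp ? "UP" : "DOWN");
        return report;
    }
}
```

With an "app" check that passes and a "database" check that throws, the report shows the app as UP but the overall status as DOWN, which is exactly the case a process-only check misses.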
The Simple Health Check Architecture That Actually Works For Me
I'm not running Kubernetes, I don't have a fancy service mesh. I'm just running this on a cheap VPS with Spring Boot. So I needed something simple that I could set up in an afternoon.
Here's what I ended up with: three different health endpoints for different purposes:
| Endpoint | What it checks | Purpose |
|---|---|---|
| /health/liveness | Just that the app is running | For Kubernetes/Docker to know if it should restart me |
| /health/readiness | All dependencies are up and ready | For load balancers to know if I can accept traffic |
| /health/mcp | Full MCP functionality test | For actual MCP-specific health checking |
And to be honest—this is overkill for most side projects. But if you're actually letting other people use your MCP server, it's worth having.
Let me show you the code. It's actually really straightforward with Spring Boot Actuator.
Step 1: Add the Dependencies
First, you need Actuator. If you're using Spring Boot, it's just two dependencies in your pom.xml:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>
That's it. Spring Boot does the rest.
Step 2: Custom Health Indicators for Everything MCP Needs
Here's where we go beyond the basics. Spring Boot gives you some auto-configured health checks for JDBC, Redis, etc., but we need to add MCP-specific checks.
This is my McpHealthIndicator.java:
// Spring Boot 2.x/3.x package names
import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// ToolRegistry, KnowledgeBaseService and SearchResult are this project's own classes.
@Component
public class McpHealthIndicator implements HealthIndicator {

    private final ToolRegistry toolRegistry;
    private final KnowledgeBaseService knowledgeBase;
    private final ObjectMapper objectMapper;

    public McpHealthIndicator(ToolRegistry toolRegistry,
                              KnowledgeBaseService knowledgeBase,
                              ObjectMapper objectMapper) {
        this.toolRegistry = toolRegistry;
        this.knowledgeBase = knowledgeBase;
        this.objectMapper = objectMapper;
    }

    @Override
    public Health health() {
        Health.Builder builder = Health.up();

        // Check 1: Do we have any tools registered?
        int toolCount = toolRegistry.getAllTools().size();
        builder.withDetail("toolCount", toolCount);
        if (toolCount == 0) {
            // This is bad: no tools means MCP is useless
            return Health.down()
                    .withDetail("error", "No tools registered in MCP server")
                    .withDetail("toolCount", 0)
                    .build();
        }

        // Check 2: Can we actually run a search and serialize the result?
        // This catches JSON serialization issues
        try {
            // Try a simple search query to make sure everything works
            SearchResult result = knowledgeBase.search("test", 1);
            String json = objectMapper.writeValueAsString(result);
            builder.withDetail("searchTest", "success");
            builder.withDetail("resultSize", json.length());
        } catch (Exception e) {
            return Health.down(e)
                    .withDetail("error", "MCP search test failed")
                    .withDetail("toolCount", toolCount)
                    .build();
        }
        return builder.build();
    }
}
That's it. Nothing fancy. It just:
- Makes sure we actually have tools registered
- Actually runs a search through the whole pipeline to make sure it works
- Catches any JSON serialization issues before a client hits them
I added similar custom indicators for PostgreSQL, Redis, and Cloudflare R2 storage:
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

@Component
public class DatabaseHealthIndicator implements HealthIndicator {

    private final JdbcTemplate jdbcTemplate;

    public DatabaseHealthIndicator(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public Health health() {
        try {
            // Actually run a query; don't just check if the connection works
            Integer count = jdbcTemplate.queryForObject("SELECT COUNT(*) FROM papers", Integer.class);
            return Health.up()
                    .withDetail("totalPapers", count)
                    .build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}
The key insight here: don't just check that the connection exists—actually run a query. I've seen too many cases where the connection pool says "connected" but actually executing a query fails because of some weird network issue.
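One related risk with running a real query is that the query itself can hang on the same weird network issue, and then the health endpoint hangs with it. A possible guard, sketched here in plain Java under my own assumptions (`TimedCheck` is a hypothetical helper, not part of Spring Boot), is to run the probe on a separate thread and treat a timeout as DOWN:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration.
class TimedCheck {
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "health-check");
        t.setDaemon(true); // a stuck probe must not keep the JVM alive
        return t;
    });

    /** Returns "UP" if the probe finishes in time without throwing, otherwise "DOWN". */
    static String run(Callable<?> probe, long timeoutMillis) {
        Future<?> future = POOL.submit(probe);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "UP";
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck probe
            return "DOWN";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "DOWN";
        } catch (Exception e) {
            // ExecutionException: the probe itself threw
            return "DOWN";
        }
    }
}
```

Inside a health indicator, the probe would be the query call itself, for example wrapping the `jdbcTemplate.queryForObject(...)` line in a lambda and passing a timeout of a second or two.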
Step 3: Configure Endpoints to Expose What You Need
In your application.properties:
# Expose health endpoints
management.endpoints.web.exposure.include=health,info
# Enable all health details
management.endpoint.health.show-details=always
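# Define our three health groups
management.endpoint.health.group.liveness.include=ping
# "*" already includes every indicator, custom ones too
management.endpoint.health.group.readiness.include=*
# Only the MCP-specific indicators (bean names minus the "HealthIndicator" suffix)
management.endpoint.health.group.mcp.include=mcp,mcpProtocol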
What this gives you:
- GET /actuator/health/liveness → Quick liveness check, just says "we're running"
- GET /actuator/health/readiness → Full readiness check with everything, including MCP
- GET /actuator/health/mcp → The MCP group, for MCP-specific health checking
- GET /actuator/health → Default gives you everything
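If you want to poll these endpoints from a small script or cron job instead of a hosted monitor, something like the following sketch could work. It uses only the JDK's `java.net.http` client (Java 11+). `HealthProbe` is a hypothetical name, and the body check assumes the JSON starts with the top-level `status` field, which is how Actuator typically renders it. Actuator's default status mapping returns HTTP 503 when the aggregate status is DOWN, so the status code alone already carries most of the signal.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical helper for illustration.
class HealthProbe {
    /** Healthy only when the endpoint answers HTTP 200 and the top-level status is UP. */
    static boolean isHealthy(int statusCode, String body) {
        if (statusCode != 200 || body == null) {
            return false;
        }
        // Assumption: the JSON body starts with the top-level "status" field.
        return body.replaceAll("\\s", "").startsWith("{\"status\":\"UP\"");
    }

    /** Fetches the URL and applies isHealthy; any network error counts as unhealthy. */
    static boolean probe(String url) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return isHealthy(response.statusCode(), response.body());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Exception e) {
            return false;
        }
    }
}
```

Because this goes over real HTTP from outside the application, it also exercises the parts of the stack that an in-process check never touches.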
Step 4: Add an Automated MCP End-to-End Test
Here's the thing that really caught me by surprise: even if all your dependencies are up, MCP can still fail because of protocol issues. So I added a really simple end-to-end test that invokes the MCP tools/list handler from within the health check.
Wait, isn't that circular? Calling your own handler from within the health check? Yeah, kinda. But it catches empty tool lists and serialization issues, which are really hard to catch any other way. One caveat: because the check calls the controller method directly instead of going over HTTP, it won't catch transport-level problems like CORS or content-length issues. For those you need a probe that makes a real HTTP request from outside the app.
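import java.util.List;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Component;

// McpServerController and ToolInfo are this project's own classes.
@Component
public class McpProtocolHealthIndicator implements HealthIndicator {

    private final McpServerController controller;
    private final ObjectMapper objectMapper;

    public McpProtocolHealthIndicator(McpServerController controller, ObjectMapper objectMapper) {
        this.controller = controller;
        this.objectMapper = objectMapper;
    }

    @Override
    public Health health() {
        try {
            // Call tools/list directly through the controller (no HTTP involved)
            ResponseEntity<List<ToolInfo>> response = controller.listTools();
            // value() works on both Spring 5 and 6; getStatusCodeValue() is deprecated in 6
            int statusCode = response.getStatusCode().value();
            if (statusCode != HttpStatus.OK.value()) {
                return Health.down()
                        .withDetail("statusCode", statusCode)
                        .withDetail("error", "tools/list returned non-OK status")
                        .build();
            }
            List<ToolInfo> tools = response.getBody();
            if (tools == null || tools.isEmpty()) {
                return Health.down()
                        .withDetail("toolCount", 0)
                        .withDetail("error", "No tools returned from tools/list")
                        .build();
            }
            // Try serializing to JSON to catch any issues; count real bytes, not chars
            byte[] json = objectMapper.writeValueAsBytes(tools);
            return Health.up()
                    .withDetail("toolCount", tools.size())
                    .withDetail("jsonSizeBytes", json.length)
                    .build();
        } catch (Exception e) {
            return Health.down(e)
                    .withDetail("error", "MCP protocol test failed")
                    .build();
        }
    }
}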
I know, this seems excessive. But let me tell you—this caught a bug where I had forgotten to add Jackson annotations to a new field in my ToolInfo class, and serialization was failing. The app was running, all dependencies were up, but every MCP client connection would fail on tools/list. This health check caught it immediately.
Would I have found it eventually testing manually? Yeah, probably. But why wait for a user to tell you when your health check can tell you first?
What I Learned the Hard Way: Common MCP Health Check Pitfalls
Okay, so I've been running this setup for a few months now. Here are the mistakes I made that you can avoid:
Mistake 1: Not Handling Authentication in Health Checks
Here's a funny one: I added authentication to my MCP server (multiple users, each with their own API key), but I forgot to allow health checks without authentication. So my monitoring system was getting 401 Unauthorized from the health endpoint, and it kept restarting the server even though everything was actually fine.
Fix: Exclude health endpoints from authentication in your security config:
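// SecurityFilterChain bean style (Spring Security 5.7+; required in 6,
// where WebSecurityConfigurerAdapter no longer exists)
@Bean
public SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception {
    http
        .authorizeHttpRequests(auth -> auth
            // Only the health endpoints are public, not everything under /actuator
            .requestMatchers("/actuator/health", "/actuator/health/**").permitAll()
            .anyRequest().authenticated()
        );
    return http.build();
}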
I know this seems obvious. I did it anyway. You will too. Probably.
Mistake 2: Health Checks That Are Too Expensive
I initially made my health check do a full search across the entire database. That's overkill. It slowed down every health check request, and if your monitoring is hitting it every 30 seconds, you're just wasting database connections for no reason.
Fix: Keep it simple. Run a SELECT 1 or count a small table. The search test I do only asks for 1 result. It's fast enough—usually under 10ms.
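Another way to keep the cost down, if a check can't be made cheap, is to cache its result for a few seconds so that frequent polling doesn't hit the database every time. Here is a minimal plain-Java sketch of that idea. `CachedCheck` is a hypothetical helper, and the injected clock is only there to make the behavior easy to test.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical helper for illustration.
class CachedCheck<T> {
    private final Supplier<T> delegate;
    private final long ttlMillis;
    private final LongSupplier clock;
    private T lastResult;
    private long lastRunAt;
    private boolean hasResult;

    CachedCheck(Supplier<T> delegate, long ttlMillis, LongSupplier clock) {
        this.delegate = delegate;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Re-runs the expensive check only when the cached result is older than the TTL. */
    synchronized T get() {
        long now = clock.getAsLong();
        if (!hasResult || now - lastRunAt >= ttlMillis) {
            lastResult = delegate.get();
            lastRunAt = now;
            hasResult = true;
        }
        return lastResult;
    }
}
```

In real use the clock would be `System::currentTimeMillis`. The trade-off is that a failure can go unnoticed for up to one TTL, so the TTL should stay well below your alerting interval.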
Mistake 3: Logging Sensitive Data in Health Checks
Health checks return details, and it's easy to let things like API keys or connection strings slip in there. If your health endpoint is publicly accessible (which it often is, because you don't want to authenticate it), don't put sensitive stuff in the details.
I learned this when I accidentally exposed the full database connection URL, password included, in the health details. Oops.
Fix: Only put non-sensitive counts and statuses in details. Never include full API keys, passwords, or user data.
// Good
builder.withDetail("toolCount", toolCount); // okay, just a number
// Bad
builder.withDetail("databaseUrl", dbUrl); // don't do this if it has credentials
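If you really do want something like the database host in the details, one option is to redact credentials before the value goes anywhere near the response. The sketch below is a plain-Java illustration under my own assumptions: `DetailRedactor` is a hypothetical helper, and the two patterns only cover the common `user:pass@host` and `password=...` forms, so treat it as a starting point and not as a complete sanitizer.

```java
// Hypothetical helper for illustration.
class DetailRedactor {
    /** Masks userinfo ("user:pass@") and password-style query parameters in a URL. */
    static String redact(String url) {
        if (url == null) {
            return null;
        }
        return url
                // Everything between "://" and "@" is treated as credentials.
                .replaceAll("(?<=://)[^/@\\s]+@", "***@")
                // password=... or pwd=... up to the next separator.
                .replaceAll("(?i)(password|pwd)=[^&;\\s]*", "$1=***");
    }
}
```

For example, `jdbc:postgresql://app:s3cret@db.internal:5432/papers` becomes `jdbc:postgresql://***@db.internal:5432/papers`.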
Mistake 4: Forgetting to Check Disk Space
This one bit me. My server was running, database was up, everything looked good—but the disk was full. Any time I tried to write a log message, it failed. Once the disk is full, weird things start happening.
The good news—Spring Boot Actuator actually checks this automatically for you! If you have actuator on the classpath, it includes a disk space health check by default. You don't have to do anything. Isn't that nice?
You can configure the threshold if you want:
# 100MB minimum free (a .properties file has no inline comments, so this needs its own line)
management.health.diskspace.threshold=104857600
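For intuition, the check behind that threshold boils down to comparing the usable bytes on a path against a minimum. Here is a minimal plain-Java sketch of that comparison. `DiskSpaceCheck` is a hypothetical name, and this is an illustration of the idea, not Actuator's actual implementation.

```java
import java.io.File;

// Hypothetical helper for illustration.
class DiskSpaceCheck {
    /** 100 MB expressed in bytes: 100 * 1024 * 1024 = 104857600. */
    static final long HUNDRED_MB = 100L * 1024 * 1024;

    /** True when the filesystem holding the path has at least thresholdBytes usable. */
    static boolean hasEnoughSpace(File path, long thresholdBytes) {
        return path.getUsableSpace() >= thresholdBytes;
    }
}
```

Calling `DiskSpaceCheck.hasEnoughSpace(new File("."), DiskSpaceCheck.HUNDRED_MB)` checks the working directory's filesystem against the same 100 MB limit as the property above.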
Pros and Cons: Is This Worth Doing For Your MCP Server?
Let me be real with you—adding all this takes a couple of hours. Is it actually worth it?
Pros ✅
- Catches problems early: I've had three separate outages that the health check caught before any users complained. That's worth the couple hours of setup right there.
- Docker/Kubernetes loves it: If you're running in any container environment, liveness/readiness checks are expected anyway. This just fits in.
- Documentation bonus: Having health checks forces you to actually think through what could fail. I found a couple of architectural gaps just going through this exercise.
- It's really simple: You don't need a fancy APM tool. This is all vanilla Spring Boot, no extra services needed.
- Catches MCP-specific issues: Things like empty tool lists, serialization problems—these don't show up in generic health checks.
Cons ❌
- Adds a little code: You end up with 3-4 extra classes. Not much, but it's still code you have to maintain.
- Slower health checks: If you do full checks every time, it's not instantaneous. But honestly—who cares? Monitoring doesn't hit it that often.
- Still won't catch everything: If the LLM hallucinates a bad tool call in flight, your health check can't help with that. But it catches the infrastructure issues, which is 80% of the problems anyway.
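That last case is better handled at request time than in a health check. One possible shape, sketched in plain Java with hypothetical names (`ToolDispatcher` and the string-in, string-out handler signature are assumptions for illustration, not how any MCP SDK defines tools), is to look up the requested tool name before dispatching and return a cheap, explicit error for anything unknown:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical dispatcher for illustration.
class ToolDispatcher {
    private final Map<String, Function<String, String>> tools = new LinkedHashMap<>();

    ToolDispatcher register(String name, Function<String, String> handler) {
        tools.put(name, handler);
        return this;
    }

    /** Returns the tool's output, or an error message for an unknown tool name. */
    String call(String name, String argument) {
        Function<String, String> handler = tools.get(name);
        if (handler == null) {
            // Fail fast and list the valid names so the client can correct itself.
            return "error: unknown tool '" + name + "'; available: " + tools.keySet();
        }
        return handler.apply(argument);
    }
}
```

Rejecting the call before any database or cache work happens keeps a hallucinated tool name from consuming real resources.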
When Should You Do This? When Should You Skip It?
Do this if:
- Other people are actually using your MCP server
- You're running it in production 24/7
- You have external dependencies like databases, Redis, cloud storage
- You want to sleep better at night knowing you'll get alerted if something breaks
Skip it if:
- It's just a personal project you only use when you're sitting at your computer
- You're still prototyping and everything changes every day
- You don't mind restarting it manually when it breaks
For me—my Papers MCP server is something I use every single day with Claude Desktop. I want it to just work. This was definitely worth the half-day I spent putting it together.
Wrapping Up: It Doesn't Have to Be Perfect, It Just Has to Work
Here's what I want you to take away from this:
MCP servers are just like any other server—they fail. But because of the proxy nature of MCP, they have more failure points than your average API. A little bit of intentional health checking goes a really long way.
You don't need Prometheus, Grafana, Elasticsearch, APM, all that fancy stuff. You don't need an SRE team. If you're building a side project like me, you can get 80% of the value with 20% of the effort using just Spring Boot Actuator and a couple of custom health indicators.
My MCP server has been running for six months now. Before I added health checks, I had an outage about once a week that I didn't catch for hours. After adding health checks (and setting up simple alerts with UptimeRobot), I get notified within five minutes when something goes wrong. That's a huge quality of life improvement.
The full code for everything I showed you is in the Papers repository on GitHub if you want to go look at it. Everything I've talked about here is in there, complete with working examples you can steal for your own MCP server.
Your Turn
Have you built an MCP server? Have you had outages that could've been caught with better health checking? What's your go-to simple monitoring setup for side projects? I'd love to hear your experiences in the comments—especially if you've found simpler ways to do this.
I'm always looking for ways to make my MCP infrastructure more reliable without adding complexity. What works for you?