DevHelm

Posted on Jun 8 • Originally published at devhelm.io

MCP Server Monitoring: How to Keep AI Agent Infrastructure Reliable

#ai #guides #reliability

Model Context Protocol (MCP) servers give AI agents access to tools — database queries, file operations, API calls, code execution. When your MCP server goes down, every agent that depends on it stops being useful. If Cursor can't reach your MCP server, your AI coding assistant loses access to your codebase tools. If Claude Desktop can't reach it, your automation workflows break.

We run an MCP server in production that gives AI agents access to DevHelm's monitoring capabilities — creating monitors, checking status, managing incidents. When that server is unhealthy, our users' agent workflows degrade silently. The agent doesn't crash; it just can't call the tools it needs, and the user gets unhelpful responses without understanding why.

This guide covers how to monitor MCP servers based on what we've learned running one. The failure modes are specific to the MCP protocol, and most traditional monitoring approaches miss them.

What can go wrong

MCP servers fail in ways that are distinct from typical REST APIs:

The server is up but tools are broken

An MCP server that responds to health checks but returns errors on tool calls is the most common failure mode. The server process is running, the TCP port is open, but the underlying tool implementations are failing — a database connection pool is exhausted, an API key has expired, a dependency service is down.

A simple "is the port open" check passes. A check that actually calls a tool with a known-good input catches the real failure.

Slow tool execution degrades agent performance

MCP tool calls have latency budgets imposed by the AI agent's architecture. If a tool call takes 30 seconds, the agent is blocked for 30 seconds — and the user is waiting. Unlike a web API where users see a loading spinner, a slow MCP tool call manifests as the agent appearing to "think" for too long before producing output.

Track p95 tool call latency per tool. Set alerts when latency exceeds the agent's patience threshold (typically 10–30 seconds depending on the agent framework).
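As a rough illustration of the per-tool p95 alert described above, here is a minimal Python sketch. It assumes you already collect (tool name, latency in seconds) pairs somewhere; the function names `p95` and `slow_tools` and the 10-second default threshold are illustrative choices, not part of any MCP SDK.

```python
from collections import defaultdict


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) with integer math
    return ordered[rank - 1]


def slow_tools(calls, threshold_s=10.0):
    """Return {tool: p95 latency} for tools whose p95 exceeds threshold_s.

    calls is an iterable of (tool_name, latency_seconds) pairs.
    """
    by_tool = defaultdict(list)
    for tool, latency in calls:
        by_tool[tool].append(latency)

    flagged = {}
    for tool, latencies in by_tool.items():
        value = p95(latencies)
        if value > threshold_s:
            flagged[tool] = value
    return flagged
```

Grouping by tool before computing the percentile matters: a single server-wide p95 would let a fast, frequently called tool hide a slow, rarely called one.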

Authentication failures are silent

Most MCP server implementations require an API token or session credential. When the credential expires or is revoked, tool calls fail with authentication errors. The agent handles this by telling the user "I couldn't access that tool" — but neither the agent nor the user knows why. The failure looks identical to "the tool doesn't exist" from the agent's perspective.

Monitor authentication success rate separately from tool success rate. A spike in auth failures calls for a different remediation path than a spike in tool execution errors.
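One way to keep the two rates apart is sketched below in Python. The `CallRecord` shape is hypothetical; it assumes your server can record, for each call, whether the credential check passed and whether the tool then returned a result.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    tool: str
    authenticated: bool  # did the credential check pass?
    succeeded: bool      # did the tool return a result? (False when auth failed)


def success_rates(records):
    """Compute auth and tool success rates as separate numbers.

    The tool success rate is measured only over authenticated calls, so an
    expired token shows up as an auth problem rather than a tool problem.
    """
    total = len(records)
    if total == 0:
        return {"auth_success_rate": None, "tool_success_rate": None}

    authed = [r for r in records if r.authenticated]
    tool_rate = sum(r.succeeded for r in authed) / len(authed) if authed else None
    return {
        "auth_success_rate": len(authed) / total,
        "tool_success_rate": tool_rate,
    }
```

With this split, a revoked credential drives the auth rate toward zero while the tool rate stays unchanged or becomes undefined, which points at credential rotation rather than at the tool's dependencies.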

Schema drift between server and client

When you update your MCP server to rename or remove tools, rename parameters, or change return types, existing agent configurations may send requests that no longer match the server's schema. The server rejects the request, the agent fails to call the tool, and the user gets a degraded experience.

This is analogous to API versioning in REST, but MCP tooling is younger and versioning practices are less established. Monitor schema-related errors (invalid parameters, unknown tools) as a distinct error class.
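Schema drift can also be caught before it reaches users. The Python sketch below compares what a client is expected to send against the tool list a server advertises. It assumes the advertised tools are dicts with `name` and `inputSchema` keys, as in an MCP `tools/list` result; the `expected` mapping and the function name are hypothetical. It treats any parameter missing from `properties` as drift, which is stricter than JSON Schema itself.

```python
def schema_drift(expected, advertised_tools):
    """List mismatches between client expectations and the server's tools.

    expected: {tool_name: set of parameter names the client sends}
    advertised_tools: list of dicts with "name" and optional "inputSchema"
    """
    advertised = {t["name"]: t.get("inputSchema", {}) for t in advertised_tools}
    problems = []
    for tool, params in sorted(expected.items()):
        schema = advertised.get(tool)
        if schema is None:
            problems.append(f"unknown tool: {tool}")
            continue
        known = set(schema.get("properties", {}))
        required = set(schema.get("required", []))
        for name in sorted(params - known):
            problems.append(f"{tool}: unknown parameter {name}")
        for name in sorted(required - params):
            problems.append(f"{tool}: missing required parameter {name}")
    return problems
```

Running a check like this in CI against a staging server, with `expected` derived from the agent configurations you know about, would surface renames before deployment rather than in production error logs.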

What to monitor

1. Health endpoint availability

The minimum viable monitor: check that your MCP server responds on its configured port. For HTTP-based MCP servers (Streamable HTTP or the older HTTP+SSE transport), this is a standard HTTP health check. For stdio-based servers, monitoring is harder: you need a wrapper process that exercises the server.

# For an HTTP/SSE MCP server running on port 8080
curl -sf http://mcp-server:8080/health || echo "MCP server is down"

Set up this check at app.devhelm.io with a 30-second interval. This catches process crashes, container restarts, and network issues.
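For the stdio case, one possible wrapper is sketched below in Python. It assumes the server speaks newline-delimited JSON-RPC on stdin and stdout, answers an `initialize` request, and exits once stdin is closed. The function name `probe_stdio_server`, the client name, and the protocol version string are illustrative assumptions; adjust them to whatever your server expects.

```python
import json
import subprocess


def probe_stdio_server(command, timeout_s=10.0):
    """Start a stdio MCP server, send `initialize`, return True on a result."""
    request = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2024-11-05",  # assumption: match your server
            "capabilities": {},
            "clientInfo": {"name": "health-probe", "version": "0.0.1"},
        },
    }
    try:
        proc = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            text=True,
        )
    except OSError:
        return False  # binary missing or not executable

    try:
        # communicate() writes the request, closes stdin, and waits for exit.
        stdout, _ = proc.communicate(json.dumps(request) + "\n", timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()
        return False

    for line in stdout.splitlines():
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON noise on stdout
        if isinstance(message, dict) and message.get("id") == 1 and "result" in message:
            return True
    return False
```

A scheduler or cron job could run this probe and expose the outcome over HTTP, so the stdio server can be watched by the same external checks as an HTTP one. Servers that keep running after stdin closes will hit the timeout and be reported as unhealthy, so this sketch only suits servers that exit on EOF.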

2. Tool-level synthetic checks

A health endpoint check proves the server is running. A synthetic tool call proves the tools work. Create a lightweight "canary" tool or use an existing read-only tool with a known-good input:

# Call a known-good tool and verify the response.
# --max-time bounds the check; jq -e exits non-zero if .result is missing, null, or false.
curl -sf --max-time 10 -X POST http://mcp-server:8080/tools/list_monitors \
  -H "Authorization: Bearer $MCP_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"limit": 1}' | jq -e '.result'

This validates the full path: authentication, tool resolution, execution, response serialization. Run it every 60 seconds.
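The same check can be sketched in Python when you want the failure reason as well as a pass or fail. This is a minimal example, not our production checker: it reuses the endpoint shape from the curl command above, assumes the response is a JSON object with a `result` key, and takes the HTTP call as a parameter (`post`) so it can be replaced in tests.

```python
import json
import time
import urllib.error
import urllib.request


def http_post_json(url, token, payload, timeout_s):
    """POST JSON with a bearer token; return (status_code, body_text)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status, response.read().decode()
    except urllib.error.HTTPError as err:
        return err.code, err.read().decode()


def synthetic_check(url, token, payload, post=http_post_json, timeout_s=10.0):
    """Call one read-only tool and report whether each stage of the path worked."""
    started = time.monotonic()
    try:
        status, body = post(url, token, payload, timeout_s)
    except OSError as err:  # refused connections, DNS failures, timeouts
        return {"ok": False, "reason": f"unreachable: {err}"}
    latency_s = time.monotonic() - started

    if status in (401, 403):
        return {"ok": False, "reason": "auth", "latency_s": latency_s}
    if status != 200:
        return {"ok": False, "reason": f"http {status}", "latency_s": latency_s}
    try:
        result = json.loads(body).get("result")
    except (ValueError, AttributeError):
        return {"ok": False, "reason": "malformed response", "latency_s": latency_s}
    if result is None:
        return {"ok": False, "reason": "missing result", "latency_s": latency_s}
    return {"ok": True, "reason": "ok", "latency_s": latency_s}
```

Returning a reason string keeps authentication failures distinguishable from execution failures in whatever alerting you attach to the check.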

3. Response time per tool

Track latency at the tool level, not just the server level. A list_monitors call that takes 50ms and a create_monitor call that takes 5 seconds have different performance profiles. When the agent switches from one tool to another and the interaction feels slower, per-tool latency metrics point you to the specific bottleneck.

If you've instrumented your MCP server with OpenTelemetry, each tool call produces a span with timing data. The OTel GenAI semantic conventions include execute_tool spans with gen_ai.tool.name — see our agent observability guide for the instrumentation pattern.

4. Error rate by category

Categorize errors into:

Infrastructure errors — connection refused, timeout, OOM
Authentication errors — invalid token, expired credential
Tool execution errors — the tool ran but failed (database error, external API failure)
Schema errors — invalid parameters, unknown tool name
Rate limit errors — too many requests

Each category has a different remediation path. Infrastructure errors need ops attention. Auth errors need credential rotation. Tool execution errors need investigation into the underlying dependency. Schema errors suggest a client-server version mismatch. Rate limit errors point to a client that needs backoff or a quota that needs raising.
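The categories above can be encoded as a small classifier so that every failed call is tagged at the point where it is logged. The Python sketch below is one possible mapping, not a standard: it assumes failures surface as a Python exception, an HTTP status code, or a JSON-RPC error code, and it treats the JSON-RPC codes -32601 (method not found) and -32602 (invalid params) as schema errors.

```python
from enum import Enum


class ErrorCategory(Enum):
    INFRASTRUCTURE = "infrastructure"
    AUTHENTICATION = "authentication"
    TOOL_EXECUTION = "tool_execution"
    SCHEMA = "schema"
    RATE_LIMIT = "rate_limit"


# Remediation hints taken from the categories above; the rate-limit hint
# is an assumption added for completeness.
REMEDIATION = {
    ErrorCategory.INFRASTRUCTURE: "ops attention",
    ErrorCategory.AUTHENTICATION: "rotate or renew the credential",
    ErrorCategory.TOOL_EXECUTION: "investigate the underlying dependency",
    ErrorCategory.SCHEMA: "check for a client-server version mismatch",
    ErrorCategory.RATE_LIMIT: "back off or raise the quota (assumption)",
}


def categorize(exc=None, http_status=None, jsonrpc_code=None):
    """Map one failed call to a category; anything unrecognized is a tool error."""
    if isinstance(exc, (ConnectionError, TimeoutError, MemoryError)):
        return ErrorCategory.INFRASTRUCTURE
    if http_status in (401, 403):
        return ErrorCategory.AUTHENTICATION
    if http_status == 429:
        return ErrorCategory.RATE_LIMIT
    if jsonrpc_code in (-32601, -32602):
        return ErrorCategory.SCHEMA
    return ErrorCategory.TOOL_EXECUTION
```

Emitting the category as a metric label or log field lets you alert on each class separately instead of on a single aggregate error rate.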

5. Dependency health

Your MCP server's tools depend on external services. Our MCP server calls the DevHelm API — if the API is down, every tool call fails even though the MCP server itself is healthy. Monitor the services your MCP server depends on as first-class monitoring targets.

This is the same dependency monitoring pattern that applies to any service, but it's especially important for MCP servers because the failure is invisible to the end user. When a REST API's dependency fails, the user sees an error page. When an MCP server's dependency fails, the user sees an AI agent that gives unhelpful answers.
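To make the dependency point concrete, here is a small Python sketch of how check results might be combined into a single triage verdict. The inputs are assumed to come from monitors you already run; the function name and the verdict strings are hypothetical.

```python
def triage(mcp_healthy, tool_check_ok, dependencies):
    """Decide where a failure most likely lives.

    mcp_healthy: result of the health endpoint check
    tool_check_ok: result of the synthetic tool call
    dependencies: {dependency_name: True if its own monitor is passing}
    """
    if not mcp_healthy:
        return "mcp_server_down"
    down = sorted(name for name, ok in dependencies.items() if not ok)
    if down:
        return "dependency_down: " + ", ".join(down)
    if not tool_check_ok:
        # Server is up and dependencies pass, so the fault is in the tool layer.
        return "tool_failure_in_mcp_server"
    return "healthy"
```

For example, `triage(True, False, {"api": False, "database": True})` returns `"dependency_down: api"`, which tells the on-call engineer to look at the API before reading MCP server logs.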

Architecture for production MCP servers

A production MCP server deployment should include:

Health endpoint — a simple /health route that returns 200 if the server is ready to accept tool calls
Structured logging — JSON logs with tool name, duration, result status, and error details for every tool call (see Winston vs Pino for Node.js options)
OTel instrumentation — spans for each tool call, with attributes following the GenAI semantic conventions
External monitoring — health checks and synthetic tool calls from outside your infrastructure
Alerting — notifications when the server is down, when tool latency exceeds thresholds, or when error rates spike

The external monitoring layer is critical because MCP servers are typically accessed by AI agents running on users' machines (Cursor, Claude Desktop). You can't rely on client-side error reporting — the agent may retry silently, degrade gracefully, or simply not report the failure.
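As an illustration of the structured logging item in the list above, the following Python sketch wraps a tool implementation so that every call emits one JSON log line with the tool name, duration, result status, and error details. The decorator name `logged_tool` and the logger name `mcp.tools` are assumptions; a Node.js server would do the equivalent with its own logging library.

```python
import functools
import json
import logging
import time

logger = logging.getLogger("mcp.tools")


def logged_tool(name):
    """Decorator that logs one JSON line per call to the wrapped tool."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            started = time.monotonic()
            record = {"tool": name, "status": "ok", "error": None}
            try:
                return fn(*args, **kwargs)
            except Exception as err:
                record["status"] = "error"
                record["error"] = f"{type(err).__name__}: {err}"
                raise  # the caller still sees the original failure
            finally:
                elapsed_ms = (time.monotonic() - started) * 1000
                record["duration_ms"] = round(elapsed_ms, 2)
                logger.info(json.dumps(record))
        return wrapper
    return decorator
```

Applied as `@logged_tool("list_monitors")` above a tool function, this produces lines that a log pipeline can aggregate into the per-tool latency and error-rate metrics discussed earlier. The logger needs a handler and a level of INFO or lower for the lines to appear.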

Monitoring your MCP server with DevHelm

Set up monitoring for your MCP server in three steps:

Step 1: Create a health check monitor. Monitor your MCP server's health endpoint with a 30-second check interval. This catches availability issues — process crashes, OOM kills, network partitions.

Step 2: Create a synthetic tool-call monitor. Use an HTTP monitor that POSTs to a read-only tool endpoint with valid authentication. Assert on status code 200 and a non-empty response body. This catches tool-level failures that a simple health check misses.

Step 3: Monitor your dependencies. Add monitors for every external service your MCP server depends on: your API, your database, any third-party services. When a tool call fails, the dependency monitors tell you immediately whether the failure is in your MCP server or in something it depends on. This cuts MTTR because triage shrinks from "debug the entire stack" to "check the dependency dashboard."

Get started at app.devhelm.io — the health check monitor takes 60 seconds to set up, and you'll catch the next MCP server outage before your users notice their agents stopped working.
