Leo Marsh

MCP Observability - You Can't Fix What You Can't See

Your MCP server just went down. Again. The last log entry? "Server started successfully" from 3 hours ago. Sound familiar?

After optimizing performance in Part 6, let's tackle the next critical challenge: actually knowing what's happening inside your MCP servers. Because flying blind in production isn't a strategy – it's a disaster waiting to happen.

The MCP Observability Challenge

MCP servers are deceptively complex beasts. They're handling:

  • Multiple concurrent client connections
  • Tool executions with varying latencies
  • Token consumption that directly impacts costs
  • State management across sessions
  • External API calls that can fail silently

Yet most deployments have logging that looks like this:

```
Server started on port 3000
Connected: client_abc123
Disconnected: client_abc123
```

That's not observability. That's prayer.

The Three Pillars of MCP Observability

1. Structured Logging: Beyond console.log

Stop treating logs as an afterthought. Structure them for both humans and machines:

```javascript
// Bad: String concatenation nightmare
console.log("Tool " + toolName + " took " + duration + "ms");

// Good: Structured, searchable, analyzable
logger.info({
  event: "tool_execution",
  tool: toolName,
  duration_ms: duration,
  client_id: clientId,
  session_id: sessionId,
  token_usage: {
    prompt: promptTokens,
    completion: completionTokens
  },
  timestamp: new Date().toISOString()
});
```

Key events to log (a minimal sketch follows this list):

  • Connection lifecycle (connect/disconnect/error)
  • Tool discovery and registration
  • Execution start/end with duration
  • Token usage per request
  • Error conditions with full context
  • Resource limits hit
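
Here's a minimal sketch of a couple of these events using Pino and a per-request correlation ID. The event names, field names, and handler shape are conventions chosen for the example, not part of any MCP SDK:

```javascript
// Sketch: structured connection + tool-execution logging with Pino.
// Event and field names are illustrative conventions.
const pino = require('pino');
const { randomUUID } = require('crypto');

const logger = pino({ level: 'info' });

function onClientConnect(clientId) {
  // A child logger per connection stamps client_id onto every line
  const connLogger = logger.child({ client_id: clientId });
  connLogger.info({ event: 'client_connected' });
  return connLogger;
}

async function handleToolCall(connLogger, toolName, runTool) {
  // request_id lets you correlate every log line for one call
  const requestId = randomUUID();
  const start = Date.now();
  try {
    const result = await runTool();
    connLogger.info({
      event: 'tool_execution',
      request_id: requestId,
      tool: toolName,
      duration_ms: Date.now() - start
    });
    return result;
  } catch (err) {
    connLogger.error({
      event: 'tool_error',
      request_id: requestId,
      tool: toolName,
      duration_ms: Date.now() - start,
      error: err.message
    });
    throw err;
  }
}
```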

2. Metrics That Matter

Not all metrics are created equal. Focus on what impacts users and costs:

Response Time Metrics:

  • p50, p95, p99 latencies per tool
  • Time to first byte (TTFB)
  • End-to-end request duration

Resource Metrics:

  • Active connections
  • Memory usage trends
  • Token consumption rate
  • Cache hit/miss ratios

Business Metrics:

  • Tool usage frequency
  • Error rates by tool type
  • Cost per operation
  • User session lengths

Here's a simple metrics collector:

```javascript
const metrics = {
  toolExecutions: new Map(),

  recordExecution(tool, duration, tokens) {
    if (!this.toolExecutions.has(tool)) {
      this.toolExecutions.set(tool, {
        count: 0,
        totalDuration: 0,
        totalTokens: 0,
        errors: 0
      });
    }

    const stats = this.toolExecutions.get(tool);
    stats.count++;
    stats.totalDuration += duration;
    stats.totalTokens += tokens;

    // Emit to your metrics backend
    metricsClient.gauge(`mcp.tool.duration.${tool}`, duration);
    metricsClient.increment(`mcp.tool.executions.${tool}`);
    metricsClient.gauge(`mcp.tokens.used.${tool}`, tokens);
  },

  recordError(tool) {
    // Track failures so error rates per tool can be derived
    const stats = this.toolExecutions.get(tool);
    if (stats) stats.errors++;
    metricsClient.increment(`mcp.tool.errors.${tool}`);
  }
};
```
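
The collector above only keeps totals, so it can't answer the p50/p95/p99 questions from the metrics list. One way to get there without a full metrics backend is a bounded buffer of recent durations per tool; the buffer size and helper names below are arbitrary choices for the sketch:

```javascript
// Sketch: bounded per-tool sample buffers for percentile latencies.
// MAX_SAMPLES and the helper names are arbitrary.
const MAX_SAMPLES = 1000;
const samples = new Map(); // tool name -> recent durations in ms

function recordDuration(tool, durationMs) {
  if (!samples.has(tool)) samples.set(tool, []);
  const buf = samples.get(tool);
  buf.push(durationMs);
  if (buf.length > MAX_SAMPLES) buf.shift(); // drop the oldest sample
}

function percentile(tool, p) {
  const buf = samples.get(tool);
  if (!buf || buf.length === 0) return null;
  const sorted = [...buf].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.min(idx, sorted.length - 1)];
}

function latencySnapshot(tool) {
  return {
    tool,
    p50: percentile(tool, 50),
    p95: percentile(tool, 95),
    p99: percentile(tool, 99)
  };
}
// e.g. log latencySnapshot('database_query') once a minute via your structured logger
```

In practice a backend with histogram support (Prometheus histograms, StatsD timers) does this aggregation for you; the buffer just shows the idea.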

3. Distributed Tracing: Following the Breadcrumbs

When your AI agent calls Tool A, which calls Service B, which queries Database C, you need distributed tracing:

```javascript
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('mcp-server');

async function executeToolWithTracing(tool, params) {
  const span = tracer.startSpan(`mcp.tool.${tool.name}`);

  try {
    span.setAttributes({
      'tool.name': tool.name,
      'client.id': params.clientId,
      'session.id': params.sessionId
    });

    const result = await tool.execute(params);

    span.setAttributes({
      'tool.result.size': JSON.stringify(result).length,
      'tool.success': true
    });

    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}
```
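
The span above only covers work inside the MCP server. For the trace to continue into Service B, the context has to travel with the outgoing call. Here's a minimal sketch with the OpenTelemetry API, assuming the SDK and a W3C trace-context propagator are already registered and the calling span is active (for example via tracer.startActiveSpan); the URL and fetch call are placeholders:

```javascript
// Sketch: propagate the active trace context to a downstream service
// so its spans join the same trace. The URL is a placeholder.
const { context, propagation } = require('@opentelemetry/api');

async function callServiceB(payload) {
  const headers = { 'content-type': 'application/json' };
  // Writes traceparent/tracestate headers for the currently active span
  propagation.inject(context.active(), headers);

  const response = await fetch('https://service-b.internal/query', {
    method: 'POST',
    headers,
    body: JSON.stringify(payload)
  });
  return response.json();
}
```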

Real-World Debugging Scenarios

Scenario 1: The Silent Performance Degradation

Symptom: Users complain about slow responses, but your server looks fine.

Observable answer: Metrics showed p99 latency increased 300% for the database_query tool, but only for queries with >1000 tokens. The culprit? Unindexed vector searches as context grew.

Scenario 2: The Mysterious Token Explosion

Symptom: Your OpenAI bill tripled overnight.

Observable answer: Trace data revealed a recursive tool chain where summarize called fetch_context which called summarize again. Each iteration doubled token usage.
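
One way to make that kind of loop jump out of the trace data is to carry the chain of tool names on each span and flag re-entry. Here's a sketch reusing the tracer from the earlier example; the attribute names and dispatcher shape are illustrative, not an MCP API:

```javascript
// Sketch: record the tool call chain on each span so recursive
// chains (summarize -> fetch_context -> summarize) become visible.
async function dispatchTool(tool, params, callChain = []) {
  const span = tracer.startSpan(`mcp.tool.${tool.name}`);
  const chain = [...callChain, tool.name];

  span.setAttributes({
    'mcp.call_chain': chain.join(' -> '),
    'mcp.call_depth': chain.length
  });

  // Cheap guard: mark the span when a tool re-enters its own chain
  if (callChain.includes(tool.name)) {
    span.addEvent('recursive_tool_chain_detected');
  }

  try {
    // Nested tool calls receive the extended chain
    return await tool.execute({ ...params, callChain: chain });
  } finally {
    span.end();
  }
}
```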

Scenario 3: The Intermittent Connection Drops

Symptom: Clients randomly disconnect after 2-3 minutes.

Observable answer: Correlating connection logs with memory metrics revealed a memory leak in session state management that triggered OOM kills at exactly 2 GB of usage.
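
That correlation only works if memory is being recorded in the first place. A minimal sketch, assuming a metricsClient like the one in the collector above and an activeConnections set your server already maintains (both are placeholders):

```javascript
// Sketch: periodically emit process memory and connection counts so
// disconnects can be correlated with memory growth.
// metricsClient and activeConnections are placeholders.
setInterval(() => {
  const mem = process.memoryUsage();
  metricsClient.gauge('mcp.process.heap_used_bytes', mem.heapUsed);
  metricsClient.gauge('mcp.process.rss_bytes', mem.rss);
  metricsClient.gauge('mcp.connections.active', activeConnections.size);
}, 15_000);
```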

Your Observability Checklist

Start with these basics – you can implement them in an afternoon:

[ ] Structured Logging (30 minutes)

  • Add a proper logger (Winston, Pino, Bunyan)
  • Log all tool executions with context
  • Include request IDs for correlation

[ ] Basic Metrics (45 minutes)

  • Track execution counts and durations
  • Monitor active connections
  • Record error rates

[ ] Error Tracking (30 minutes)

  • Capture full error context
  • Group similar errors
  • Alert on error rate spikes

[ ] Health Endpoints (15 minutes)

  • /health - Is the server running?
  • /ready - Can it handle requests?
  • /metrics - Prometheus-compatible metrics (see the sketch below)
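
Here's a minimal sketch of those three endpoints with Express and prom-client; checkDependencies is a placeholder for whatever readiness logic (database reachable, upstream APIs healthy) fits your server:

```javascript
// Sketch: /health, /ready and /metrics with Express + prom-client.
// checkDependencies() is a placeholder for your own readiness logic.
const express = require('express');
const client = require('prom-client');

const app = express();
client.collectDefaultMetrics(); // CPU, memory, event loop lag, etc.

app.get('/health', (req, res) => {
  // Liveness: the process is up and the event loop responds
  res.json({ status: 'ok', uptime_s: process.uptime() });
});

app.get('/ready', async (req, res) => {
  // Readiness: can this instance actually serve requests right now?
  const ready = await checkDependencies(); // placeholder
  res.status(ready ? 200 : 503).json({ ready });
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3001);
```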

The Observability Maturity Ladder

Level 1: Flying Blind

  • console.log debugging
  • No metrics
  • "Check if it's working" monitoring

Level 2: Basic Visibility

  • Structured logs
  • Simple metrics
  • Error alerting

Level 3: Proactive Monitoring

  • Distributed tracing
  • Custom dashboards
  • Anomaly detection

Level 4: Full Observability

  • Predictive analytics
  • Cost attribution
  • Automated remediation

Most teams are at Level 1. Getting to Level 2 takes a day. The ROI? Massive.

Key Takeaways

  1. You can't fix what you can't see – Invest in observability before you need it
  2. Structure your logs – Make them searchable and analyzable
  3. Measure what matters – Focus on user-impacting and cost-driving metrics
  4. Trace the full journey – Understand tool chains and dependencies
  5. Start simple – Basic observability beats no observability

Remember: Every production issue you can't immediately diagnose is a sign of missing observability. The best time to add monitoring was before deployment. The second best time is now.


Next in the series: MCP Tool Composition - Building complex workflows without chaos

Want observability without the setup? Storm MCP provides built-in monitoring for all hosted servers:

  • Real-time logs and metrics dashboard
  • Error alerts before users notice issues
  • Performance insights to optimize costs
  • Zero configuration needed

Check it out at stormmcp.ai - because debugging production issues at 2 AM without proper logs isn't heroic – it's preventable.


What observability challenges are you facing with MCP? Share your debugging war stories below.
