Leo Marsh

MCP Observability - You Can't Fix What You Can't See

Your MCP server just went down. Again. The last log entry? "Server started successfully" from 3 hours ago. Sound familiar?

After optimizing performance in Part 6, let's tackle the next critical challenge: actually knowing what's happening inside your MCP servers. Because flying blind in production isn't a strategy – it's a disaster waiting to happen.

The MCP Observability Challenge

MCP servers are deceptively complex beasts. They're handling:

  • Multiple concurrent client connections
  • Tool executions with varying latencies
  • Token consumption that directly impacts costs
  • State management across sessions
  • External API calls that can fail silently

Yet most deployments have logging that looks like this:

```
Server started on port 3000
Connected: client_abc123
Disconnected: client_abc123
```

That's not observability. That's prayer.

The Three Pillars of MCP Observability

1. Structured Logging: Beyond console.log

Stop treating logs as an afterthought. Structure them for both humans and machines:

```javascript
// Bad: String concatenation nightmare
console.log("Tool " + toolName + " took " + duration + "ms");

// Good: Structured, searchable, analyzable
logger.info({
  event: "tool_execution",
  tool: toolName,
  duration_ms: duration,
  client_id: clientId,
  session_id: sessionId,
  token_usage: {
    prompt: promptTokens,
    completion: completionTokens
  },
  timestamp: new Date().toISOString()
});
```

Key events to log (a minimal sketch follows this list):

  • Connection lifecycle (connect/disconnect/error)
  • Tool discovery and registration
  • Execution start/end with duration
  • Token usage per request
  • Error conditions with full context
  • Resource limits hit
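
Here's a minimal sketch of a couple of these events using Pino and a per-request correlation ID. The event names, field names, and handler shape are conventions chosen for the example, not part of any MCP SDK:

```javascript
// Sketch: structured connection + tool-execution logging with Pino.
// Event and field names are illustrative conventions.
const pino = require('pino');
const { randomUUID } = require('crypto');

const logger = pino({ level: 'info' });

function onClientConnect(clientId) {
  // A child logger per connection stamps client_id onto every line
  const connLogger = logger.child({ client_id: clientId });
  connLogger.info({ event: 'client_connected' });
  return connLogger;
}

async function handleToolCall(connLogger, toolName, runTool) {
  // request_id lets you correlate every log line for one call
  const requestId = randomUUID();
  const start = Date.now();
  try {
    const result = await runTool();
    connLogger.info({
      event: 'tool_execution',
      request_id: requestId,
      tool: toolName,
      duration_ms: Date.now() - start
    });
    return result;
  } catch (err) {
    connLogger.error({
      event: 'tool_error',
      request_id: requestId,
      tool: toolName,
      duration_ms: Date.now() - start,
      error: err.message
    });
    throw err;
  }
}
```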

2. Metrics That Matter

Not all metrics are created equal. Focus on what impacts users and costs:

Response Time Metrics:

  • p50, p95, p99 latencies per tool
  • Time to first byte (TTFB)
  • End-to-end request duration

Resource Metrics:

  • Active connections
  • Memory usage trends
  • Token consumption rate
  • Cache hit/miss ratios

Business Metrics:

  • Tool usage frequency
  • Error rates by tool type
  • Cost per operation
  • User session lengths

Here's a simple metrics collector:

```javascript
const metrics = {
  toolExecutions: new Map(),

  recordExecution(tool, duration, tokens) {
    if (!this.toolExecutions.has(tool)) {
      this.toolExecutions.set(tool, {
        count: 0,
        totalDuration: 0,
        totalTokens: 0,
        errors: 0
      });
    }

    const stats = this.toolExecutions.get(tool);
    stats.count++;
    stats.totalDuration += duration;
    stats.totalTokens += tokens;

    // Emit to your metrics backend
    metricsClient.gauge(`mcp.tool.duration.${tool}`, duration);
    metricsClient.increment(`mcp.tool.executions.${tool}`);
    metricsClient.gauge(`mcp.tokens.used.${tool}`, tokens);
  },

  recordError(tool) {
    // Track failures so error rates per tool can be derived
    const stats = this.toolExecutions.get(tool);
    if (stats) stats.errors++;
    metricsClient.increment(`mcp.tool.errors.${tool}`);
  }
};
```
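
The collector above only keeps totals, so it can't answer the p50/p95/p99 questions from the metrics list. One way to get there without a full metrics backend is a bounded buffer of recent durations per tool; the buffer size and helper names below are arbitrary choices for the sketch:

```javascript
// Sketch: bounded per-tool sample buffers for percentile latencies.
// MAX_SAMPLES and the helper names are arbitrary.
const MAX_SAMPLES = 1000;
const samples = new Map(); // tool name -> recent durations in ms

function recordDuration(tool, durationMs) {
  if (!samples.has(tool)) samples.set(tool, []);
  const buf = samples.get(tool);
  buf.push(durationMs);
  if (buf.length > MAX_SAMPLES) buf.shift(); // drop the oldest sample
}

function percentile(tool, p) {
  const buf = samples.get(tool);
  if (!buf || buf.length === 0) return null;
  const sorted = [...buf].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.min(idx, sorted.length - 1)];
}

function latencySnapshot(tool) {
  return {
    tool,
    p50: percentile(tool, 50),
    p95: percentile(tool, 95),
    p99: percentile(tool, 99)
  };
}
// e.g. log latencySnapshot('database_query') once a minute via your structured logger
```

In practice a backend with histogram support (Prometheus histograms, StatsD timers) does this aggregation for you; the buffer just shows the idea.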

3. Distributed Tracing: Following the Breadcrumbs

When your AI agent calls Tool A, which calls Service B, which queries Database C, you need distributed tracing:

```javascript
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('mcp-server');

async function executeToolWithTracing(tool, params) {
  const span = tracer.startSpan(`mcp.tool.${tool.name}`);

  try {
    span.setAttributes({
      'tool.name': tool.name,
      'client.id': params.clientId,
      'session.id': params.sessionId
    });

    const result = await tool.execute(params);

    span.setAttributes({
      'tool.result.size': JSON.stringify(result).length,
      'tool.success': true
    });

    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}
```
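
The span above only covers work inside the MCP server. For the trace to continue into Service B, the context has to travel with the outgoing call. Here's a minimal sketch with the OpenTelemetry API, assuming the SDK and a W3C trace-context propagator are already registered and the calling span is active (for example via tracer.startActiveSpan); the URL and fetch call are placeholders:

```javascript
// Sketch: propagate the active trace context to a downstream service
// so its spans join the same trace. The URL is a placeholder.
const { context, propagation } = require('@opentelemetry/api');

async function callServiceB(payload) {
  const headers = { 'content-type': 'application/json' };
  // Writes traceparent/tracestate headers for the currently active span
  propagation.inject(context.active(), headers);

  const response = await fetch('https://service-b.internal/query', {
    method: 'POST',
    headers,
    body: JSON.stringify(payload)
  });
  return response.json();
}
```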

Real-World Debugging Scenarios

Scenario 1: The Silent Performance Degradation

Symptom: Users complain about slow responses, but your server looks fine.

Observable answer: Metrics showed p99 latency increased 300% for the database_query tool, but only for queries with >1000 tokens. The culprit? Unindexed vector searches as context grew.

Scenario 2: The Mysterious Token Explosion

Symptom: Your OpenAI bill tripled overnight.

Observable answer: Trace data revealed a recursive tool chain where summarize called fetch_context which called summarize again. Each iteration doubled token usage.
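
One way to make that kind of loop jump out of the trace data is to carry the chain of tool names on each span and flag re-entry. Here's a sketch reusing the tracer from the earlier example; the attribute names and dispatcher shape are illustrative, not an MCP API:

```javascript
// Sketch: record the tool call chain on each span so recursive
// chains (summarize -> fetch_context -> summarize) become visible.
async function dispatchTool(tool, params, callChain = []) {
  const span = tracer.startSpan(`mcp.tool.${tool.name}`);
  const chain = [...callChain, tool.name];

  span.setAttributes({
    'mcp.call_chain': chain.join(' -> '),
    'mcp.call_depth': chain.length
  });

  // Cheap guard: mark the span when a tool re-enters its own chain
  if (callChain.includes(tool.name)) {
    span.addEvent('recursive_tool_chain_detected');
  }

  try {
    // Nested tool calls receive the extended chain
    return await tool.execute({ ...params, callChain: chain });
  } finally {
    span.end();
  }
}
```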

Scenario 3: The Intermittent Connection Drops

Symptom: Clients randomly disconnect after 2-3 minutes.

Observable answer: Correlating connection logs with memory metrics revealed a memory leak in session state management that triggered OOM kills at exactly 2 GB of usage.
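
That correlation only works if memory is being recorded in the first place. A minimal sketch, assuming a metricsClient like the one in the collector above and an activeConnections set your server already maintains (both are placeholders):

```javascript
// Sketch: periodically emit process memory and connection counts so
// disconnects can be correlated with memory growth.
// metricsClient and activeConnections are placeholders.
setInterval(() => {
  const mem = process.memoryUsage();
  metricsClient.gauge('mcp.process.heap_used_bytes', mem.heapUsed);
  metricsClient.gauge('mcp.process.rss_bytes', mem.rss);
  metricsClient.gauge('mcp.connections.active', activeConnections.size);
}, 15_000);
```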

Your Observability Checklist

Start with these basics – you can implement them in an afternoon:

[ ] Structured Logging (30 minutes)

  • Add a proper logger (Winston, Pino, Bunyan)
  • Log all tool executions with context
  • Include request IDs for correlation

[ ] Basic Metrics (45 minutes)

  • Track execution counts and durations
  • Monitor active connections
  • Record error rates

[ ] Error Tracking (30 minutes)

  • Capture full error context
  • Group similar errors
  • Alert on error rate spikes

[ ] Health Endpoints (15 minutes)

  • /health - Is the server running?
  • /ready - Can it handle requests?
  • /metrics - Prometheus-compatible metrics (see the sketch below)
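
Here's a minimal sketch of those three endpoints with Express and prom-client; checkDependencies is a placeholder for whatever readiness logic (database reachable, upstream APIs healthy) fits your server:

```javascript
// Sketch: /health, /ready and /metrics with Express + prom-client.
// checkDependencies() is a placeholder for your own readiness logic.
const express = require('express');
const client = require('prom-client');

const app = express();
client.collectDefaultMetrics(); // CPU, memory, event loop lag, etc.

app.get('/health', (req, res) => {
  // Liveness: the process is up and the event loop responds
  res.json({ status: 'ok', uptime_s: process.uptime() });
});

app.get('/ready', async (req, res) => {
  // Readiness: can this instance actually serve requests right now?
  const ready = await checkDependencies(); // placeholder
  res.status(ready ? 200 : 503).json({ ready });
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3001);
```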

The Observability Maturity Ladder

Level 1: Flying Blind

  • console.log debugging
  • No metrics
  • "Check if it's working" monitoring

Level 2: Basic Visibility

  • Structured logs
  • Simple metrics
  • Error alerting

Level 3: Proactive Monitoring

  • Distributed tracing
  • Custom dashboards
  • Anomaly detection

Level 4: Full Observability

  • Predictive analytics
  • Cost attribution
  • Automated remediation

Most teams are at Level 1. Getting to Level 2 takes a day. The ROI? Massive.

Key Takeaways

  1. You can't fix what you can't see – Invest in observability before you need it
  2. Structure your logs – Make them searchable and analyzable
  3. Measure what matters – Focus on user-impacting and cost-driving metrics
  4. Trace the full journey – Understand tool chains and dependencies
  5. Start simple – Basic observability beats no observability

Remember: Every production issue you can't immediately diagnose is a sign of missing observability. The best time to add monitoring was before deployment. The second best time is now.


Next in the series: MCP Tool Composition - Building complex workflows without chaos

Want observability without the setup? Storm MCP provides built-in monitoring for all hosted servers:

  • Real-time logs and metrics dashboard
  • Error alerts before users notice issues
  • Performance insights to optimize costs
  • Zero configuration needed

Check it out at stormmcp.ai - because debugging production issues at 2 AM without proper logs isn't heroic – it's preventable.


What observability challenges are you facing with MCP? Share your debugging war stories below.
