Your MCP server just went down. Again. The last log entry? "Server started successfully" from 3 hours ago. Sound familiar?
After optimizing performance in Part 6, let's tackle the next critical challenge: actually knowing what's happening inside your MCP servers. Because flying blind in production isn't a strategy – it's a disaster waiting to happen.
The MCP Observability Challenge
MCP servers are deceptively complex beasts. They're handling:
- Multiple concurrent client connections
- Tool executions with varying latencies
- Token consumption that directly impacts costs
- State management across sessions
- External API calls that can fail silently
Yet most deployments have logging that looks like this:
```
Server started on port 3000
Connected: client_abc123
Disconnected: client_abc123
```
That's not observability. That's prayer.
The Three Pillars of MCP Observability
1. Structured Logging: Beyond console.log
Stop treating logs as an afterthought. Structure them for both humans and machines:
```javascript
// Bad: String concatenation nightmare
console.log("Tool " + toolName + " took " + duration + "ms");

// Good: Structured, searchable, analyzable
logger.info({
  event: "tool_execution",
  tool: toolName,
  duration_ms: duration,
  client_id: clientId,
  session_id: sessionId,
  token_usage: {
    prompt: promptTokens,
    completion: completionTokens
  },
  timestamp: new Date().toISOString()
});
```
Key events to log (a wiring sketch follows this list):
- Connection lifecycle (connect/disconnect/error)
- Tool discovery and registration
- Execution start/end with duration
- Token usage per request
- Error conditions with full context
- Resource limits hit
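Here's what wiring those events up might look like with Pino, one of the loggers suggested in the checklist further down. The event names, the request ID field, and the handler functions are illustrative choices for this sketch, not anything the MCP SDK mandates:

```javascript
// Minimal sketch using Pino (assumed installed via `npm install pino`).
// Event names and field choices are illustrative, not an MCP convention.
const pino = require('pino');
const crypto = require('crypto');

const logger = pino({ level: process.env.LOG_LEVEL || 'info' });

// One child logger per connection so every line carries the same context.
function onClientConnect(clientId) {
  const requestId = crypto.randomUUID();
  const log = logger.child({ client_id: clientId, request_id: requestId });
  log.info({ event: 'client_connected' });
  return log;
}

function onToolExecution(log, tool, durationMs, tokens) {
  log.info({
    event: 'tool_execution',
    tool,
    duration_ms: durationMs,
    token_usage: tokens
  });
}

function onClientDisconnect(log, reason) {
  log.info({ event: 'client_disconnected', reason });
}
```

Because every line from a connection shares the same `client_id` and `request_id`, you can reconstruct a full session from grep alone.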
2. Metrics That Matter
Not all metrics are created equal. Focus on what impacts users and costs:
Response Time Metrics:
- p50, p95, p99 latencies per tool
- Time to first byte (TTFB)
- End-to-end request duration
Resource Metrics:
- Active connections
- Memory usage trends
- Token consumption rate
- Cache hit/miss ratios
Business Metrics:
- Tool usage frequency
- Error rates by tool type
- Cost per operation
- User session lengths
Here's a simple metrics collector:
```javascript
// `metricsClient` is assumed to be a StatsD-style client you already have.
const metrics = {
  toolExecutions: new Map(),

  recordExecution(tool, duration, tokens, success = true) {
    if (!this.toolExecutions.has(tool)) {
      this.toolExecutions.set(tool, {
        count: 0,
        totalDuration: 0,
        totalTokens: 0,
        errors: 0
      });
    }

    const stats = this.toolExecutions.get(tool);
    stats.count++;
    stats.totalDuration += duration;
    stats.totalTokens += tokens;
    if (!success) stats.errors++;

    // Emit to your metrics backend
    metricsClient.gauge(`mcp.tool.duration.${tool}`, duration);
    metricsClient.increment(`mcp.tool.executions.${tool}`);
    metricsClient.gauge(`mcp.tokens.used.${tool}`, tokens);
  }
};
```
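Turning raw durations into the p50/p95/p99 numbers above requires a histogram somewhere. A naive in-memory sketch for illustration; in production you'd lean on your metrics backend's histograms instead:

```javascript
// Naive percentile calculation over an in-memory sample of recent durations.
// Fine for a sketch; use your metrics backend's histograms in production.
const samples = new Map(); // tool name -> array of recent durations (ms)

function recordDuration(tool, durationMs, maxSamples = 1000) {
  const arr = samples.get(tool) || [];
  arr.push(durationMs);
  if (arr.length > maxSamples) arr.shift(); // keep a sliding window
  samples.set(tool, arr);
}

function percentile(tool, p) {
  const arr = [...(samples.get(tool) || [])].sort((a, b) => a - b);
  if (arr.length === 0) return null;
  const idx = Math.min(arr.length - 1, Math.ceil((p / 100) * arr.length) - 1);
  return arr[Math.max(0, idx)];
}

// Example: percentile('database_query', 99) -> p99 latency in ms
```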
3. Distributed Tracing: Following the Breadcrumbs
When your AI agent calls Tool A, which calls Service B, which queries Database C, you need distributed tracing:
```javascript
// Uses the OpenTelemetry API package (@opentelemetry/api)
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('mcp-server');

async function executeToolWithTracing(tool, params) {
  const span = tracer.startSpan(`mcp.tool.${tool.name}`);
  try {
    span.setAttributes({
      'tool.name': tool.name,
      'client.id': params.clientId,
      'session.id': params.sessionId
    });

    const result = await tool.execute(params);

    span.setAttributes({
      'tool.result.size': JSON.stringify(result).length,
      'tool.success': true
    });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}
```
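The span code above only records; nothing leaves the process until an exporter is configured. A minimal setup sketch, assuming the `@opentelemetry/sdk-node` and `@opentelemetry/exporter-trace-otlp-http` packages and an OTLP-compatible collector (Jaeger, Tempo, etc.) listening on the default local port:

```javascript
// Sketch only: assumes the OpenTelemetry Node SDK and OTLP HTTP exporter
// are installed, and a collector is reachable at localhost:4318.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'mcp-server',
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces'
  })
});

sdk.start(); // call this before the MCP server begins handling requests
```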
Real-World Debugging Scenarios
Scenario 1: The Silent Performance Degradation
Symptom: Users complain about slow responses, but your server looks fine.
Observable answer: Metrics showed p99 latency increased 300% for the `database_query` tool, but only for queries with >1000 tokens. The culprit? Unindexed vector searches as context grew.
Scenario 2: The Mysterious Token Explosion
Symptom: Your OpenAI bill tripled overnight.
Observable answer: Trace data revealed a recursive tool chain where `summarize` called `fetch_context`, which called `summarize` again. Each iteration doubled token usage.
Scenario 3: The Intermittent Connection Drops
Symptom: Clients randomly disconnect after 2-3 minutes.
Observable answer: Connection logs correlated with memory metrics showed a memory leak in session state management, triggering OOM kills at exactly 2GB usage.
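That correlation only works if memory is being sampled continuously. A tiny sketch of the kind of sampler that makes it possible; the interval, field names, and `activeConnections` counter are illustrative assumptions:

```javascript
// Periodically emit process memory so it can be correlated with connection logs.
// `logger` is the structured logger from earlier; `activeConnections` is an
// assumed counter maintained by your connection handlers.
setInterval(() => {
  const { rss, heapUsed, heapTotal } = process.memoryUsage();
  logger.info({
    event: 'memory_sample',
    rss_mb: Math.round(rss / 1024 / 1024),
    heap_used_mb: Math.round(heapUsed / 1024 / 1024),
    heap_total_mb: Math.round(heapTotal / 1024 / 1024),
    active_connections: activeConnections
  });
}, 30 * 1000); // every 30 seconds
```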
Your Observability Checklist
Start with these basics – you can implement them in an afternoon:
[ ] Structured Logging (30 minutes)
- Add a proper logger (Winston, Pino, Bunyan)
- Log all tool executions with context
- Include request IDs for correlation
[ ] Basic Metrics (45 minutes)
- Track execution counts and durations
- Monitor active connections
- Record error rates
[ ] Error Tracking (30 minutes)
- Capture full error context
- Group similar errors
- Alert on error rate spikes
[ ] Health Endpoints (15 minutes)
- `/health` – Is the server running?
- `/ready` – Can it handle requests?
- `/metrics` – Prometheus-compatible metrics (see the sketch after this checklist)
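Here's a rough sketch of those three endpoints using Express and prom-client. Both libraries are assumptions on my part; any HTTP framework and Prometheus client will do:

```javascript
// Sketch only: assumes `express` and `prom-client` are installed.
const express = require('express');
const promClient = require('prom-client');

promClient.collectDefaultMetrics(); // process CPU, memory, event loop lag, etc.

const app = express();
let ready = false; // flip to true once the MCP server has finished startup

app.get('/health', (req, res) => res.status(200).send('ok'));

app.get('/ready', (req, res) =>
  ready ? res.status(200).send('ready') : res.status(503).send('starting')
);

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.send(await promClient.register.metrics());
});

app.listen(9090, () => {
  ready = true;
});
```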
The Observability Maturity Ladder
Level 1: Flying Blind
- `console.log` debugging
- No metrics
- "Check if it's working" monitoring
Level 2: Basic Visibility
- Structured logs
- Simple metrics
- Error alerting
Level 3: Proactive Monitoring
- Distributed tracing
- Custom dashboards
- Anomaly detection
Level 4: Full Observability
- Predictive analytics
- Cost attribution
- Automated remediation
Most teams are at Level 1. Getting to Level 2 takes a day. The ROI? Massive.
Key Takeaways
- You can't fix what you can't see – Invest in observability before you need it
- Structure your logs – Make them searchable and analyzable
- Measure what matters – Focus on user-impacting and cost-driving metrics
- Trace the full journey – Understand tool chains and dependencies
- Start simple – Basic observability beats no observability
Remember: Every production issue you can't immediately diagnose is a sign of missing observability. The best time to add monitoring was before deployment. The second best time is now.
Next in the series: MCP Tool Composition - Building complex workflows without chaos
Want observability without the setup? Storm MCP provides built-in monitoring for all hosted servers:
- Real-time logs and metrics dashboard
- Error alerts before users notice issues
- Performance insights to optimize costs
- Zero configuration needed
Check it out at stormmcp.ai - because debugging production issues at 2 AM without proper logs isn't heroic – it's preventable.
What observability challenges are you facing with MCP? Share your debugging war stories below.