Chris Burns for Stacklok

The Next Big Observability Gap for Kubernetes is MCP Servers

Kubernetes has become the de facto operating system for the cloud, empowering organizations to scale and orchestrate their workloads with unprecedented ease. But with this power comes a new set of challenges. As you break down monoliths into microservices and deploy hundreds or thousands of pods, each workload can become a potential black box. The very agility that makes Kubernetes so valuable also makes observability a monumental task.

Prometheus and OpenTelemetry (OTel) emerged as the go-to tools for making these workloads observable, yet not every system plays by the same rules. A growing example is Model Context Protocol (MCP) servers, which often don't expose metrics at all, creating blind spots in even the most sophisticated monitoring stacks. This gap highlights the next great challenge in Kubernetes observability.

In this post, we'll explore the broader observability landscape, why OTel and Prometheus work best in tandem, and why MCP highlights the gaps that still remain in our quest for comprehensive system visibility.

The Black Box Problem in Kubernetes

In a monolithic application, you often have a single point of failure and a single application log to comb through. In a Kubernetes environment, that single application is now a distributed system of dozens or hundreds of services, each with its own logs, resource consumption patterns, and unique failure modes. A single user request might traverse multiple services, making it nearly impossible to trace without a robust observability strategy.

This black box problem is amplified by several characteristics of Kubernetes environments. Ephemeral workloads mean that debugging information disappears when pods are terminated. Service mesh complexity introduces additional network hops and failure modes that aren't immediately visible. Multi-tenant clusters create resource contention that can be difficult to attribute to specific workloads. Dynamic scaling means that performance baselines are constantly shifting as replicas come and go.

Traditional monitoring approaches that rely on host-level metrics and application logs quickly become inadequate. Pods come and go, workloads scale dynamically, and ephemeral containers rarely leave behind a trail. You need telemetry that can follow requests across service boundaries, survive pod restarts, and provide insights into the distributed system as a whole rather than just individual components.

Metrics, Logs, and Traces: A Quick Refresher

Before we dive into the tools, let's quickly recap the three pillars of observability:

Metrics are numerical measurements collected over time, such as CPU utilization, request rate, error count, and response latency. They're perfect for dashboards and alerting but don't provide detail about individual requests.

Logs are timestamped event records that provide detailed context about what happened at specific points in time. They're invaluable for debugging and auditing, but correlating them across services can be challenging.

Traces record the journey of individual requests through distributed systems, connecting operations across service boundaries to identify bottlenecks and dependencies.

Each pillar serves different purposes, and effective observability strategies combine all three.

Prometheus: The Metrics Powerhouse

Prometheus has earned its place as the standard for collecting and storing metrics in Kubernetes environments, and for good reason. Its pull-based model is a perfect fit for a dynamic, ephemeral world where services appear and disappear constantly. The Prometheus server periodically scrapes metric endpoints (usually /metrics) exposed by applications, making it naturally aligned with Kubernetes' service discovery mechanisms.
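
To make the scrape model concrete, here's a minimal sketch of an application exposing a /metrics endpoint with the Python prometheus_client library. The metric names, labels, and port are illustrative assumptions; in a cluster, a Service plus a ServiceMonitor (or annotation-based discovery) would point Prometheus at this endpoint.

```python
# A minimal sketch of exposing Prometheus metrics from a Python service.
# Metric names, labels, and the port are illustrative assumptions.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter(
    "app_requests_total", "Total requests handled", ["path", "status"]
)
LATENCY = Histogram(
    "app_request_duration_seconds", "Request latency in seconds", ["path"]
)


def handle_request(path: str) -> None:
    # Record latency and outcome so Prometheus can later compute
    # rates, error ratios, and latency percentiles from the scrape.
    with LATENCY.labels(path=path).time():
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
    REQUESTS.labels(path=path, status="200").inc()


if __name__ == "__main__":
    start_http_server(8000)  # serves the /metrics endpoint on port 8000
    while True:
        handle_request("/api/items")
```

Once those series are scraped, PromQL expressions like rate(app_requests_total[5m]) or histogram_quantile(0.95, rate(app_request_duration_seconds_bucket[5m])) turn them into the rates and percentiles that power dashboards and alerts.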

What makes Prometheus particularly powerful in Kubernetes is its integration with platform concepts like the Prometheus Operator's ServiceMonitor and PodMonitor custom resources, which automatically discover new scrape targets as services scale. The query language, PromQL, excels at time series analysis, making it straightforward to calculate rates, percentiles, and aggregations across multiple dimensions.

However, Prometheus has limitations. It's primarily designed for metrics, so correlating with logs and traces requires additional tooling. The pull model can also miss short-lived processes or workloads that can't expose HTTP endpoints.

OpenTelemetry: The Unified Framework

OpenTelemetry (OTel) provides a single, vendor-neutral framework for collecting metrics, logs, and traces. The OpenTelemetry SDK lets developers instrument code once and export telemetry to multiple backends - traces to Jaeger, metrics to Prometheus, logs to Elasticsearch.
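
As a rough sketch of that "instrument once, export anywhere" idea, here is a minimal tracing setup with the OpenTelemetry Python SDK. The service name and Collector endpoint are assumptions for illustration, and the exporter import path comes from the opentelemetry-exporter-otlp-proto-grpc package, so verify the details against the SDK version you run.

```python
# A sketch of "instrument once, export anywhere" with the OpenTelemetry
# Python SDK. The service name and Collector endpoint are assumptions.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Describe the workload once; every span inherits these resource attributes.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))

# Ship spans to an OpenTelemetry Collector over OTLP; the Collector decides
# whether they land in Jaeger, Tempo, or another backend - no code changes here.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)


def charge_card(order_id: str) -> None:
    # Each unit of work becomes a span; nested calls become child spans.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        # ... call the payment provider here ...
```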

Auto-instrumentation capabilities mean many applications gain observability without code changes. The OpenTelemetry Collector serves as a central hub for processing telemetry data, typically deployed in Kubernetes as a node-level DaemonSet, a central gateway Deployment, or both.

What makes OTel particularly valuable is its ability to correlate telemetry across all three pillars. Trace spans can include logs as events, and metrics can be tagged with trace IDs, enabling workflows like jumping from dashboard alerts to specific failing traces.
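
A small sketch of that correlation, assuming a TracerProvider is already configured as in the previous snippet: the active span's trace ID is stamped onto a structured log line so an alert, a log search, and a trace viewer all converge on the same request.

```python
# Correlating pillars: stamp log records with the active trace ID so an
# alert, a log search, and a trace all point at the same request.
# Assumes a TracerProvider is configured as in the previous sketch.
import json
import logging

from opentelemetry import trace

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("payments")
tracer = trace.get_tracer(__name__)


def refund(order_id: str) -> None:
    with tracer.start_as_current_span("refund") as span:
        ctx = span.get_span_context()
        # The hex-formatted IDs match what trace backends display, so this
        # log line can be joined directly against the span it belongs to.
        logger.info(json.dumps({
            "event": "refund_started",
            "order_id": order_id,
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
        }))
```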

Why They're Better Together

Prometheus and OTel work best together, each excelling where the other has limitations. Standardized instrumentation through OTel means developers can use one toolset regardless of telemetry type, while exposing metrics in formats Prometheus can scrape.
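
One hedged illustration of that hand-off: metrics recorded through the OpenTelemetry SDK but exposed in the text format Prometheus already scrapes. The reader class comes from the opentelemetry-exporter-prometheus package, and the instrument names are made up for this example.

```python
# A sketch of the hand-off: metrics recorded with the OpenTelemetry SDK,
# exposed in the text format Prometheus already scrapes. The reader comes
# from the opentelemetry-exporter-prometheus package; names are illustrative.
from opentelemetry import metrics
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.metrics import MeterProvider
from prometheus_client import start_http_server

# Bridge OTel instruments into the prometheus_client registry served below.
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("checkout")
orders = meter.create_counter("orders_processed", description="Orders processed")

# The same /metrics endpoint a ServiceMonitor would discover.
start_http_server(9464)

orders.add(1, {"region": "us-east-1"})
```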

Complementary strengths provide both high-level operational views (Prometheus) and detailed diagnostic capabilities (OTel). Shared infrastructure reduces overhead - the same Kubernetes service discovery works for both tools. Correlated troubleshooting becomes possible when metrics alerts include trace context, letting teams drill down from aggregate problems to specific failing requests.

MCP as the Case Study for Observability Gaps

Model Context Protocol (MCP) servers exemplify this challenge. These lightweight applications provide context and tools to AI systems, handling requests and managing state - all behaviors that should be observable - yet they typically operate as complete black boxes.

The core problem: Many MCP servers prioritize minimal dependencies and fast startup over telemetry. They often lack /metrics endpoints, don't log structured data, and can't be traced by standard tools. The protocol itself doesn't mandate observability standards, and established patterns don't exist yet.

Real-world impact: When AI systems behave unexpectedly, teams can't determine which MCP servers were involved. When response times increase, there's no visibility into whether bottlenecks are in the AI model, MCP server, or external systems. Even with comprehensive Prometheus and OTel stacks, MCP servers remain invisible, creating significant blind spots in otherwise well-monitored systems.
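
To make the gap tangible, here is a hypothetical sketch of what an MCP server author would have to add by hand today: a wrapper that turns each tool invocation into a span, a counter, and a latency observation. The decorator, tool, and metric names are inventions for illustration, not part of any MCP SDK.

```python
# A hypothetical sketch of manual MCP observability: wrap each tool handler
# so every invocation yields a span, an outcome counter, and a latency
# observation. The decorator, tool, and metric names are illustrative only.
import functools
import time

from opentelemetry import trace
from prometheus_client import Counter, Histogram, start_http_server

tracer = trace.get_tracer("mcp.server")
TOOL_CALLS = Counter(
    "mcp_tool_calls_total", "MCP tool invocations", ["tool", "outcome"]
)
TOOL_LATENCY = Histogram(
    "mcp_tool_duration_seconds", "MCP tool call duration", ["tool"]
)


def observe_tool(func):
    """Wrap a tool handler with a span, an outcome counter, and a histogram."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        with tracer.start_as_current_span(f"mcp.tool/{func.__name__}"):
            try:
                result = func(*args, **kwargs)
                TOOL_CALLS.labels(tool=func.__name__, outcome="ok").inc()
                return result
            except Exception:
                TOOL_CALLS.labels(tool=func.__name__, outcome="error").inc()
                raise
            finally:
                TOOL_LATENCY.labels(tool=func.__name__).observe(
                    time.perf_counter() - start
                )
    return wrapper


@observe_tool
def fetch_context(query: str) -> str:
    # Stand-in for a real MCP tool that fetches context for a model.
    return f"results for {query}"


start_http_server(9100)  # the /metrics endpoint most MCP servers never expose
```

Every MCP server author would have to write and maintain something like this by hand, which is exactly why most servers ship without it.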

Teaser: In Our Next Post, We'll Explore ToolHive

These observability challenges with MCP servers aren't insurmountable, but they require solutions that bridge the gap between emerging technologies and established monitoring practices.

In our next post, we'll explore ToolHive, which aims to fill part of this specific gap by providing MCP tool usage data that most servers don't expose natively. We'll look at how it integrates with existing OTel and Prometheus infrastructure to make MCP servers observable within your current monitoring stack, and examine practical approaches for implementing observability patterns with other emerging technologies in Kubernetes environments.
