Santiago Salazar Pavajeau

Fixing Prometheus namespace monitoring

The Setup

I’m running a Kubernetes cluster with Prometheus Operator and a pretty standard discovery pattern:

  • ServiceMonitor objects
  • Namespace-based scrape selection
  • Grafana dashboards and Alertmanager rules downstream

The key piece is that Prometheus only discovers ServiceMonitors in namespaces that carry a specific label.

Here’s a simplified version of the config. The label-based namespace selection lives on the Prometheus resource’s serviceMonitorNamespaceSelector (a ServiceMonitor’s own namespaceSelector only takes namespace names, not labels), while the ServiceMonitor itself just selects the service:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  serviceMonitorSelector: {}
  # only look for ServiceMonitors in namespaces labeled monitoring=enabled
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-servicemonitor
  namespace: payments-prod
spec:
  selector:
    matchLabels:
      app: payments-api
  endpoints:
    - port: metrics
      interval: 30s

At this point, namespace labels are effectively part of the monitoring config.
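
For reference, the namespace side of that contract is nothing more than a label. A minimal manifest, assuming the namespace is managed declaratively rather than created ad hoc, looks like this:

apiVersion: v1
kind: Namespace
metadata:
  name: payments-prod
  labels:
    monitoring: enabled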


Breaking It

For the exercise, I flipped the namespace label from monitoring=enabled to monitoring=disabled.
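
The flip itself is a one-liner; --overwrite is needed because the key already has a value:

kubectl label namespace payments-prod monitoring=disabled --overwrite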

Nothing crashed.

Pods kept running.

Metrics quietly disappeared.

Exactly the kind of failure that’s easy to miss.


Detecting What’s Wrong

First thing I checked was whether Prometheus itself was unhealthy:

kubectl get pods -n monitoring
kubectl logs prometheus-k8s-0 -n monitoring

Everything looked fine.

Next, I checked whether Prometheus was scraping anything from the namespace:

up{namespace="payments-prod"}

No results.

That tells me Prometheus isn’t scraping targets — not that the app is down.
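
Another quick sanity check is to group the up series by namespace and see what is being scraped at all:

count by (namespace) (up)

payments-prod simply doesn’t appear in that result, which is a different signal from appearing with a value of 0.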


Finding the Cause

Next step was checking the namespace itself:

kubectl get namespace payments-prod --show-labels

Output looked like this:

monitoring=disabled

Since Prometheus is configured with:

serviceMonitorNamespaceSelector:
  matchLabels:
    monitoring: enabled

Prometheus was doing exactly what it was configured to do.

From Prometheus’ perspective, nothing was broken.
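
The targets page tells the same story. With kube-prometheus the web service is typically called prometheus-k8s (an assumption, but it matches the pod name above), so something like this gets you to the UI:

kubectl port-forward -n monitoring svc/prometheus-k8s 9090:9090

On http://localhost:9090/targets there’s no target group for the namespace at all, rather than a failing one.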


Fixing It

Restoring the label was enough:

kubectl label namespace payments-prod monitoring=enabled --overwrite

Metrics came back within a scrape interval.

Quick check to confirm freshness:

time() - timestamp(up{namespace="payments-prod"})

Everything was visible again.


Why This Is a Sneaky Failure

The interesting part of this exercise:

  • No alerts fired
  • Dashboards were empty, not red
  • Prometheus treats missing data as “nothing to evaluate”

If something real had broken during this window, I wouldn’t have known.

This is how you end up trusting “green” systems that you’re actually blind to.
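
To make that concrete: a normal threshold rule only fires when its expression returns series. Something like this hypothetical rule (not one from my setup) goes quiet rather than firing the moment the series vanish:

- alert: PaymentsApiDown
  expr: up{namespace="payments-prod"} == 0
  for: 5m

With the namespace no longer scraped, up{namespace="payments-prod"} returns nothing, the comparison returns nothing, and the alert never even goes pending.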


Hardening Things Afterward

Alerting on Missing Metrics

I added an explicit alert to catch telemetry loss:

- alert: NamespaceMetricsMissing
  expr: absent(up{namespace="payments-prod"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No metrics from payments-prod"
    description: "Prometheus is scraping zero targets in this namespace."

Silence should page you.
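
One caveat: absent() needs a hard-coded selector, so this pattern means one rule per namespace you care about. It’s also worth validating the rule file before shipping it; promtool handles that (rules.yaml here is just an assumed filename):

promtool check rules rules.yaml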


Top comments (1)

OnlineProxy

Your approach is rock solid, but.. namespace labels have basically become critical infrastructure now, except nobody's really treating them that way since they just kinda live outside your monitoring setup and can vanish without a trace. So yeah, the absent() alert is a great start, but if you really wanna lock this down, I'd throw more layers at it. Slap an admission controller in there using Kyverno or OPA to actually block label deletions at the API server level, stick your namespace definitions in GitOps so every change gets audited and someone has to sign off on it, and add some meta-monitoring that watches prometheus_target_scrape_pools_active_targets versus desired_targets to catch blind spots even when individual namespaces aren't screaming into your alerts. Basically, you're flipping this from a configuration drift oops problem into a policy enforcement thing - which is way easier to handle when you're scaling up