
Santiago Salazar Pavajeau

Fixing Prometheus namespace monitoring

The Setup

I’m running a Kubernetes cluster with Prometheus Operator and a pretty standard discovery pattern:

  • ServiceMonitor objects
  • Namespace-based scrape selection
  • Grafana dashboards and Alertmanager rules downstream

The key piece is that Prometheus only picks up ServiceMonitors, and therefore scrape targets, from namespaces carrying a specific label.

Here’s a simplified version of the ServiceMonitor, which is deployed alongside the app:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-servicemonitor
  namespace: payments-prod
spec:
  selector:
    matchLabels:
      app: payments-api
  endpoints:
    - port: metrics
      interval: 30s
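One detail worth being precise about: a ServiceMonitor’s namespaceSelector only supports any and matchNames, not label selectors, so the label-based gate lives on the Prometheus resource itself. Here’s a minimal sketch of the relevant part of that spec; the resource name and the monitoring namespace are assumptions that line up with the prometheus-k8s-0 pod used later in this post:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  # Only look for ServiceMonitor objects in namespaces labeled monitoring=enabled
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled
  # Within those namespaces, pick up every ServiceMonitor
  serviceMonitorSelector: {}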

At this point, namespace labels are effectively part of the monitoring config.


Breaking It

For the exercise, I flipped the namespace’s monitoring=enabled label to monitoring=disabled.
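Concretely, it was a one-line change (mirroring the --overwrite fix further down):

kubectl label namespace payments-prod monitoring=disabled --overwrite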

Nothing crashed.

Pods kept running.

Metrics quietly disappeared.

Exactly the kind of failure that’s easy to miss.


Detecting What’s Wrong

First thing I checked was whether Prometheus itself was unhealthy:

kubectl get pods -n monitoring
kubectl logs prometheus-k8s-0 -n monitoring

Everything looked fine.

Next, I checked whether Prometheus was scraping anything from the namespace:

up{namespace="payments-prod"}

No results.

That tells me Prometheus isn’t scraping targets — not that the app is down.
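To separate a selector problem from a broader outage, it also helps to see which namespaces are still being scraped at all:

count by (namespace) (up)

If payments-prod is missing from that list while other namespaces are present, discovery is the suspect, not the cluster.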


Finding the Cause

Next step was checking the namespace itself:

kubectl get namespace payments-prod --show-labels

The relevant part of the LABELS column:

monitoring=disabled

Since Prometheus only selects ServiceMonitors from namespaces matching:

serviceMonitorNamespaceSelector:
  matchLabels:
    monitoring: enabled

Prometheus was doing exactly what it was configured to do.
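A quick way to double-check which namespaces still match the selector:

kubectl get namespaces -l monitoring=enabled

With the label flipped, payments-prod no longer shows up in that list.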

From Prometheus’ perspective, nothing was broken.


Fixing It

Restoring the label was enough:

kubectl label namespace payments-prod monitoring=enabled --overwrite

Metrics came back within a scrape interval.

Quick check to confirm freshness:

time() - timestamp(up{namespace="payments-prod"})

Everything was visible again.


Why This Is a Sneaky Failure

The interesting part of this exercise:

  • No alerts fired
  • Dashboards were empty, not red
  • Prometheus treats missing data as “nothing to evaluate” (see the example below)
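That last point is the crux. A rule written for “target down” just returns an empty result once the series disappears, so it never fires:

# Never fires when the series is gone: the comparison yields an empty vector
up{namespace="payments-prod"} == 0

# absent() turns "no data" into a signal, which the alert below relies on
absent(up{namespace="payments-prod"})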

If something real had broken during this window, I wouldn’t have known.

This is how you end up trusting “green” systems that you’re actually blind to.


Hardening Things Afterward

Alerting on Missing Metrics

I added an explicit alert to catch telemetry loss:

- alert: NamespaceMetricsMissing
  expr: absent(up{namespace="payments-prod"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No metrics from payments-prod"
    description: "Prometheus is scraping zero targets in this namespace."
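For the Operator to actually load this, the rule goes into a PrometheusRule object. A minimal wrapper, assuming it lives in the monitoring namespace and carries whatever labels the Prometheus CR’s ruleSelector expects:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: namespace-metrics-missing
  namespace: monitoring
  # add labels here if your Prometheus CR's ruleSelector requires them
spec:
  groups:
    - name: telemetry-loss
      rules:
        - alert: NamespaceMetricsMissing
          expr: absent(up{namespace="payments-prod"})
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "No metrics from payments-prod"
            description: "Prometheus is scraping zero targets in this namespace."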

Silence should page you.

