Santiago Salazar Pavajeau

Fixing Prometheus namespace monitoring

The Setup

I’m running a Kubernetes cluster with Prometheus Operator and a pretty standard discovery pattern:

  • ServiceMonitor objects
  • Namespace-based scrape selection
  • Grafana dashboards and Alertmanager rules downstream

The key piece is that Prometheus only discovers ServiceMonitors in namespaces that carry a specific label.

Here’s a simplified version of the config. The label-based namespace selection lives on the Prometheus resource’s serviceMonitorNamespaceSelector (a ServiceMonitor’s own namespaceSelector only takes namespace names, not labels), while the ServiceMonitor itself just selects the service:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  serviceMonitorSelector: {}
  # only look for ServiceMonitors in namespaces labeled monitoring=enabled
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-servicemonitor
  namespace: payments-prod
spec:
  selector:
    matchLabels:
      app: payments-api
  endpoints:
    - port: metrics
      interval: 30s

At this point, namespace labels are effectively part of the monitoring config.
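
For reference, the namespace side of that contract is nothing more than a label. A minimal manifest, assuming the namespace is managed declaratively rather than created ad hoc, looks like this:

apiVersion: v1
kind: Namespace
metadata:
  name: payments-prod
  labels:
    monitoring: enabled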


Breaking It

For the exercise, I flipped the namespace label from monitoring=enabled to monitoring=disabled.
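
The flip itself is a one-liner; --overwrite is needed because the key already has a value:

kubectl label namespace payments-prod monitoring=disabled --overwrite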

Nothing crashed.

Pods kept running.

Metrics quietly disappeared.

Exactly the kind of failure that’s easy to miss.


Detecting What’s Wrong

First thing I checked was whether Prometheus itself was unhealthy:

kubectl get pods -n monitoring
kubectl logs prometheus-k8s-0 -n monitoring

Everything looked fine.

Next, I checked whether Prometheus was scraping anything from the namespace:

up{namespace="payments-prod"}

No results.

That tells me Prometheus isn’t scraping targets — not that the app is down.
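
Another quick sanity check is to group the up series by namespace and see what is being scraped at all:

count by (namespace) (up)

payments-prod simply doesn’t appear in that result, which is a different signal from appearing with a value of 0.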


Finding the Cause

Next step was checking the namespace itself:

kubectl get namespace payments-prod --show-labels

Output looked like this:

monitoring=disabled

Since Prometheus is configured with:

serviceMonitorNamespaceSelector:
  matchLabels:
    monitoring: enabled

Prometheus was doing exactly what it was configured to do.

From Prometheus’ perspective, nothing was broken.
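
The targets page tells the same story. With kube-prometheus the web service is typically called prometheus-k8s (an assumption, but it matches the pod name above), so something like this gets you to the UI:

kubectl port-forward -n monitoring svc/prometheus-k8s 9090:9090

On http://localhost:9090/targets there’s no target group for the namespace at all, rather than a failing one.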


Fixing It

Restoring the label was enough:

kubectl label namespace payments-prod monitoring=enabled --overwrite

Metrics came back within a scrape interval.

Quick check to confirm freshness:

time() - timestamp(up{namespace="payments-prod"})

Everything was visible again.


Why This Is a Sneaky Failure

The interesting part of this exercise:

  • No alerts fired
  • Dashboards were empty, not red
  • Prometheus treats missing data as “nothing to evaluate”

If something real had broken during this window, I wouldn’t have known.

This is how you end up trusting “green” systems that you’re actually blind to.
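
To make that concrete: a normal threshold rule only fires when its expression returns series. Something like this hypothetical rule (not one from my setup) goes quiet rather than firing the moment the series vanish:

- alert: PaymentsApiDown
  expr: up{namespace="payments-prod"} == 0
  for: 5m

With the namespace no longer scraped, up{namespace="payments-prod"} returns nothing, the comparison returns nothing, and the alert never even goes pending.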


Hardening Things Afterward

Alerting on Missing Metrics

I added an explicit alert to catch telemetry loss:

- alert: NamespaceMetricsMissing
  expr: absent(up{namespace="payments-prod"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No metrics from payments-prod"
    description: "Prometheus is scraping zero targets in this namespace."

Silence should page you.
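
One caveat: absent() needs a hard-coded selector, so this pattern means one rule per namespace you care about. It’s also worth validating the rule file before shipping it; promtool handles that (rules.yaml here is just an assumed filename):

promtool check rules rules.yaml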


Top comments (1)

OnlineProxy

Your approach is rock solid, but.. namespace labels have basically become critical infrastructure now, except nobody's really treating them that way since they just kinda live outside your monitoring setup and can vanish without a trace. So yeah, the absent() alert is a great start, but if you really wanna lock this down, I'd throw more layers at it. Slap an admission controller in there using Kyverno or OPA to actually block label deletions at the API server level, stick your namespace definitions in GitOps so every change gets audited and someone has to sign off on it, and add some meta-monitoring that watches prometheus_target_scrape_pools_active_targets versus desired_targets to catch blind spots even when individual namespaces aren't screaming into your alerts. Basically, you're flipping this from a configuration drift oops problem into a policy enforcement thing - which is way easier to handle when you're scaling up