The Setup
I’m running a Kubernetes cluster with Prometheus Operator and a pretty standard discovery pattern:
- ServiceMonitor objects
- Namespace-based scrape selection
- Grafana dashboards and Alertmanager rules downstream
The key piece is that Prometheus only picks up ServiceMonitors, and therefore scrape targets, from namespaces carrying a specific label. (A ServiceMonitor's own namespaceSelector only supports matchNames or any: true; the label-based gating lives on the Prometheus custom resource as serviceMonitorNamespaceSelector.) Here's a simplified version of both pieces:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  serviceMonitorNamespaceSelector:
    matchLabels:
      monitoring: enabled
  serviceMonitorSelector: {}
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-servicemonitor
  namespace: payments-prod
spec:
  selector:
    matchLabels:
      app: payments-api
  endpoints:
    - port: metrics
      interval: 30s
At this point, namespace labels are effectively part of the monitoring config.
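A quick way to confirm which selector your Prometheus actually uses (the resource name k8s is an assumption, consistent with the prometheus-k8s-0 pod that shows up later):

kubectl get prometheus k8s -n monitoring -o jsonpath='{.spec.serviceMonitorNamespaceSelector}'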
Breaking It
For the exercise, I flipped the namespace's monitoring=enabled label to monitoring=disabled.
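The flip is a single command; the namespace name matches the queries later in the post:

kubectl label namespace payments-prod monitoring=disabled --overwrite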
Nothing crashed.
Pods kept running.
Metrics quietly disappeared.
Exactly the kind of failure that’s easy to miss.
Detecting What’s Wrong
First thing I checked was whether Prometheus itself was unhealthy:
kubectl get pods -n monitoring
kubectl logs prometheus-k8s-0 -c prometheus -n monitoring
Everything looked fine.
Next, I checked whether Prometheus was scraping anything from the namespace:
up{namespace="payments-prod"}
No results.
That tells me Prometheus isn’t scraping targets — not that the app is down.
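Another way to see the same thing from Prometheus' side is to list active targets straight from its API instead of going through PromQL. A sketch, assuming the usual kube-prometheus service name (prometheus-k8s, matching the pod above) and that jq is installed:

kubectl -n monitoring port-forward svc/prometheus-k8s 9090 &
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[].labels.namespace' | sort | uniq -c

payments-prod simply doesn't appear in the counts.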
Finding the Cause
Next step was checking the namespace itself:
kubectl get namespace payments-prod --show-labels
The relevant label in the output:
monitoring=disabled
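Checking whether any other namespace has drifted the same way is cheap with a label column:

kubectl get namespaces -L monitoring

Anything blank or disabled in that column is invisible to Prometheus.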
Since the Prometheus CR only picks up ServiceMonitors from namespaces matching:

serviceMonitorNamespaceSelector:
  matchLabels:
    monitoring: enabled
Prometheus was doing exactly what it was configured to do.
From Prometheus’ perspective, nothing was broken.
Fixing It
Restoring the label was enough:
kubectl label namespace payments-prod monitoring=enabled --overwrite
Metrics came back within a scrape interval.
Quick check to confirm freshness:
time() - timestamp(up{namespace="payments-prod"})
Everything was visible again.
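For a pass/fail version of the same freshness check, the expression can be thresholded (60s here is an arbitrary cut-off of two scrape intervals); an empty result means every series is fresher than that:

time() - timestamp(up{namespace="payments-prod"}) > 60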
Why This Is a Sneaky Failure
The interesting part of this exercise:
- No alerts fired
- Dashboards were empty, not red
- Prometheus treats missing data as “nothing to evaluate”
If something real had broken during this window, I wouldn’t have known.
This is how you end up trusting “green” systems that you’re actually blind to.
Hardening Things Afterward
Alerting on Missing Metrics
I added an explicit alert to catch telemetry loss:
- alert: NamespaceMetricsMissing
  expr: absent(up{namespace="payments-prod"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No metrics from payments-prod"
    description: "Prometheus is scraping zero targets in this namespace."
Silence should page you.
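Since this cluster runs the Prometheus Operator, the rule has to ship as a PrometheusRule object for Prometheus to load it. A minimal wrapper, assuming a ruleSelector like the kube-prometheus default (adjust the metadata labels to whatever your Prometheus CR actually selects):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: namespace-metrics-missing
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
    - name: telemetry-loss
      rules:
        # The NamespaceMetricsMissing rule from above, unchanged.
        - alert: NamespaceMetricsMissing
          expr: absent(up{namespace="payments-prod"})
          for: 5m
          labels:
            severity: critical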
Top comments (1)
Your approach is rock solid, but... namespace labels have basically become critical infrastructure now, except nobody's really treating them that way since they just kinda live outside your monitoring setup and can vanish without a trace. So yeah, the absent() alert is a great start, but if you really wanna lock this down, I'd throw more layers at it. Slap an admission controller in there using Kyverno or OPA to actually block label deletions at the API server level, stick your namespace definitions in GitOps so every change gets audited and someone has to sign off on it, and add some meta-monitoring on prometheus_sd_discovered_targets and prometheus_target_scrape_pool_targets so a sudden drop in scraped targets stands out even when individual namespaces aren't screaming into your alerts. Basically, you're flipping this from a configuration drift oops problem into a policy enforcement thing - which is way easier to handle when you're scaling up.
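To make the admission-control suggestion concrete, here's a rough Kyverno ClusterPolicy that rejects namespace updates which strip or change the label. The policy name and message are made up for illustration; treat it as a sketch to adapt, not a drop-in:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: protect-monitoring-label
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: block-monitoring-label-removal
      match:
        any:
          - resources:
              kinds:
                - Namespace
              operations:
                - UPDATE
      # Only evaluate namespaces that carried monitoring=enabled before this request.
      preconditions:
        all:
          - key: "{{ request.oldObject.metadata.labels.monitoring || '' }}"
            operator: Equals
            value: "enabled"
      validate:
        message: "The monitoring=enabled label must not be removed or changed on this namespace."
        # Reject the update if the incoming object no longer carries monitoring=enabled.
        deny:
          conditions:
            any:
              - key: "{{ request.object.metadata.labels.monitoring || '' }}"
                operator: NotEquals
                value: "enabled"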