The Setup
I’m running a Kubernetes cluster with Prometheus Operator and a pretty standard discovery pattern:
- ServiceMonitor objects
- Namespace-based scrape selection
- Grafana dashboards and Alertmanager rules downstream
The key piece is that Prometheus only scrapes namespaces with a specific label.
Here’s a simplified version of the ServiceMonitor:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-servicemonitor
spec:
  selector:
    matchLabels:
      app: payments-api
  namespaceSelector:
    matchLabels:
      monitoring: enabled
  endpoints:
    - port: metrics
      interval: 30s
At this point, namespace labels are effectively part of the monitoring config.
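A quick way to see which namespaces are currently opted in to scraping:

kubectl get namespaces -l monitoring=enabled

If payments-prod is missing from that list, nothing in it gets scraped.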
Breaking It
For the exercise, I flipped the namespace label from monitoring=enabled to monitoring=disabled.
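That flip is just the label command run with the opposite value (the same --overwrite flag as the fix later):

kubectl label namespace payments-prod monitoring=disabled --overwrite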
Nothing crashed.
Pods kept running.
Metrics quietly disappeared.
Exactly the kind of failure that’s easy to miss.
Detecting What’s Wrong
First thing I checked was whether Prometheus itself was unhealthy:
kubectl get pods -n monitoring
kubectl logs prometheus-k8s-0 -n monitoring
Everything looked fine.
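It's also worth confirming that the ServiceMonitor and the Service it selects still exist; the names here are the ones from the example above:

kubectl get servicemonitors --all-namespaces | grep app-servicemonitor
kubectl get service -n payments-prod -l app=payments-api

If either of those is gone, it's a very different debugging session.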
Next, I checked whether Prometheus was scraping anything from the namespace:
up{namespace="payments-prod"}
No results.
That tells me Prometheus isn’t scraping targets — not that the app is down.
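The distinction matters: an empty result from up means Prometheus isn't discovering any targets in the namespace at all, while targets that exist but fail their scrapes would still show up with a value of 0:

up{namespace="payments-prod"} == 0

If that returns nothing either, even though the namespace clearly has running pods, the problem is discovery, not the application.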
Finding the Cause
Next step was checking the namespace itself:
kubectl get namespace payments-prod --show-labels
The labels column included:
monitoring=disabled
Since the ServiceMonitor relies on:
namespaceSelector:
  matchLabels:
    monitoring: enabled
Prometheus was doing exactly what it was configured to do.
From Prometheus’ perspective, nothing was broken.
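You can see the same thing in the Prometheus UI. Port-forward the pod from earlier (9090 is the default Prometheus port):

kubectl port-forward -n monitoring prometheus-k8s-0 9090

Then open http://localhost:9090/targets: the payments-prod targets simply aren't listed, as opposed to being listed as down.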
Fixing It
Restoring the label was enough:
kubectl label namespace payments-prod monitoring=enabled --overwrite
Metrics came back within a scrape interval.
Quick check to confirm freshness (the age in seconds of each target's most recent up sample):
time() - timestamp(up{namespace="payments-prod"})
Everything was visible again.
Why This Is a Sneaky Failure
The interesting part of this exercise:
- No alerts fired
- Dashboards were empty, not red
- Prometheus treats missing data as “nothing to evaluate”
If something real had broken during this window, I wouldn’t have known.
This is how you end up trusting “green” systems that you’re actually blind to.
Hardening Things Afterward
Alerting on Missing Metrics
I added an explicit alert to catch telemetry loss:
- alert: NamespaceMetricsMissing
  expr: absent(up{namespace="payments-prod"})
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "No metrics from payments-prod"
    description: "Prometheus is scraping zero targets in this namespace."
Silence should page you.
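The rule above is pinned to a single namespace. A more general sketch (alert name, severity, and time windows here are placeholders, not something from this cluster) fires when any namespace that had scrape targets an hour ago has none now:

- alert: ScrapeTargetsDisappeared
  expr: count by (namespace) (up offset 1h) unless count by (namespace) (up)
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Scrape targets vanished from {{ $labels.namespace }}"
    description: "This namespace had scrape targets an hour ago and has none now."

It won't catch namespaces that were never scraped, but it covers the "it used to be there" case without hard-coding names.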