I didn't build this project to add a line to my resume.
I built it because I kept reading about Prometheus and Grafana, nodding along like I understood it, and then freezing when someone asked me "so how does Prometheus actually discover your pods?"
I didn't know. Not really.
So I decided to stop reading and start breaking things.
What I built
A complete observability pipeline from scratch:
- A Python Flask app with custom Prometheus metrics
- Deployed on Kubernetes with 3 replicas
- Scraped by Prometheus via ServiceMonitor
- Visualized in Grafana with PromQL dashboards
- Load tested with real traffic using `hey`
The repo is here if you want to follow along:
👉 github.com/adil-khan-723/k8s-observability-stack
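For context, what Prometheus actually reads from an app is just plain text. Here's a stdlib-only sketch of the text exposition format (the real app uses `prometheus_client` with Flask; the metric values and labels below are illustrative, not taken from the repo):

```python
# Toy renderer for the Prometheus text exposition format.
# A real app would let prometheus_client do this.
def render_metrics(counters):
    """counters: dict mapping (name, label_pairs) -> float value."""
    lines = []
    for (name, labels), value in sorted(counters.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

counters = {
    ("http_requests_total", (("method", "GET"), ("path", "/"))): 23000.0,
}
print(render_metrics(counters))
# http_requests_total{method="GET",path="/"} 23000.0
```

Every scrape is just an HTTP GET that returns lines like this; everything else in the pipeline is built on top of that.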
But the code isn't the interesting part. What I learned by watching it fail is.
Mistake #1 — I used raw counters in Grafana and wondered why nothing made sense
First dashboard. I added `http_requests_total` as a panel. The number just kept climbing. 1000. 5000. 23000.
I stared at it thinking "okay... is that good?"
It tells you nothing. A counter that only goes up is like a car's odometer — it doesn't tell you how fast you're going right now.
The correct query is:

```promql
sum(rate(http_requests_total[1m]))
```
`rate()` calculates the average requests per second over the last minute. That's a number you can actually act on. After switching to this, I could see exactly when traffic spiked during load testing and when it dropped off.
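To make that concrete: `rate()` is roughly the counter's increase divided by elapsed time over the trailing window (real PromQL also handles counter resets and extrapolation, which this toy version skips):

```python
def simple_rate(samples, window):
    """Approximate PromQL rate(): per-second increase of a counter
    over the trailing `window` seconds.
    samples: list of (timestamp, value) pairs, sorted by time.
    Ignores counter resets for simplicity."""
    t_end, v_end = samples[-1]
    # Keep only the samples inside the trailing window.
    inside = [(t, v) for t, v in samples if t >= t_end - window]
    t_start, v_start = inside[0]
    if t_end == t_start:
        return 0.0
    return (v_end - v_start) / (t_end - t_start)

# A counter climbing at ~50 requests/second, sampled every 10s.
samples = [(t, 50.0 * t) for t in range(0, 61, 10)]
print(simple_rate(samples, 60))  # → 50.0
```

The raw counter value (3000 at t=60) tells you nothing; the rate (50 req/s) is the speedometer.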
Lesson: metrics alone are useless. PromQL creates insight.
Mistake #2 — I set the rate window too large and hid all my spikes
Once I had rate() working, I noticed my graph looked... suspiciously smooth. Almost like nothing was happening even during load tests.
I was using `rate(http_requests_total[8m])`. An 8-minute window averages out everything. A spike that lasted 30 seconds disappears completely.
Switched to `[1m]`. Suddenly I could see exactly what happened during the load test — a sharp climb, a plateau, a drop. Real information.
The dashboard also had stacked graphs enabled. Stacking makes it look like total traffic is the sum of all the colored areas, which is visually misleading when you're trying to compare per-pod behavior. Disabled it immediately.
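The smoothing effect is just arithmetic. With a toy traffic trace (numbers invented for illustration, not the repo's data), the same 30-second spike looks dramatic in a 1-minute mean and nearly vanishes in an 8-minute one:

```python
def avg_rate(per_second, window):
    """Mean request rate over the trailing `window` seconds.
    per_second: list of req/s values, one per second, newest last."""
    return sum(per_second[-window:]) / window

# 7.5 minutes of ~10 req/s baseline, then a 30-second spike to 200 req/s.
traffic = [10.0] * 450 + [200.0] * 30

print(avg_rate(traffic, 60))   # → 105.0  (spike clearly visible)
print(avg_rate(traffic, 480))  # → 21.875 (spike averaged away)
```

Same data, same moment in time; the only difference is the window. That's why the 8-minute graph looked "suspiciously smooth."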
Mistake #3 — Prometheus showed all targets DOWN and I had no idea why
This one took me a while.
I ran load tests, checked Prometheus UI under Status → Targets, and saw this:
```
context deadline exceeded
```
All 3 pods. All DOWN.
My first instinct was to blame the ServiceMonitor config. I triple-checked the labels. Everything matched. The problem wasn't discovery — Prometheus was finding the pods fine. It just couldn't scrape them in time.
Root cause: I had a `time.sleep(2)` sitting in my home route, which blocked the entire Gunicorn worker. When Prometheus tried to hit `/metrics`, the worker was sometimes busy sleeping and the scrape timed out.
The fix was two things:
- Increased the scrape `interval` in `serviceMonitor.yaml` to give more breathing room
- Removed the artificial delay (or accounted for it explicitly)
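For reference, both timing knobs live on the ServiceMonitor's endpoint entry. A sketch of the relevant fragment (the port name and values are assumptions; check them against the repo):

```yaml
# serviceMonitor.yaml (excerpt, illustrative values)
endpoints:
  - port: http           # the Service port *name*, not the number
    path: /metrics
    interval: 30s        # how often Prometheus scrapes each target
    scrapeTimeout: 10s   # must be <= interval; this is the "context deadline"
```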
Watched all 3 targets flip back to UP in real time. That was a good moment.
The deeper lesson: your monitoring system depends on your application's performance. A slow app breaks its own observability. This is why you should never put business logic on the /metrics endpoint.
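One common mitigation (my assumption, not necessarily what the repo does) is to serve metrics from a separate port on their own thread, so a handler stuck in business logic can't starve the scrape. `prometheus_client.start_http_server` does essentially this for real apps; here is a stdlib-only sketch of the idea:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Illustrative metric store; a real app would use prometheus_client.
METRICS = {"http_requests_total": 0}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Render metrics in Prometheus' plain-text exposition format.
        body = "".join(f"{k} {v}\n" for k, v in METRICS.items()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging in the demo

def start_metrics_server(port=0):
    """Serve metrics on a daemon thread; port=0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Because the metrics server runs on its own thread and port, a `time.sleep(2)` in the main app can no longer make the scrape miss its deadline.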
Mistake #4 — I assumed Running meant healthy
After fixing the scrape issue, I noticed something strange. `kubectl get pods` showed all pods as `Running`. But requests were still failing intermittently.
I had been treating Running as "everything is fine." It isn't.
Running just means the container process started. It says nothing about whether the application inside is actually ready to serve traffic. A pod can be Running while your Flask app is still initializing, or while it's in a broken state that hasn't crashed the process.
The fix was a `readinessProbe`:

```yaml
readinessProbe:
  httpGet:
    path: /metrics
    port: 5000
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2
```
Once this was in place, Kubernetes automatically removed unhealthy pods from the Service's endpoint list. Traffic only went to pods that were actually ready. The intermittent failures stopped.
Running ≠ healthy. Readiness probes are not optional.
The thing that surprised me most — load isn't evenly distributed
During load testing I split the Grafana panel to show per-pod request rates:

```promql
rate(http_requests_total[1m])
```
I expected three roughly equal lines. What I got were three noticeably different ones. One pod was consistently getting more traffic than the others.
Kubernetes Services balance traffic per connection, not per request (with iptables-mode kube-proxy the backend is picked essentially at random). Under high concurrency with keep-alive, some pods end up holding more long-lived connections and therefore handle more requests.
If I had only been looking at the aggregate, `sum(rate(http_requests_total[1m]))`, I would never have seen this. The sum looked perfectly healthy. The per-pod view told a completely different story.
This is why per-pod metrics exist. Aggregates hide things.
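In Grafana the split comes from grouping by the `pod` label (and using `{{pod}}` in the panel legend):

```promql
# one series per pod, instead of one aggregate line
sum by (pod) (rate(http_requests_total[1m]))
```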
What the ServiceMonitor actually does (and why it's confusing at first)
The part that confused me most before building this was the relationship between Prometheus and Kubernetes.
Prometheus doesn't scrape Services directly, and you don't configure pods by hand. Targets are discovered through a custom resource called a ServiceMonitor.
The chain looks like this:
ServiceMonitor → matches Service labels → Service resolves to Pod IPs → Prometheus scrapes each pod
For this to work, three things have to align exactly:
- The ServiceMonitor's `selector` must match the Service's labels
- The ServiceMonitor itself must carry the `release: monitoring` label (or whatever your Helm release is named) so the operator's `serviceMonitorSelector` picks it up
- The `namespaceSelector` must point to the namespace where the Service lives
Get any one of these wrong and Prometheus simply never discovers the target. No error. Just silence. This is the part where most people spend hours debugging.
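Put together, the three couplings look like this (names and namespaces are illustrative, not copied from the repo):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: flask-app
  labels:
    release: monitoring        # (2) matched by the operator's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - default                # (3) the namespace where the Service lives
  selector:
    matchLabels:
      app: flask-app           # (1) must equal the Service's metadata.labels
  endpoints:
    - port: http
      path: /metrics
```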
What I'd add next
- Alertmanager — fire alerts when scrape targets go DOWN or error rate spikes
- HPA — autoscale pods based on custom Prometheus metrics
- Loki — correlate logs with metric anomalies
- Pinned dependency versions — `requirements.txt` currently has no versions, which is a reproducibility risk
Final thought
The most valuable part of this project wasn't setting up Prometheus or building the Grafana dashboard. It was the moment I saw `context deadline exceeded` and had to actually figure out why.
You don't learn observability by reading about it. You learn it by watching your own system fail and having to diagnose it with the tools you built.
If you're learning DevOps or platform engineering, build something, break it, and read the metrics.
Repo: github.com/adil-khan-723/k8s-observability-stack
Have you run into similar issues with Prometheus scraping? Drop it in the comments — would love to hear what weird things you've debugged.
