Prometheus + Grafana on Kubernetes
What you’ll accomplish
- Install a production-style monitoring stack (Prometheus, Alertmanager, Grafana, Node Exporter, Kube-State-Metrics)
- Explore Prometheus UI and run PromQL queries
- Import Grafana dashboards and see live cluster metrics
- Create and test an alert (Slack optional)
- Stress the cluster to “see the graphs move”
- (Optional) Expose your own application's /metrics endpoint
Agenda (you can keep time with this)
| Time | Topic | 
|---|---|
| 0:00–0:10 | Monitoring basics (what & why) | 
| 0:10–0:25 | Prometheus & Grafana concepts | 
| 0:25–0:35 | Cluster prep (K8s, kubectl, Helm) | 
| 0:35–1:05 | Install kube-prometheus-stack (Helm) | 
| 1:05–1:25 | Prometheus UI + PromQL basics | 
| 1:25–1:45 | Grafana dashboards (import & explore) | 
| 1:45–2:05 | Alerts: create, load, test | 
| 2:05–2:25 | Slack notifications (optional but great) | 
| 2:25–2:45 | Load test to trigger alert (wow moment) | 
| 2:45–3:00 | Troubleshooting, cleanup, next steps | 
Section 0 — Prerequisites (very clear)
You need:
- A Kubernetes cluster (any of these is fine): EKS (AWS), Minikube, Docker Desktop (Kubernetes enabled), kind, or kubeadm
- `kubectl` connected to that cluster
- `helm` installed (v3+)
Quick start (Minikube) — Mac or Linux
```bash
# Install minikube if needed (Mac via Homebrew)
brew install minikube

# Start a local cluster with 3 nodes (better metrics variety)
minikube start --nodes=3 --cpus=2 --memory=4096
```
Quick check:
```bash
kubectl get nodes
# Expect to see Ready nodes
```
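It's also worth a quick sanity check that Helm is v3+ and kubectl is pointed at the cluster you intend to use:
```bash
helm version --short               # should report v3.x
kubectl config current-context     # should be your lab cluster (e.g. minikube)
```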
Section 1 — Monitoring basics (10 min, explain in your words)
- Why monitor? To detect issues early (CPU/memory saturation, errors, latency).
- Metrics vs Logs vs Traces
  - Metrics = numbers over time (cheap, fast to aggregate)
  - Logs = text events (debug detail)
  - Traces = end-to-end request path (latency analysis)
- Where do metrics come from? Exporters & app endpoints (usually /metrics).
Section 2 — Concepts you’ll need (clear & short)
- Prometheus: scrapes endpoints periodically, stores time-series.
- Exporters: programs exposing metrics (node-exporter, kube-state-metrics, blackbox).
- PromQL: query language to analyze numbers over time.
- Alertmanager: routes alerts (Slack/Email/PagerDuty).
- Grafana: dashboards for visualization.
Pull model: Prometheus pulls data from targets rather than having them push; this is easier to scale, and targets are found via service discovery.
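To make the pull model concrete, this is what a hand-written scrape job looks like in a standalone prometheus.yml. It is purely illustrative, since the Helm chart in the next section generates the real configuration; the job name and target host below are made up:
```yaml
# prometheus.yml (illustrative only)
scrape_configs:
  - job_name: "node"                               # hypothetical job name
    scrape_interval: 15s                           # how often Prometheus pulls /metrics
    static_configs:
      - targets: ["node-exporter.example:9100"]    # hypothetical target host:port
```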
Section 3 — Install kube-prometheus-stack (the easy + standard way)
This Helm chart bundles: Prometheus, Alertmanager, Grafana, Node Exporter, Kube-State-Metrics, rules & dashboards.
3.1 Add Helm repo & update
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```
3.2 Install into its own namespace
```bash
helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace
```
What happens:
- Namespace `monitoring` is created
- Deployments/DaemonSets/Services are created for Prometheus, Grafana, Alertmanager, and the exporters
- Default recording/alerting rules are installed
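If you want the install to be reproducible, a common pattern is to pin the chart version and keep overrides in a values file. A sketch; the version flag and my-values.yaml are placeholders:
```bash
# See which chart versions are available
helm search repo prometheus-community/kube-prometheus-stack --versions | head

# Same install, pinned and driven by a values file
helm install monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  --version <chart-version> \
  -f my-values.yaml
```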
3.3 Verify pods
```bash
kubectl get pods -n monitoring
```
Expect to see pods like (exact names vary with the Helm release name; these match the monitoring release used above):
- prometheus-monitoring-kube-prometheus-prometheus-0
- alertmanager-monitoring-kube-prometheus-alertmanager-0
- monitoring-grafana-...
- monitoring-kube-state-metrics-...
- monitoring-prometheus-node-exporter-... (DaemonSet, one per node)
If some are Pending, your cluster may be resource-constrained. Increase Minikube resources or add nodes.
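A quick way to see why a pod is Pending, plus one option for giving Minikube more headroom (a sketch; note that resizing effectively means recreating the local cluster):
```bash
# The Events section shows the scheduling reason (e.g. insufficient cpu/memory)
kubectl describe pod <pending-pod-name> -n monitoring

# Minikube only: recreate the cluster with more resources (this deletes the local cluster!)
minikube delete
minikube start --nodes=3 --cpus=4 --memory=6144
```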
Section 4 — Accessing the UIs (local port-forward)
4.1 Prometheus UI
```bash
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-prometheus 9090:9090
```
Open: http://localhost:9090
(If the service name differs on your cluster, list the services with `kubectl get svc -n monitoring`.)
4.2 Grafana UI
```bash
kubectl port-forward -n monitoring svc/monitoring-grafana 3000:80
```
Open: http://localhost:3000
Default credentials (from this chart):
- Username: admin
- Password: prom-operator
If login fails, fetch the real password:
```bash
kubectl get secret -n monitoring monitoring-grafana -o jsonpath='{.data.admin-password}' | base64 --decode; echo
```
Section 5 — Prometheus UI & PromQL basics (20 min)
In http://localhost:9090 → Graph tab → Expression box.
Try these (type & Execute):
- Target health: shows whether Prometheus can reach each target (1 = up, 0 = down). Typing `node_` in the expression box is also an easy way to discover node exporter metric names.

  ```promql
  up
  ```
- CPU usage (raw counter): a monotonically increasing counter per CPU mode (user/system/idle). Counters require rates to become meaningful.

  ```promql
  node_cpu_seconds_total
  ```
- CPU usage (rate over 5 minutes): rate = how fast the counter increases. You'll get a series per mode (user/system/idle), so filter by mode as in the next queries.

  ```promql
  rate(node_cpu_seconds_total[5m])
  ```
- Idle CPU % (convert to percent)

  ```promql
  avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
  ```
- Non-idle CPU %

  ```promql
  (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100
  ```
- Memory available %

  ```promql
  node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
  ```
- Disk space used %

  ```promql
  100 - (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} * 100)
  ```
Tips to teach:
- Counters → `rate()` or `irate()`
- Gauges (values go up/down) → use directly (no rate)
- `sum by (...)`, `avg by (...)` aggregate over labels
- Range selectors like `[5m]` define the lookback window
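The same queries can also be run outside the UI against Prometheus's HTTP API, which is handy for scripting. A sketch, assuming the port-forward from Section 4.1 is still running on localhost:9090:
```bash
# Non-idle CPU % via the HTTP API (jq is optional, just for pretty-printing)
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=(1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100' | jq .
```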
Section 6 — Grafana: import high-value dashboards (20 min)
Open http://localhost:3000 → Log in → Dashboards → Import → enter IDs below:
- Node Exporter Full (comprehensive host metrics): 1860
- Kubernetes / Compute Resources / Cluster (comes prepackaged; you’ll see a list in the stack)
After import:
- Choose Prometheus as the data source
- Save
Explore panels (CPU, memory, disk, network). Point out legends and labels.
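If you prefer dashboards to be provisioned automatically instead of imported by hand, the bundled Grafana sub-chart can pull them by grafana.com ID. A sketch of values you could pass via helm upgrade -f; the dashboard key and revision number are illustrative (check grafana.com for the current revision of dashboard 1860):
```yaml
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          type: file
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      node-exporter-full:          # arbitrary key for this dashboard
        gnetId: 1860               # "Node Exporter Full" on grafana.com
        revision: 31               # example revision; use the latest published one
        datasource: Prometheus
```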
Section 7 — Create an alert rule (Prometheus rule file) (20 min)
We’ll alert when non-idle CPU > 85% for 2 minutes.
Create cpu-alert.yaml. With kube-prometheus-stack, alert rules are loaded through the PrometheusRule custom resource; the release: monitoring label matches the operator's default rule selector for the Helm release name used above:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cpu-alerts
  labels:
    release: monitoring
spec:
  groups:
  - name: cpu-alerts
    rules:
    - alert: HighCPUUsage
      expr: (1 - avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 85
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "High CPU usage (>85%) on {{ $labels.instance }}"
        description: "Instance {{ $labels.instance }} CPU usage has been >85% for 2 minutes."
```
Apply:
```bash
kubectl apply -n monitoring -f cpu-alert.yaml
```
What this does
- Prometheus continuously evaluates the expression
- If it stays true for for: 2m, it fires and sends to Alertmanager
Where does it send? By default, Alertmanager stores & shows active alerts (no external receivers yet). We’ll add Slack next.
Check Prometheus → Alerts tab: you should see HighCPUUsage (inactive for now).
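You can also confirm the operator picked the rule up; the resource and namespace names follow this walkthrough, so adjust if yours differ:
```bash
# The cpu-alerts PrometheusRule should be listed alongside the chart's built-in rules
kubectl get prometheusrules -n monitoring
```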
Section 8 — (Optional) Slack notifications (20 min)
8.1 Create a Slack incoming webhook
- In Slack → Apps → Incoming Webhooks → Add to workspace → choose channel → copy Webhook URL (looks like https://hooks.slack.com/services/T000/B000/XXXXXXXX)
8.2 Create Alertmanager config
Create alertmanager-slack.yaml:
```yaml
alertmanager:
  config:
    route:
      receiver: "slack-default"
      group_by: ["alertname", "instance"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
    receivers:
      - name: "slack-default"
        slack_configs:
          - send_resolved: true
            api_url: "https://hooks.slack.com/services/REPLACE/ME/HERE"
            channel: "#your-alerts-channel"
            title: "{{ .CommonAnnotations.summary }}"
            text: >-
              *Description:* {{ .CommonAnnotations.description }}
              *Status:* {{ .Status }}
              *Labels:* {{ .CommonLabels }}
```
8.3 Apply via Helm upgrade (best practice)
```bash
helm upgrade monitoring prometheus-community/kube-prometheus-stack \
  -n monitoring \
  -f alertmanager-slack.yaml
```
Wait for Alertmanager pods to roll:
```bash
kubectl get pods -n monitoring -w
```
8.4 Verify in Alertmanager UI (optional)
Port-forward:
```bash
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-alertmanager 9093:9093
```
Open: http://localhost:9093
You should see the route and receiver.
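To test the Slack route without waiting for a real alert, you can push a hand-crafted alert straight to Alertmanager's v2 API. A sketch, assuming the port-forward above is running; the alert name and labels are made up:
```bash
curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {"alertname": "ManualTestAlert", "severity": "warning", "instance": "manual-test"},
        "annotations": {"summary": "Manual test alert", "description": "Sent via curl to verify the Slack receiver"}
      }]'
```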
Section 9 — Trigger the alert (load test) (15–20 min)
We’ll create a CPU-burning pod.
Option A — Busybox burn (simplest)
```bash
kubectl run load --image=busybox -- /bin/sh -c "while true; do :; done"
```
Option B — Stress image (more aggressive)
```bash
kubectl run stress --image=ghcr.io/shenxianpeng/stress -- -c 2 -t 600
# -c 2 -> 2 CPU workers, -t 600 -> 10 minutes
```
Now watch:
- Grafana dashboards (CPU climbing)
- Prometheus → Alerts tab: HighCPUUsage → Pending → Firing
- Slack: notification in your channel (if configured)
Teach: alerts have states: inactive → pending → firing (pending = the condition is true but hasn't yet held for the full for: window).
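If metrics-server is available (on Minikube: `minikube addons enable metrics-server`), `kubectl top` gives a quick CLI view of the same climb:
```bash
# Watch node and pod CPU rise while the load pod runs (requires metrics-server)
kubectl top nodes
kubectl top pods
```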
Clean test pods when done:
```bash
kubectl delete pod load stress --ignore-not-found
```
Section 10 — (Optional) Add your app metric in 5 minutes
10.1 Tiny Python web app exposing /metrics
Create app.py:
```python
from flask import Flask
from prometheus_client import Counter, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)
REQUESTS = Counter('app_requests_total', 'Total app requests')

@app.route("/")
def home():
    REQUESTS.inc()
    return "Hello, Prometheus!"

@app.route("/metrics")
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```
Dockerfile:
```dockerfile
FROM python:3.11-slim
RUN pip install flask prometheus_client
COPY app.py /app.py
CMD ["python", "/app.py"]
```
Build & push (replace <yourrepo>):
```bash
docker build -t <yourrepo>/prom-app:v1 .
docker push <yourrepo>/prom-app:v1
```
K8s manifest prom-app.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prom-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prom-app
  template:
    metadata:
      labels:
        app: prom-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "5000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: prom-app
        image: <yourrepo>/prom-app:v1
        ports:
        - containerPort: 5000
---
apiVersion: v1
kind: Service
metadata:
  name: prom-app
spec:
  selector:
    app: prom-app
  ports:
  - port: 5000
    targetPort: 5000
```
Apply:
```bash
kubectl apply -f prom-app.yaml
```
Why those annotations?
The prometheus.io/* annotations are a widely used convention, but out of the box kube-prometheus-stack (the Prometheus Operator) does not scrape pods based on them; it discovers targets through ServiceMonitor and PodMonitor resources (or an extra annotation-based job added via additionalScrapeConfigs). The simplest route with this chart is a ServiceMonitor, sketched below.
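A minimal ServiceMonitor sketch for prom-app. Assumptions: the Helm release is named monitoring (so the release: monitoring label matches the operator's default selector), prom-app runs in the default namespace, and you add an app: prom-app label to the Service's metadata so the selector below can match it:
```yaml
# prom-app-servicemonitor.yaml (sketch; see assumptions above)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prom-app
  namespace: monitoring
  labels:
    release: monitoring        # matched by the operator's default serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - default                # namespace where the prom-app Service lives
  selector:
    matchLabels:
      app: prom-app            # must match labels on the Service itself (add them if missing)
  endpoints:
    - targetPort: 5000         # the container port serving /metrics
      path: /metrics
      interval: 30s
```
Apply it with `kubectl apply -f prom-app-servicemonitor.yaml`.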
Check Prometheus UI → Targets → find prom-app target Up.
Try PromQL:
```promql
app_requests_total
rate(app_requests_total[5m])
```
Generate traffic:
```bash
kubectl run curl --image=curlimages/curl -it --rm -- \
  sh -lc 'for i in $(seq 1 100); do curl -s http://prom-app:5000/ >/dev/null; done'
```
Section 11 — Troubleshooting (common real issues)
- Grafana password wrong
  - Get it from the secret: `kubectl get secret -n monitoring monitoring-grafana -o jsonpath='{.data.admin-password}' | base64 --decode; echo`
- Port-forward hangs / connection refused
  - Ensure service names are correct (`kubectl get svc -n monitoring`)
  - Check pods are Ready: `kubectl get pods -n monitoring`, then `kubectl describe pod <name> -n monitoring`
- Prometheus "Targets down"
  - Check network policies (if any) and that the Service/Endpoints exist
  - In Prometheus UI → Status → Targets → look at the error messages
- No metrics in Grafana panels
  - Verify Prometheus is set as the data source (Grafana → Connections → Data sources)
  - Query directly in Prometheus to confirm the metric exists
- Alerts not reaching Slack
  - Open the Alertmanager UI at http://localhost:9093 (port-forward the svc)
  - Check Status → Config for your Slack receiver
  - Validate the webhook URL & channel
  - Confirm the alert is Firing in Prometheus
- CrashLoopBackOff for exporters
  - `kubectl logs` and `kubectl describe` to check permissions / resources
  - Node exporter is a DaemonSet; ensure tolerations if needed on special nodes
Section 12 — Cleanup (leave cluster clean)
```bash
helm uninstall monitoring -n monitoring
kubectl delete namespace monitoring
kubectl delete pod load stress --ignore-not-found
kubectl delete -f prom-app.yaml --ignore-not-found
```
Section 13 — Interview crib notes (quick, crisp)
- What is Prometheus? Time-series monitoring: scrapes metrics, stores locally, queried via PromQL; pull model; integrates with Alertmanager.
- Key exporters: node-exporter (host metrics), kube-state-metrics (K8s objects), cAdvisor (containers), blackbox (HTTP/TCP/ICMP probes).
- K8s deployment: Helm chart kube-prometheus-stack (bundles Prometheus, Alertmanager, Grafana, rules, dashboards).
- Grafana: dashboards & alerts; Prometheus as data source.
- Alerting: write Prometheus rules; Alertmanager routes to Slack/Email/PagerDuty with grouping & inhibition.
- PromQL essentials: `rate()`, `sum by (...)`, `avg by (...)`, filters with `{label="value"}`, range selectors `[5m]`.
Section 14 — What students should be able to do (outcomes)
- Explain Prometheus architecture and pull model
- Install kube-prometheus-stack and verify exporters
- Navigate Prometheus UI; write basic PromQL
- Import Grafana dashboards and interpret panels
- Create a simple alert and route it to Slack
- Instrument a toy app with /metrics and see it in Prometheus