AttractivePenguin

Kubernetes Observability in 30 Minutes: Prometheus, Grafana, and Custom Alerts That Actually Tell You Something

You deployed your app to Kubernetes. Pods are running. Services are up. Everything looks green. But then someone asks: "How do you know it's actually healthy?"

That question hits different when your only answer is kubectl get pods. Running containers ≠ working application. Real observability means you can answer: Are requests succeeding? Is the database slow? Is the cache doing anything? When something breaks at 3 AM, what tells you before your users do?

This tutorial walks through setting up production-grade Kubernetes observability from scratch — Prometheus for metrics collection, Grafana for dashboards, and Alertmanager for notifications — all deployed via GitOps. By the end, you'll have a system that doesn't just show green bars but tells you meaningful stories about your application's health.

Why Observability Matters (And Why Dashboards Aren't Enough)

Most Kubernetes tutorials stop at deployment. They show you how to get containers running but skip the part where you actually understand what's happening inside them. CloudWatch shows node metrics. kubectl top shows resource usage. Neither tells you that your cache hit rate dropped to 20% or that database inserts are taking 3x longer than usual.

The difference between monitoring and observability is the difference between a smoke alarm and a dashboard that tells you which room is on fire, how fast it's spreading, and whether the sprinklers are working. You need:

  • Application metrics — request rates, error rates, latencies, cache performance
  • Infrastructure metrics — CPU, memory, disk, network per pod and node
  • Correlation — the ability to see that latency spikes when the cache miss rate climbs
  • Alerting — proactive notification when things go wrong, not just dashboards you remember to check

Prometheus + Grafana + Alertmanager gives you all four. Let's build it.

Prerequisites

  • A running Kubernetes cluster (EKS, GKE, AKS, or minikube — any works)
  • kubectl configured and pointing at your cluster
  • ArgoCD installed (for GitOps deployment) or willingness to apply manifests directly
  • A Node.js application to instrument (I'll show the pattern; adapt it to your stack)
  • Basic familiarity with Kubernetes resources (Deployments, Services, ConfigMaps)

Step 1: Instrument Your Application

Before Prometheus can scrape anything, your app needs to emit metrics. For Node.js, the prom-client library makes this straightforward.

Install the dependency:

npm install prom-client

Create a metrics module (src/metrics.js):

const client = require('prom-client');
const register = new client.Registry();

// Default metrics (GC, event loop, memory, etc.)
client.collectDefaultMetrics({ register });

// HTTP request duration histogram
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
  registers: [register],
});

// Database query duration histogram
const dbQueryDuration = new client.Histogram({
  name: 'db_query_duration_seconds',
  help: 'Duration of database queries in seconds',
  labelNames: ['operation', 'table'],
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
  registers: [register],
});

// Cache operations counter
const cacheOperations = new client.Counter({
  name: 'cache_operations_total',
  help: 'Total cache operations',
  labelNames: ['operation', 'result'],
  registers: [register],
});

// Active database connections gauge
const dbConnectionsActive = new client.Gauge({
  name: 'db_connections_active',
  help: 'Number of active database connections',
  registers: [register],
});

module.exports = {
  register,
  httpRequestDuration,
  dbQueryDuration,
  cacheOperations,
  dbConnectionsActive,
};

Add HTTP middleware to track every request (src/middleware.js):

const { httpRequestDuration } = require('./metrics');

function metricsMiddleware(req, res, next) {
  // Exclude the /metrics endpoint itself from tracking
  if (req.path === '/metrics') return next();

  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({
      method: req.method,
      route: req.route?.path || req.path,
      status_code: res.statusCode,
    });
  });
  next();
}

module.exports = { metricsMiddleware };

Track database and cache operations wherever they occur:

const { dbQueryDuration, cacheOperations, dbConnectionsActive } = require('./metrics');

// Database wrapper
async function query(operation, table, fn) {
  const end = dbQueryDuration.startTimer({ operation, table });
  try {
    const result = await fn();
    return result;
  } finally {
    end();
  }
}

// Cache tracking (assumes an existing `redis` client, e.g. ioredis)
async function cacheGet(key) {
  const result = await redis.get(key);
  cacheOperations.inc({
    operation: 'get',
    result: result !== null ? 'hit' : 'miss',
  });
  return result;
}

async function cacheSet(key, value) {
  await redis.set(key, value);
  cacheOperations.inc({ operation: 'set', result: 'success' });
}
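The wrapper pattern generalizes: time the operation, record a labeled observation, and let errors propagate. Here's a dependency-free sketch of the same idea, using process.hrtime.bigint() and a plain array in place of prom-client's Histogram (the names `timed` and `observations` are illustrative, not part of prom-client):

```javascript
// Stand-in for a Histogram: each entry is one labeled observation in seconds.
const observations = [];

// Time an async operation and record it, even when fn throws.
async function timed(labels, fn) {
  const start = process.hrtime.bigint();
  try {
    return await fn();
  } finally {
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    observations.push({ ...labels, seconds });
  }
}

// Usage: wrap any async DB or cache call.
timed({ operation: 'select', table: 'users' }, async () => [{ id: 1 }])
  .then((rows) => console.log(`rows=${rows.length}`));
```

The finally block is the important part: the observation is recorded whether the call succeeds or throws, so slow failing queries still show up in your latency histograms.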

Expose the metrics endpoint (src/server.js):

const express = require('express');
const { register } = require('./metrics');
const { metricsMiddleware } = require('./middleware');

const app = express();
app.use(metricsMiddleware);

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
app.listen(3000);

Hit http://localhost:3000/metrics and you should see ~100 lines of Prometheus-format metrics: counters, histograms, gauges, all labeled and ready to scrape.

Step 2: Deploy the Observability Stack

The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Alertmanager, Node Exporter, and kube-state-metrics — everything you need in one deploy.

If you're using ArgoCD, create an Application manifest (monitoring/argocd-app.yaml):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: "58.0.0"
    helm:
      values: |
        grafana:
          service:
            type: LoadBalancer
          adminPassword: admin
        prometheus:
          prometheusSpec:
            retention: 7d
            resources:
              requests:
                memory: 512Mi
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Or install directly with Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
kubectl create namespace monitoring
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.service.type=LoadBalancer \
  --set grafana.adminPassword=admin \
  --set prometheus.prometheusSpec.retention=7d

Wait 2-3 minutes for all pods to come up:

kubectl get pods -n monitoring -w

You should see: prometheus-monitoring-0, monitoring-grafana-..., alertmanager-monitoring-0, and several exporters.

Step 3: Connect Prometheus to Your Application

This is where most people get stuck. The instinct is to use additionalScrapeConfigs in the Helm values. Don't. The correct approach is a ServiceMonitor — a CRD that the Prometheus Operator watches to discover scrape targets automatically.

First, make sure your application's Service has a named port:

apiVersion: v1
kind: Service
metadata:
  name: gitops-api
  namespace: three-tier
  labels:
    app: gitops-api
spec:
  selector:
    app: gitops-api
  ports:
    - name: http          # This name matters!
      port: 80
      targetPort: 3000

Then create the ServiceMonitor (monitoring/servicemonitor.yaml):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gitops-api
  namespace: monitoring
  labels:
    release: monitoring   # Must match your Helm release label
spec:
  selector:
    matchLabels:
      app: gitops-api
  namespaceSelector:
    matchNames:
      - three-tier
  endpoints:
    - port: http
      path: /metrics
      interval: 15s

The release: monitoring label is critical — the Prometheus Operator uses it to discover ServiceMonitors. If this label doesn't match your Helm release name, Prometheus silently ignores your ServiceMonitor and you'll spend hours debugging why targets aren't showing up.

Apply it:

kubectl apply -f monitoring/servicemonitor.yaml

Verify the target is discovered in Prometheus: open the Prometheus UI (kubectl port-forward svc/prometheus-operated 9090 -n monitoring) and navigate to Status → Targets. You should see your application endpoint with state "UP".
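Once the target shows UP, a couple of quick queries in the Prometheus expression bar confirm data is actually flowing (label values depend on your Service and namespace names, so treat these as a template):

```promql
# Is the scrape target up? 1 = yes, 0 = scrape failing
up{namespace="three-tier"}

# Are application metrics arriving? Per-route request rate over 5 minutes
sum(rate(http_request_duration_seconds_count[5m])) by (route)
```

If `up` returns 1 but the second query is empty, the endpoint is reachable but your app isn't emitting the custom metrics — recheck Step 1.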

Step 4: Build Meaningful Dashboards

Skip the pretty-but-useless dashboards. Build ones that tell a story. Here's a layout that covers the three layers that matter:

Row 1 — HTTP Layer (Is the API serving traffic?)

{
  "title": "Request Rate",
  "type": "timeseries",
  "targets": [
    { "expr": "sum(rate(http_request_duration_seconds_count[5m])) by (route)", "legendFormat": "{{route}}" }
  ]
}

Add panels for Error Rate (rate(...{status_code=~"5.."}) / rate(...) as percentage) and P95 Latency (histogram_quantile(0.95, rate(..._bucket[5m])) by route).
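Spelled out in full (assuming the metric names from Step 1), those two panel queries look like this:

```promql
# Error rate: 5xx responses as a fraction of all requests
sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
  / sum(rate(http_request_duration_seconds_count[5m]))

# P95 latency per route
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
)
```

Note the `by (le, route)` in the quantile query — `le` must survive the aggregation or histogram_quantile has no buckets to work with.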

Row 2 — Data Layer (Is the backend keeping up?)

  • Requests by Status (pie chart: 200/201/404/503 distribution)
  • Cache Hit/Miss Ratio (pie chart: cache_operations_total{operation="get",result="hit"} vs miss)
  • DB Query Duration P95 (by operation type — inserts vs selects)
  • Active DB Connections (gauge, 0–10 scale, yellow at 6, red at 8)

Row 3 — Infrastructure (Do we need more resources?)

  • DB Queries per Second (insert and select rates)
  • Pod Memory Usage
  • Pod CPU Usage

Save the dashboard JSON as a ConfigMap and deploy it via ArgoCD so it's version-controlled and reproducible:

apiVersion: v1
kind: ConfigMap
metadata:
  name: app-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  app-dashboard.json: |
    {YOUR_DASHBOARD_JSON_HERE}

Grafana's sidecar automatically picks up ConfigMaps with the grafana_dashboard label and imports them.

Step 5: Set Up Alerts That Don't Cry Wolf

Seven custom alert rules across four categories — enough to catch real problems without paging you at 3 AM for a blip:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: api-health
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_request_duration_seconds_count{status_code=~"5.."}[5m]))
            / sum(rate(http_request_duration_seconds_count[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error rate above 5%"
            description: "API error rate is {{ $value | humanizePercentage }}"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route)
            ) > 2
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "P95 latency above 2s"
            description: "Route {{ $labels.route }} P95 latency is {{ $value }}s"

    - name: database
      rules:
        - alert: SlowDatabaseQueries
          expr: |
            histogram_quantile(0.95,
              sum(rate(db_query_duration_seconds_bucket[5m])) by (le, operation)
            ) > 0.5
          for: 10m
          labels:
            severity: warning

        - alert: ConnectionPoolExhaustion
          expr: db_connections_active > 8
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "DB connection pool nearing exhaustion"

    - name: cache
      rules:
        - alert: LowCacheHitRate
          expr: |
            sum(rate(cache_operations_total{operation="get",result="hit"}[5m]))
            / sum(rate(cache_operations_total{operation="get"}[5m])) < 0.5
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Cache hit rate below 50%"

    - name: pods
      rules:
        - alert: PodCrashLooping
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 5m
          labels:
            severity: critical

        - alert: PodMemoryPressure
          expr: |
            container_memory_working_set_bytes
            / container_spec_memory_limit_bytes > 0.85
          for: 10m
          labels:
            severity: warning

Key design principles for alerts:

  • Always use for: clauses. A 30-second spike shouldn't wake you up. for: 5m means the condition must persist for 5 minutes before alerting.
  • Thresholds should be meaningful. 5% error rate is objectively bad. 2-second P95 latency means real users are suffering. Don't alert on 1% error rates — you'll burn out on noise.
  • Group by severity. critical = page someone now. warning = investigate in the morning.

Step 6: Configure Alertmanager Routing

Alertmanager decides where alerts go. A basic config that routes critical alerts to Slack and warnings to email:

apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-monitoring-config
  namespace: monitoring
type: Opaque
stringData:
  alertmanager.yaml: |
    route:
      receiver: slack
      group_by: [alertname, severity]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - match:
            severity: critical
          receiver: slack-urgent
          repeat_interval: 1h
        - match:
            severity: warning
          receiver: email
    receivers:
      - name: slack
        slack_configs:
          - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
            channel: '#monitoring'
      - name: slack-urgent
        slack_configs:
          - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
            channel: '#incidents'
      - name: email
        email_configs:
          - to: oncall@yourcompany.com
            from: alertmanager@yourcompany.com
            smarthost: smtp.yourcompany.com:587

Real-World Scenarios

Scenario 1: The Silent Degradation

Your cache hit rate slowly drops from 85% to 40% over two hours. No pods crash. No 500 errors. But database query latency triples because every request now hits Postgres instead of Redis. The LowCacheHitRate alert catches this 15 minutes in, long before users notice.

Scenario 2: The Connection Leak

A new deploy introduces a database connection leak. Active connections climb from 3 to 8 over 10 minutes. The ConnectionPoolExhaustion alert fires at 8 connections, and you roll back before the pool hits 10 and the app becomes unresponsive.

Scenario 3: The Noisy Neighbor

Another team deploys a memory-hungry job on the same node. Your pod's memory pressure crosses 85%. The PodMemoryPressure warning gives you time to request a node with more capacity or move the workload before OOMKill hits.

FAQ / Troubleshooting

Q: Prometheus shows my target as "DOWN" or missing entirely.
A: Check three things: (1) Does your Service have a named port (name: http, not just port: 80)? ServiceMonitors reference ports by name. (2) Does your ServiceMonitor have the release label matching your Helm release? (3) Is the namespace in namespaceSelector.matchNames correct?

Q: My ServiceMonitor exists but metrics aren't appearing in Grafana.
A: Go to Prometheus UI → Status → Targets. If the target isn't listed, the ServiceMonitor isn't being picked up. Check label selectors. If it's listed but showing errors, the /metrics endpoint might not be reachable from the cluster network.

Q: Alerts aren't firing.
A: Check Prometheus UI → Alerts. Are the rules loaded? Is the expression evaluating? Test your PromQL directly in the query bar. Common mistake: metric names with typos or label mismatches.

Q: Dashboard shows "No data."
A: Verify the data source is configured (Grafana → Settings → Data Sources → Prometheus). Check that the namespace in your query matches where the metrics are. Use {namespace="three-tier"} to scope queries.

Q: How many resources does this stack need?
A: For a small cluster (< 20 nodes), budget: Prometheus 512Mi-1Gi RAM, Grafana 256Mi, Alertmanager 128Mi. The Helm chart defaults are reasonable for dev/test. Increase retention and memory for production.
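For production, those budgets translate into Helm values along these lines (the numbers below are starting points I'd suggest, not chart defaults — size them to your scrape volume and retention needs):

```yaml
prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi
```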

Conclusion

Observability isn't a nice-to-have — it's the difference between guessing and knowing. The combination of application-level metrics (prom-client), infrastructure metrics (node exporter, kube-state-metrics), powerful querying (PromQL), visualization (Grafana), and proactive alerting (Alertmanager) gives you a complete picture of your system's health.

The whole stack deploys via GitOps — push to main, ArgoCD syncs. No manual dashboard creation. No ad-hoc alert rules. Everything version-controlled, reproducible, and auditable.

Start with the metrics that answer "is it working?" — request rate, error rate, latency. Add depth from there: cache performance, database health, resource pressure. Let the alerts do the watching so you don't have to.

Your future self — the one getting paged at 3 AM — will thank you.
