After building three projects: a CI/CD pipeline, a 3-tier architecture, and GitOps on EKS, I had one obvious gap: observability. I could deploy things, but I couldn't answer "is it healthy?" beyond checking if pods were running.
"How do you monitor your services?" is an interview question I wasn't ready for. I'd used Grafana dashboards other people built. I'd looked at CloudWatch metrics someone else configured. But I'd never instrumented an application, written PromQL queries, or set up alerting rules from scratch.
So I did all of it.
What I Built
I took the GitOps EKS project from last week and added a complete observability layer:
- Instrumented the Node.js API with prom-client: 7 custom metrics covering HTTP requests, database queries, cache operations, and connection pools
- Deployed kube-prometheus-stack (Prometheus, Grafana, Alertmanager, Node Exporter, kube-state-metrics) via ArgoCD
- Built a 9-panel Grafana dashboard showing request rate, error rate, latency, cache hit/miss ratio, DB query performance, and pod resources
- Wrote 9 custom alert rules across API health, database performance, cache efficiency, and pod stability
Everything is deployed via GitOps: push to main, and ArgoCD syncs the monitoring stack and custom configs.
Instrumenting the Application
The first step was making the app emit metrics. I added prom-client and created a metrics module with:
HTTP middleware that wraps every request, tracking method, route, status code, and duration. The /metrics endpoint itself is excluded so Prometheus scraping doesn't inflate the numbers.
Database helpers that time every query and track success/failure by operation type (select, insert, delete). This means I can see not just "is the database slow?" but "are inserts slower than selects?"
Cache tracking on every Redis operation: get (hit or miss), set, and invalidate. This shows whether the caching layer is actually working or whether every request hits the database.
Connection pool gauge that samples active database connections every 5 seconds. When this approaches the pool limit (10), something is holding connections open.
The /metrics endpoint exposes everything in Prometheus text format. Hit it once and you get counters, histograms, and gauges: roughly 100 lines of metrics per scrape.
ServiceMonitor: The Right Way to Scrape
My first attempt used additionalScrapeConfigs in the Prometheus values, a raw scrape config injected into the Prometheus config. It didn't work. The operator didn't pick it up, and debugging why was a dead end.
The correct approach is a ServiceMonitor — a Kubernetes CRD that tells the Prometheus operator what to scrape. It uses label selectors to find Services and endpoints automatically. Mine looks for any Service with app: gitops-api in the three-tier namespace, scrapes port http on path /metrics every 15 seconds.
One detail that took me longer than I'd like to admit: the Service needs a named port. Not just port: 80 but name: http, port: 80. The ServiceMonitor references the port by name, and without it, Prometheus silently ignores the target.
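Putting both halves together, here is an illustrative ServiceMonitor and the matching Service. The app label, namespace, port name, path, and interval follow the post; metadata names, the release label, and the targetPort are assumptions:

```yaml
# Sketch only: the operator discovers this via its own label selector,
# commonly release: kube-prometheus-stack for a Helm install.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gitops-api
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    matchNames: [three-tier]
  selector:
    matchLabels:
      app: gitops-api
  endpoints:
    - port: http        # refers to the Service port *by name*
      path: /metrics
      interval: 15s
---
apiVersion: v1
kind: Service
metadata:
  name: gitops-api
  namespace: three-tier
  labels:
    app: gitops-api
spec:
  selector:
    app: gitops-api
  ports:
    - name: http        # without this name, Prometheus silently skips the target
      port: 80
      targetPort: 3000  # assumed container port
```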
The Dashboard
I built the dashboard as a JSON ConfigMap deployed via ArgoCD. Nine panels in three rows:
Row 1: HTTP layer:
Request Rate (per route), Error Rate (percentage of 5xx responses), P95 Latency (per route). These tell you if the API is serving traffic, how much is failing, and how fast.
Row 2: Data layer:
Requests by Status (pie chart showing 200/201/404/503 distribution), Cache Hit/Miss (pie chart — green means Redis is working), DB Query Duration (p95 for inserts vs selects), DB Active Connections (gauge per pod, 0-10 scale with yellow at 6, red at 8).
Row 3: Infrastructure:
DB Queries per Second (insert and select rates), Pod Memory Usage, Pod CPU Usage. These show whether the workload needs more resources.
The moment all 9 panels lit up with real data was genuinely satisfying. The error rate panel showed a real 503 spike from when PostgreSQL was still starting; that's not test data, that's actual system behavior captured in metrics.
The PromQL Behind Each Panel
For anyone building their own dashboard, here are the exact queries:
Request Rate: requests per second, broken down by route:

```
sum(rate(http_requests_total[5m])) by (route)
```

Error Rate: percentage of 5xx responses:

```
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100
```

P95 Latency: 95th percentile response time per route:

```
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))
```

Cache Hit/Miss: Redis cache effectiveness over the last hour:

```
sum(increase(cache_operations_total{operation="get"}[1h])) by (result)
```

DB Query Duration (p95): insert vs select latency:

```
histogram_quantile(0.95,
  sum(rate(db_query_duration_seconds_bucket[5m])) by (le, operation))
```

DB Queries per Second: operation throughput:

```
sum(rate(db_queries_total[5m])) by (operation)
```

Pod Memory: working set per container:

```
container_memory_working_set_bytes{namespace="three-tier", container!=""}
```

Pod CPU: usage rate per pod:

```
sum(rate(container_cpu_usage_seconds_total{namespace="three-tier", container!=""}[5m])) by (pod)
```
The key thing to understand: rate() calculates a per-second average over a window, while increase() gives you the total count over a window. Use rate() for time-series graphs and increase() for pie charts and totals. And histogram_quantile() is how you get percentiles from histogram buckets; you can't just average latency and get useful numbers.
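To make the rate()/increase() distinction concrete, this is the core arithmetic both functions perform on two samples of a counter (Prometheus additionally extrapolates at the window boundaries, which this sketch ignores):

```javascript
// Two samples of a monotonic counter within a 5-minute (300 s) window.
const earlier = { t: 0,   value: 100 };
const later   = { t: 300, value: 400 };

// increase(): total growth of the counter over the window.
const increase = later.value - earlier.value;   // 300 requests

// rate(): per-second average over the same window.
const rate = increase / (later.t - earlier.t);  // 1 req/s
```

This is also why both functions handle counter resets: if the counter restarts at zero mid-window, Prometheus compensates rather than reporting a negative increase.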
Alert Rules
I wrote 9 PrometheusRules covering the scenarios I'd actually want to be woken up for:
API Health: Error rate above 5% (critical), P95 latency above 1 second (warning), metrics endpoint unreachable (critical). The error rate alert uses a 2-minute for duration so a single failed request doesn't trigger it.
Database: Query error rate above 1% (critical), P95 query time above 500ms (warning), connection pool above 8/10 active (warning). The connection pool alert is the early warning — if you're at 80% capacity, the next traffic spike will exhaust it.
Cache: Miss rate above 80% for 10 minutes (warning). A high miss rate means either Redis is down, the cache TTL is too short, or the data is never being cached. The 10-minute window avoids alerting during cold starts.
Pods: Crash-looping (more than 3 restarts in 15 minutes), memory above 85% of limit. Crash loops are critical because they mean the service is fundamentally broken. Memory warnings give you time to increase limits before OOMKills start.
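As one concrete example, the error-rate alert above can be sketched as a PrometheusRule. The expression and the 2-minute for duration follow the post; the group name, alert name, and labels are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-health
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed operator selector label
spec:
  groups:
    - name: api-health
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) * 100 > 5
          for: 2m   # a single failed request won't trigger it
          labels:
            severity: critical
          annotations:
            summary: API 5xx error rate above 5% for 2 minutes
```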
The Debugging That Taught Me the Most
The OIDC mismatch. I reused the EKS Terraform from Project 3, but the EBS CSI driver's IAM role still had the old cluster's OIDC provider URL. Every AssumeRoleWithWebIdentity call failed with AccessDenied, but the error doesn't say "wrong OIDC provider"; it just says "not authorized." I had to compare the role's trust policy against the current cluster's OIDC issuer to find the mismatch.
The empty service account annotation. After reinstalling the AWS Load Balancer Controller with Helm, the service account had eks.amazonaws.io/role-arn: "", an empty string instead of the actual ARN. The controller fell back to the node role, which didn't have ELB permissions. A kubectl annotate --overwrite fixed it, but I only found it by checking the SA's YAML directly.
ServiceMonitor vs additionalScrapeConfigs. I spent time trying to make additionalScrapeConfigs work before learning that the Prometheus operator intentionally manages config through CRDs. ServiceMonitor is the right abstraction: it's declarative, it uses label selectors, and the operator reconciles it automatically.
Why This Matters for Interviews
Before this project, my monitoring answer was "we used Grafana and Prometheus." Now I can explain:
- How to instrument an application with counters (requests), histograms (latency), and gauges (connections)
- Why rate() over increase() for dashboards, and why histogram_quantile() for latency percentiles
- How ServiceMonitors work with the Prometheus operator for service discovery
- What alerts are worth setting up and why each threshold was chosen
- How to deploy and manage a monitoring stack via GitOps
When an interviewer asks "how would you know if your service is having issues?", I have a 9-panel dashboard and 9 alert rules to point to.
Links
- GitHub repo: gitops-eks (includes observability stack)
- Project 3 (GitOps on EKS): Blog
- Project 2 (3-Tier Architecture): Blog
- Project 1 (CI/CD Pipeline): Blog
- Portfolio: augusthottie.com
Building my DevOps portfolio ahead of the AWS DevOps Professional certification. Connect on LinkedIn, I'd love to hear what you're building.