Mehmet TURAÇ

Posted on Jun 10

Great Stack to Doesn't Work #7 — Observability: "400 Dashboards, Zero Insight"

#observability #devops #monitoring #discuss

A survival guide for when everything goes wrong in production.

You have Grafana. You have Prometheus. You have Loki. You have 400 dashboards, 2,300 alert rules, and a PagerDuty integration that fires so often the on-call engineer keeps the phone on silent.

Your observability stack is complete. You've never been more blind.

The problem isn't the tools. The problem is that you're measuring everything and understanding nothing.

Prometheus: Naming Conventions and the Cardinality Trap

Prometheus is a time-series database that scrapes metrics from your services. It's simple, powerful, and will fill your disk in 48 hours if you don't understand cardinality.

Naming conventions matter. A metric name should tell you what it measures without reading documentation.

Bad:

requests_total
db_time
errors

Good:

http_requests_total{method="GET", handler="/api/orders", status="200"}
database_query_duration_seconds{query_type="select", table="orders"}
http_errors_total{method="POST", handler="/api/checkout", error_type="timeout"}

The pattern: <namespace>_<name>_<unit>. Use _total for counters, _seconds for durations, _bytes for sizes. Include meaningful labels but keep them bounded.

Cardinality is the silent killer. Every unique combination of metric name + label values creates a separate time series. If you have a metric with labels {user_id, endpoint, status_code}, and you have 1 million users, 50 endpoints, and 10 status codes, the worst case is 1,000,000 × 50 × 10 = 500 million time series. Long before it gets there, Prometheus will slow down, consume enormous memory, and eventually crash.
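The arithmetic is just a product of per-label value counts, which makes it easy to check before a label ships. A minimal sketch using the numbers from the example; `worst_case_series` is a hypothetical helper, not part of any Prometheus tooling:

```python
from math import prod

def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Label counts taken from the example above (illustrative, not measured).
labels = {"user_id": 1_000_000, "endpoint": 50, "status_code": 10}
print(worst_case_series(labels))  # 500000000

# Dropping the unbounded label brings the bound back to something manageable.
bounded = {name: count for name, count in labels.items() if name != "user_id"}
print(worst_case_series(bounded))  # 500
```

Real series counts are usually lower than the bound because not every combination occurs, but the bound is what you plan capacity against.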

Rules:

Never use unbounded labels: user IDs, request IDs, email addresses, IP addresses. These create infinite cardinality.
Keep label values to a bounded set: HTTP methods (fewer than ten values), status code classes (5 values: 2xx, 3xx, 4xx, 5xx, unknown), service names (dozens, not thousands).
Use recording rules to pre-aggregate high-cardinality data into lower-cardinality summaries.

# Recording rule: pre-aggregate request rate by handler
groups:
  - name: aggregations
    rules:
      - record: handler:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (handler)

Recording rules compute and store the aggregation, so dashboards and alerts query the pre-computed result instead of scanning raw data.
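Going back to the bounded-label rule: one way to enforce it is to collapse raw values into a fixed set before they ever become label values. A minimal sketch for status codes; `status_class` is a hypothetical helper you would call in your own instrumentation code, not something a Prometheus client library provides:

```python
def status_class(code: int) -> str:
    """Map a raw HTTP status code to one of five bounded label values."""
    if 200 <= code < 300:
        return "2xx"
    if 300 <= code < 400:
        return "3xx"
    if 400 <= code < 500:
        return "4xx"
    if 500 <= code < 600:
        return "5xx"
    return "unknown"  # 1xx, malformed, or missing codes

print([status_class(c) for c in (200, 301, 404, 503, 0)])
# ['2xx', '3xx', '4xx', '5xx', 'unknown']
```

No matter what the application returns, the label can only ever take five values.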

The cardinality explosion story: A team added a trace_id label to their request duration metric "for debugging." Each request got a unique trace ID. Within 24 hours, Prometheus had 40 million active time series. Memory usage hit 60 GB. Queries that took 200ms started taking 45 seconds. The monitoring system designed to detect outages was itself causing an outage.

Fix: remove the label, restart Prometheus, wait for compaction. Investigation time: 4 hours. They'd added the label with a one-line change and no review.

Grafana: Fewer Dashboards, More Signal

Having 400 dashboards means nobody knows which one to look at during an incident. When the pager fires at 3 AM, the on-call engineer opens Grafana and faces a wall of dashboards. Which one shows the problem? They click through 5, then 10, then 15, and by the time they find the relevant graph, 20 minutes have passed.

The dashboard hierarchy:

Level 1: The Overview (1 dashboard per service). Red/green health status. Request rate, error rate, latency P50/P99, saturation (CPU, memory, connections). This is the dashboard the on-call engineer opens first. If something is red here, they drill down.

Level 2: The Drill-Down (3-5 dashboards per service). Database performance. Cache performance. Dependency health. Queue depth. These answer "where is the problem?" after Level 1 told you "there IS a problem."

Level 3: The Deep Dive (as many as needed, but rarely opened). Individual query performance. Per-endpoint latency breakdowns. GC statistics. Thread pool utilization. These exist for specific investigations, not routine monitoring.

A service with 3 levels needs about 8-10 dashboards total. A platform with 15 services needs 120-150 dashboards. If you have 400, you have dashboard sprawl — dashboards nobody owns, nobody updates, and nobody trusts.

The team that cut 400 to 35: They audited every dashboard. For each: Who created it? When was it last viewed? Does it answer a question that another dashboard already answers? 280 dashboards hadn't been viewed in 6 months. 85 were duplicates or near-duplicates. They deleted them all, reorganized the remaining into the three-level hierarchy, and the on-call team's mean time to detection dropped by 40%. Not because the monitoring improved — the tools were identical. The signal-to-noise ratio improved.
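An audit like this is mostly bookkeeping, so it scripts well. The sketch below covers two of the three questions (last viewed, duplicates) and assumes you have already exported dashboard metadata somehow; the `Dashboard` fields, the 180-day cutoff, and the sample data are all hypothetical:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Dashboard:
    title: str
    last_viewed: date
    panel_queries: frozenset[str]  # the set of queries the dashboard runs

def audit(dashboards: list[Dashboard], today: date, stale_days: int = 180) -> dict[str, list[str]]:
    """Split dashboards into stale, duplicate, and keep buckets."""
    result: dict[str, list[str]] = {"stale": [], "duplicate": [], "keep": []}
    seen: set[frozenset[str]] = set()
    # Visit most recently viewed first so the actively used copy is the one kept.
    for d in sorted(dashboards, key=lambda d: d.last_viewed, reverse=True):
        if today - d.last_viewed > timedelta(days=stale_days):
            result["stale"].append(d.title)
        elif d.panel_queries in seen:
            result["duplicate"].append(d.title)
        else:
            seen.add(d.panel_queries)
            result["keep"].append(d.title)
    return result

today = date(2026, 6, 10)
boards = [
    Dashboard("orders-overview", date(2026, 6, 9), frozenset({"rate", "errors", "p99"})),
    Dashboard("orders-overview-copy", date(2026, 5, 1), frozenset({"rate", "errors", "p99"})),
    Dashboard("old-migration", date(2025, 9, 1), frozenset({"migration_rows"})),
]
print(audit(boards, today))
# {'stale': ['old-migration'], 'duplicate': ['orders-overview-copy'], 'keep': ['orders-overview']}
```

Exact-match on queries only catches true duplicates; near-duplicates still need a human to look.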

Loki: Log Aggregation Done Right

Loki is "like Prometheus, but for logs." It indexes metadata (labels) and stores log content as compressed chunks. This makes it cheap to store and fast to query by label, but slow to query by full-text content.

Structured logging is non-negotiable. Consider a log line like this:

2026-06-25 14:23:01 ERROR Failed to process order 12345 for user john@example.com: connection timeout

Parsing this requires regex. Regex breaks when someone changes the log format. Now multiply this by 50 services, each with slightly different log formats.

Structured logging:

{
  "timestamp": "2026-06-25T14:23:01Z",
  "level": "error",
  "service": "order-processor",
  "message": "Failed to process order",
  "order_id": 12345,
  "error_type": "connection_timeout",
  "downstream_service": "payment-gateway",
  "duration_ms": 5023
}

Every field is queryable. In LogQL (Loki's query language):

{service="order-processor"} | json | error_type="connection_timeout" | duration_ms > 5000

This finds all connection timeouts in the order processor that took over 5 seconds. No regex. No guessing. Structured data, structured queries.
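Producing logs in that shape doesn't require a special library. Here is a minimal sketch using only Python's standard `logging` module; the `JsonFormatter` class and the convention of passing fields under `extra={"fields": ...}` are assumptions for illustration, not a standard:

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": "order-processor",  # assumed service name, from the example above
            "message": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes a top-level key.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("order-processor")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Failed to process order",
    extra={"fields": {
        "order_id": 12345,
        "error_type": "connection_timeout",
        "downstream_service": "payment-gateway",
        "duration_ms": 5023,
    }},
)
```

The message stays constant and the variable parts move into fields, which is what makes the LogQL filter above possible.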

Log levels matter. Use them consistently:

ERROR: something broke and needs attention. Don't use this for expected failures like 404s.
WARN: something is unusual but the system handled it. Connection retry succeeded. Cache miss fell through to database.
INFO: significant business events. Order placed. User signed up. Payment processed. Keep these sparse.
DEBUG: internal state useful for development. Never enable in production unless actively investigating an issue, and turn it off when done.

If your production logs are 90% DEBUG-level noise, you're paying for storage and making it harder to find the signal.

Alert Fatigue: When Everything Is Critical, Nothing Is

Alert fatigue is the #1 operational risk that nobody measures. When on-call engineers receive 50 alerts per shift, they develop coping mechanisms: ignore, mute, snooze. When alert #51 is a real outage, it gets the same treatment.

The symptoms:

On-call acknowledges alerts without investigating.
Alerts are silenced "temporarily" and never unsilenced.
Engineers say "oh, that alert always fires, just ignore it."
Mean time to respond (MTTR) increases over time even though the tools improve.

The fix: alert on symptoms, not causes.

Bad alert: "CPU usage > 80% for 5 minutes." CPU at 80% is a cause. What's the symptom? Maybe nothing. Maybe the application handles it fine. Maybe latency is still within SLA.

Good alert: "P99 latency > 500ms for 5 minutes." This is a symptom users experience. It doesn't matter whether the cause is CPU, memory, a slow query, or a downstream service. The user is impacted.
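In a real setup this condition would normally live in an alerting rule rather than application code, but the logic is worth spelling out. A minimal sketch, assuming you have raw latency samples grouped into one-minute windows; the function names, the nearest-rank percentile method, and the sample data are illustrative choices:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def should_page(latencies_per_minute: list[list[float]],
                threshold_ms: float = 500, minutes: int = 5) -> bool:
    """Fire only if P99 exceeded the threshold in each of the last `minutes` windows."""
    recent = latencies_per_minute[-minutes:]
    if len(recent) < minutes:
        return False  # not enough history yet
    return all(bool(w) and percentile(w, 99) > threshold_ms for w in recent)

healthy = [100.0] * 99 + [900.0]    # one slow request in a hundred
degraded = [100.0] * 95 + [900.0] * 5  # five slow requests in a hundred

print(should_page([degraded] * 5))              # True
print(should_page([degraded] * 4 + [healthy]))  # False
```

The "for 5 minutes" part matters as much as the threshold: a single bad minute doesn't wake anyone up.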

Alert classification:

Page (wake someone up): User-facing impact. Error rate > 1%. Latency P99 > SLA. Service completely down. Payment failures.

Ticket (handle during business hours): Disk usage > 80%. Certificate expires in 14 days. Consumer lag growing. These are important but not urgent.

Dashboard only (no notification): CPU spikes. GC pauses. Connection pool utilization. These are diagnostic data, not actionable alerts. They belong on dashboards, not in PagerDuty.
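The three tiers boil down to two questions: are users affected, and does someone need to do something? A minimal sketch of that decision; the `Route` enum, the `route_alert` function, and the two boolean inputs are hypothetical simplifications of the classification above:

```python
from enum import Enum

class Route(Enum):
    PAGE = "page"
    TICKET = "ticket"
    DASHBOARD = "dashboard"

def route_alert(user_facing: bool, needs_action: bool) -> Route:
    """User impact pages; actionable but not urgent becomes a ticket; the rest stays on dashboards."""
    if user_facing:
        return Route.PAGE
    if needs_action:
        return Route.TICKET
    return Route.DASHBOARD

examples = {
    "error rate > 1%": (True, True),
    "certificate expires in 14 days": (False, True),
    "CPU spike": (False, False),
}
for name, flags in examples.items():
    print(name, "->", route_alert(*flags).value)
```

Running every existing alert rule through those two questions is a quick way to find the ones that should never have been paging.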

One team reduced their alerts from 2,300 to 180 using this classification. Pages dropped from 50 per week to 8. Every page was actionable. MTTR dropped from 25 minutes to 8 minutes because engineers trusted the alerts again.

Retention: How Long to Keep What

Metrics and logs are expensive to store. Infinite retention sounds nice until you see the storage bill.

Metrics retention strategy:

Raw metrics (full resolution): 15-30 days. This is what you query during active incidents and recent investigations.
Downsampled metrics (5-minute averages): 6-12 months. Good enough for trend analysis and capacity planning.
Aggregated metrics (hourly/daily): 2+ years. Business reporting and year-over-year comparisons.

Prometheus itself isn't great at long-term storage. Put a long-term store such as Thanos or Cortex behind it; Thanos, for example, can downsample older data to 5-minute and 1-hour resolution.
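To make "downsampled" concrete, here is a minimal sketch of averaging raw samples into 5-minute buckets. It illustrates the idea only and is not how any particular long-term store implements it; the `downsample` function and the sample data are hypothetical:

```python
from collections import defaultdict
from statistics import fmean

def downsample(points: list[tuple[int, float]], bucket_seconds: int = 300) -> list[tuple[int, float]]:
    """Average (unix_ts, value) samples into fixed-width buckets keyed by bucket start."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]

# Hypothetical 15-second scrapes over ten minutes: 40 raw samples become 2.
raw = [(t, float(t // 300)) for t in range(0, 600, 15)]
print(len(raw), "->", downsample(raw))  # 40 -> [(0, 0.0), (300, 1.0)]
```

Averages alone hide spikes, so keeping min and max per bucket alongside the mean is a common refinement.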

Log retention strategy:

Hot logs (Loki, Elasticsearch): 14-30 days. Searchable, fast.
Cold logs (S3, GCS): 90 days to 1 year. Archived, slower to query, much cheaper.
Beyond 1 year: only keep if compliance requires it.

The rule: keep what you'll actually query. If nobody has queried 90-day-old metrics in the past year, retaining them that long is wasted money.

OpenTelemetry: The Convergence

Before OpenTelemetry, metrics came from Prometheus client libraries, traces came from Jaeger or Zipkin SDKs, and logs came from whatever logging library your language uses. Three separate instrumentation systems. Three sets of libraries. Three ways to correlate data (or not).

OpenTelemetry (OTel) unifies all three. One SDK. One collector. One set of semantic conventions.

Application → OTel SDK → OTel Collector → {Prometheus, Jaeger, Loki}

The value isn't in the collector — it's in correlation. When a trace, a metric, and a log share the same trace ID, you can click from a spike on a Grafana dashboard to the exact trace that caused it, to the exact log line where the error occurred.

Without correlation, debugging is: "I see an error spike at 14:23. Let me search logs around 14:23 for errors. Here are 500 errors. Which one caused the spike?" With correlation: "I see an error spike at 14:23. Here's the exemplar trace. Here's the failing span. Here's the log line."
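The mechanics of that last hop are simple once every log record carries the trace ID. A minimal sketch using plain JSON log lines; the `trace_id` field name, the `logs_for_trace` helper, and the sample IDs are assumptions for illustration (a real OTel setup would inject the trace context for you):

```python
import json

def logs_for_trace(log_lines: list[str], trace_id: str) -> list[dict]:
    """Return parsed JSON log records that carry the given trace ID."""
    matches = []
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip unstructured lines
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matches.append(record)
    return matches

lines = [
    '{"level": "info", "message": "Order placed", "trace_id": "aaa111"}',
    '{"level": "error", "message": "Failed to process order", "trace_id": "bbb222"}',
    "plain text line without structure",
]
# Start from the exemplar trace attached to the spike, not from a time range.
for record in logs_for_trace(lines, "bbb222"):
    print(record["level"], record["message"])  # error Failed to process order
```

The search key changes from "around 14:23" to one exact ID, which is why 500 candidate errors collapse to the one that matters.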

OTel adoption in 2026 is at the point where if you're starting a new project and NOT using it, you need a reason.

SLI/SLO/SLA: Error Budgets in Practice

SLI (Service Level Indicator): The metric you measure. "Percentage of requests completed in under 300ms."

SLO (Service Level Objective): The target you set internally. "99.9% of requests will complete in under 300ms over a rolling 30-day window."

SLA (Service Level Agreement): The contractual promise to customers. Usually looser than the SLO. "99.5% availability."

Error budget: The difference between 100% and your SLO. If your SLO is 99.9%, your error budget is 0.1% — roughly 43 minutes of downtime per month.
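The 43-minute figure comes from multiplying the budget fraction by the minutes in a 30-day window. A minimal sketch, treating the budget as full downtime; `error_budget_minutes` is a hypothetical helper:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime allowed by an availability SLO over the window."""
    return (1 - slo) * window_days * 24 * 60

print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.995), 1))   # 216.0
print(round(error_budget_minutes(0.9999), 1))  # 4.3
```

Each extra nine cuts the budget by a factor of ten, which is worth seeing in minutes before anyone commits to a target.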

The power of error budgets is in the decisions they enable:

Error budget remaining > 50%: Deploy freely. Experiment. Take risks. You can afford failures.
Error budget remaining 10-50%: Proceed carefully. Canary deployments. Smaller batches.
Error budget remaining < 10% (or exhausted): Freeze feature deployments. Focus entirely on reliability. No new features until the error budget regenerates.

This replaces subjective arguments ("I think we should slow down") with data-driven decisions ("our error budget is at 8%, we're freezing deploys until it recovers").
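A policy like this is easy to encode so the answer comes from a function instead of a meeting. A minimal sketch, treating anything below 10% as the freeze tier so it matches the 8% example; the function names and the request counts are hypothetical:

```python
def remaining_fraction(allowed_bad: float, observed_bad: float) -> float:
    """Fraction of the error budget left; negative means the budget is overspent."""
    return 1 - observed_bad / allowed_bad

def deploy_policy(budget_remaining: float) -> str:
    """Map remaining error budget (as a fraction, 1.0 = untouched) to a deployment stance."""
    if budget_remaining > 0.5:
        return "deploy freely"
    if budget_remaining >= 0.1:
        return "proceed carefully"
    return "freeze feature deploys"

# Hypothetical month: a 99.9% SLO over 1,000,000 requests allows 1,000 bad ones.
left = remaining_fraction(allowed_bad=1_000, observed_bad=920)
print(f"{left:.0%} left -> {deploy_policy(left)}")  # 8% left -> freeze feature deploys
```

The thresholds are a negotiation between product and engineering; the point is that they are agreed on before the budget runs out.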

The hardest part isn't the math. It's getting product and engineering leadership to agree that when the error budget is gone, reliability takes priority over features. The teams that actually enforce this have significantly fewer incidents than the ones that treat SLOs as aspirational.

War Story: The Alert That Cried Wolf

An e-commerce platform. 180 alert rules. On-call rotation of 6 engineers. Average: 12 pages per day. Most pages were "CPU > 80%" or "memory > 85%" on one of 40 servers. Engineers would check, see that request latency was normal, and dismiss.

On a Tuesday, at 14:15, the same "CPU > 80%" alert fired on 3 servers simultaneously. The on-call engineer dismissed it — same alert, same as always. At 14:25, the first customer complaints arrived. At 14:32, the error rate hit 15%. The incident lasted 47 minutes.

Root cause: a downstream API changed its response format. The deserialization code entered a retry loop that consumed CPU. The "CPU > 80%" alert was technically correct: this time, high CPU was a real side effect of the failure. But because that alert fired constantly for benign reasons, nobody investigated.

After the incident:

Deleted all CPU and memory threshold alerts.
Created symptom-based alerts: error rate, latency, throughput deviation from baseline.
Moved infrastructure metrics to dashboards only — visible during investigation, never paging.
Result: pages dropped from 12 per day to 2, and both were actionable. MTTR improved from 35 minutes to 11 minutes.

The monitoring stack didn't change. Not a single tool was added or removed. The change was philosophical: stop alerting on infrastructure metrics, start alerting on user impact.

Key Takeaways

Observability is not a stack. It's a practice. The tools are a prerequisite, not the solution.

Fewer dashboards, but the right dashboards. Fewer alerts, but alerts that mean something. Structured logs that can be queried, not free-form strings that need regex.

Cardinality will destroy your Prometheus if you don't think about it before adding labels. Recording rules are not optional — they're how you keep queries fast.

And if your on-call engineer has learned to ignore your alerts, your monitoring is worse than useless — it's actively harmful because it creates a false sense of being monitored.

Over to You

How many dashboards does your team actually use? Have you experienced alert fatigue — and how did you fix it?

If you enjoyed this, I write about production engineering, AI systems, and the messy reality of building software at scale.

This is part of the Great Stack to Doesn't Work series, a survival guide for when everything goes wrong in production. Follow the series to catch every episode.
